Software transactional memory offers an appealing alternative to locks by improving programmability, reliability, and scalability. However, existing STMs are impractical because they add high instrumentation costs and often provide weak progress guarantees and/or semantics.
While scientific programs have been parallel for decades, general-purpose software must become more parallel to scale with successive hardware generations that provide more cores instead of faster ones. However, it is notoriously challenging to write lock-based, shared-memory parallel programs that are correct and scalable.
An appealing alternative to lock-based synchronization is transactional memory (TM) [25, 31]. In the TM model, programs specify atomic regions of code, which the system executes speculatively as transactions. To ensure serializability, the system detects conflicting transactions, rolls back their state, and re-executes them.
TM is not a panacea. It does not help if atomicity is specified incorrectly or too conservatively; it does not help with specifying ordering constraints; and it does not handle irrevocable operations such as I/O well. However, TM has significant potential to improve productivity, reliability, and scalability by allowing programmers to specify atomicity with the ease of coarse-grained locks while providing the scalability of fine-grained locks [42]. TM also enables runtime system support, e.g., for speculative optimization [40].
Despite these potential benefits, TM is not widely used. Recent HTM support is limited, still relying on efficient software TM (STM) support (Section 2.1). Existing STMs are impractical because they add high overhead, which makes it hard to achieve good performance even if the STM scales well, and because they often provide weak guarantees. These drawbacks have led some researchers to question the viability of STM and call it a "research toy" [11, 20, 59].
This paper introduces a novel STM called LarkTM that provides very low instrumentation costs. At the same time, its design naturally guarantees progress and strong semantics. Three key features distinguish LarkTM from existing STMs. First, it uses biased per-object, reader-writer locks [6, 33], which a thread relinquishes only when needed by another thread performing a conflicting access. This makes non-conflicting accesses fast but requires threads to coordinate when accesses conflict. Second, LarkTM detects and resolves transactional conflicts (conflicts between transactions or between a transaction and a non-transactional access) when threads coordinate, enabling flexible conflict resolution that guarantees progress. Third, LarkTM provides strong atomicity semantics with low overhead by acquiring its low-overhead locks at both transactional and non-transactional accesses.
This basic approach, which we call LarkTM-O, adds low single-thread overhead and scales well under low contention. But scalability suffers under higher contention due to the high cost of threads coordinating. We design an adaptive version of LarkTM called LarkTM-S that handles high-contention accesses, identified by profiling, using different concurrency control mechanisms.
We have implemented LarkTM-O and LarkTM-S in a high-performance Java virtual machine. We have also implemented two STMs from prior work, NOrec [15] and an STM we call IntelSTM [49], and compare them against LarkTM-O and LarkTM-S.
We evaluate overhead and scalability on a Java port of the transactional STAMP benchmarks [10]. The evaluation focuses on 1-8 threads because all STMs that we evaluate provide almost no scalability benefit for more threads, due to scalability limitations of STAMP and our parallel platform. LarkTM-O and LarkTM-S add significantly lower single-thread overhead (slowdowns of 1.40X and 1.73X, respectively) than NOrec and IntelSTM (2.88X and 3.32X, respectively).
LarkTM-O's scalability suffers due to the high cost of threads coordinating at conflicts, but LarkTM-S scales well and provides the best overall performance. For 8 application threads, LarkTM-O and LarkTM-S execute the TM programs 1.09X and 1.72X faster than NOrec, and 1.27X and 2.01X faster than IntelSTM.
Contributions. This paper makes several contributions:
• a novel STM called LarkTM that (i) adds low overhead by making non-conflicting accesses fast, (ii) provides strong progress guarantees, and (iii) supports strong semantics efficiently;
• a novel approach for integrating LarkTM's concurrency control mechanism with an existing STM concurrency control mechanism that has different tradeoffs, yielding basic and adaptive STM versions (LarkTM-O and LarkTM-S);
• implementations of (i) LarkTM-O and LarkTM-S and (ii) two high-performance STMs from prior work; and
• an evaluation on transactional benchmarks that shows that LarkTM-O and LarkTM-S achieve low overhead and good scalability, thus outperforming existing high-performance STMs.
Background, Motivation, and Related Work
Commodity hardware TM (HTM) requires a software TM (STM) fallback. But existing STMs incur high overhead in order to detect and resolve conflicts, and often provide weak progress guarantees and/or weak semantics.
HTM Is Limited and Needs STM
HTM detects and resolves conflicts by piggybacking on cache coherence protocols and provides versioning by extending caches (e.g., [24, 31, 38]). Recently, Intel's Transactional Synchronization Extensions (TSX) and IBM's Blue Gene/Q have added HTM support [56, 58]. However, this hardware support is limited: it does not guarantee completion of any transaction. In order to provide language-level support for atomic blocks, limited HTM relies on STM to execute transactions that the hardware fails to commit. Prior work on hybrid software-hardware TM has concluded that efficient STM is essential for good overall performance [5]. Furthermore, limited HTM support does not necessarily offer the best performance for short transactions. Recent evaluations of Intel TSX show that the set-up and tear-down costs of a transaction are about the same as three atomic operations (e.g., compare-and-swap instructions) [43, 58]. Our LarkTM, which avoids atomic operations altogether, may thus perform competitively with current limited HTM for short, low-contention transactions, although a comparison is beyond the scope of this paper.
Concurrency Control
A key activity of STMs is performing concurrency control: detecting and resolving conflicts between transactions and (for strongly atomic STMs) between transactions and non-transactional accesses. STMs can perform concurrency control either eagerly (at the conflicting access) or lazily (typically at commit time).
A key cost of concurrency control is synchronization, typically in the form of atomic operations (e.g., compare-and-swap) on STM metadata. Eager concurrency control typically requires that STM instrumentation use synchronization at every program memory access. By instead using lazy concurrency control, STMs can avoid such frequent synchronization, although they often incur other costs as a result.
Recent high-performance STMs typically use lazy concurrency control [15, 18, 20, 21, 41, 52] (although SwissTM detects write-write conflicts eagerly [20, 21]). A high-performance STM that we implement and compare against is NOrec, which defers conflict detection until commit time [15]. NOrec uses a single global sequence lock to commit buffered stores safely. It logs each read's value, so it can validate at commit time that the value is unchanged. Lazy concurrency control incurs overhead to log and later validate reads, and to buffer and later commit writes (although prior work suggests these overheads can be minimized with engineering effort [15, 50]).
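To make the cost structure of fully lazy concurrency control concrete, the following is a minimal Java sketch of value-based validation guarded by a single global sequence lock, in the spirit of the NOrec description above. It is not NOrec's actual implementation: the class and method names (LazyValueStm, Tx, read, write, commit) and the int-array "heap" are assumptions made for illustration, and the sketch validates only at commit time, omitting any revalidation during the transaction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Toy value-validating lazy STM over an int array (hypothetical names throughout). */
final class LazyValueStm {
    static final AtomicLong seqLock = new AtomicLong(0); // even = unlocked, odd = locked
    static final int[] heap = new int[8];                // stand-in for shared memory

    static final class Tx {
        final List<int[]> readLog = new ArrayList<>();                // {address, value seen}
        final Map<Integer, Integer> writeBuf = new LinkedHashMap<>(); // buffered stores

        int read(int addr) {
            Integer pending = writeBuf.get(addr); // a read must first look in the write set
            if (pending != null) return pending;
            int v = heap[addr];
            readLog.add(new int[] {addr, v});     // log the value for commit-time validation
            return v;
        }

        void write(int addr, int value) {
            writeBuf.put(addr, value);            // buffer the store until commit
        }

        /** Returns true if committed; false if validation failed and the caller must retry. */
        boolean commit() {
            long s;
            do { // acquire the global sequence lock: even -> odd
                s = seqLock.get();
            } while ((s & 1) != 0 || !seqLock.compareAndSet(s, s + 1));
            try {
                for (int[] r : readLog) {
                    if (heap[r[0]] != r[1]) return false; // a logged value has changed
                }
                writeBuf.forEach((addr, value) -> heap[addr] = value);
                return true;
            } finally {
                seqLock.incrementAndGet();                // release: odd -> even
            }
        }
    }
}
```

Every read pays for a write-set lookup and a log append, and every commit pays for revalidating the whole read log, which is the overhead the paragraph above attributes to lazy designs.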
Recent high-performance STMs have largely avoided using eager concurrency control for reads (so-called "visible readers"), since each read requires atomic operations on metadata (e.g., to add a reader to a reader-writer lock) [19] . A few STMs have used eager concurrency control for both reads and writes, which provides progress guarantees as we shall see, but adds substantial synchronization overhead [30, 35] .
Some STMs have used eager concurrency control for writes, but lazy concurrency control for reads (so-called "invisible reads") in order to avoid synchronization costs at reads [28, 45, 47, 49]. Notably, we implement and compare against an STM that we call IntelSTM, Shpeisman et al.'s strongly atomic version [49] of McRT-STM [45]. IntelSTM and other mixed-mode STMs detect write-write and write-read conflicts eagerly but detect read-write conflicts lazily by logging reads and validating them later.
Progress Guarantees
STMs can suffer from livelock: two or more threads' transactions repeatedly cause each other to abort and retry. STMs that use lazy concurrency control for both reads and writes can help to guarantee freedom from livelock. For example, NOrec can always commit at least one transaction among a set of concurrent transactions [15] . (Lazy mechanisms provide two additional benefits in prior work. First, they help to provide sandboxing guarantees for unsafe languages such as C and C++ [13] . In contrast, our design targets safe languages and does not require sandboxing; Section 3.6. Second, for high-contention workloads, lazy concurrency control helps make contention management, i.e., choosing which conflicting transaction to abort, more effective by deferring decisions until commit time [50] .)
Although fully lazy STMs can help to guarantee livelock freedom, they cannot generally guarantee starvation freedom, which requires not only that at least one thread's transaction eventually commits, but that every thread's transaction eventually commits. STMs that use eager concurrency control for both reads and writes, including our LarkTM, can guarantee not only livelock freedom but also starvation freedom, as long as they provide support for aborting either thread involved in a conflict (since this flexibility enables age-based contention management; Section 3.4) [23]. (An interesting related design is InvalSTM, which uses fully lazy concurrency control and allows a thread to abort another thread's transaction [22].)
In contrast, STMs such as IntelSTM that mix lazy and eager concurrency control struggle to guarantee livelock freedom: since any transaction that fails read validation must abort, all running transactions can repeatedly fail read validation and abort [23, 49] .
Transactional Semantics
Most STMs provide weak atomicity: transactions appear to execute atomically only with respect to other transactions, not non-transactional accesses. Researchers generally agree that weakly atomic STMs must provide at least single global lock atomicity (SLA) semantics [27, 37] (or a relaxed variant such as asymmetric lock atomicity [36]). Under SLA, an execution behaves as though each transaction were replaced with a critical section acquiring the same global lock. SLA (and its variants, for the most part) provide safety for so-called privatization and publication patterns, which involve data-race-free conflicts between transactions and non-transactional accesses [1, 39, 49].
To support SLA (or one of its variants), STMs often must compromise performance. For example, STMs can provide privatization safety using techniques that can hurt scalability [59], such as by committing transactions in the same order that they started [36, 51, 57], or by committing writes using a global lock [15].
A stronger memory model than SLA is strong atomicity (also called strong isolation), which provides atomicity of transactions with respect to non-transactional accesses. Strong atomicity not only provides privatization and publication safety, but it executes each transaction atomically even if it races with non-transactional accesses. Strong atomicity enables programmers to reason locally about the semantics of atomic blocks, which is particularly useful when not all non-transactional code is fully understood, tested, or trusted (e.g., third-party libraries) [47] . Unintentional and intentional data races are common in (non-transactional) real-world software and lead to erroneous behaviors; Adve and Boehm have argued that racy programs need stronger behavior guarantees [3] . Furthermore, HTM naturally provides strong atomicity, making strongly atomic STM appealing for use in hybrid TM.
Some researchers have argued that despite these benefits, strong atomicity is not worth its costs in existing STMs [12, 14] . By providing strong atomicity naturally at low cost, this paper's STM offers a new data point to consider in the tradeoff between performance and semantics.
Prior work on strongly atomic STM. Prior work has sought to reduce strong atomicity's cost. Shpeisman et al. use whole-program static analysis and dynamic thread escape analysis to identify thread-local accesses that cannot conflict with a transaction and thus do not need expensive instrumentation [49] . That paper's evaluation reports relatively low overheads but uses the simple, mostly single-threaded SPECjvm98 benchmarks.
Schneider et al. and Bronson et al. reduce strong atomicity's cost by optimistically assuming that non-transactional accesses will not access transactional data, and recompiling accesses that violate this assumption [7, 47] . In a similar spirit, Abadi et al. use commodity hardware-based memory protection to handle strong atomicity conflicts [2] . Both approaches rely on non-transactional code almost never accessing memory accessed by transactions, or else the performance penalty is substantial.
Summary
STMs have struggled to provide good performance, as well as progress guarantees and strong semantics. High-performance STMs typically use lazy concurrency control for reads (to avoid high synchronization costs) combined with lazy concurrency control for writes (to guarantee progress). However, the resulting designs incur single-thread overhead and sometimes hurt scalability. Single-thread overhead is crucial because it is the starting point for multithreaded performance. Existing STMs' performance has been poor mainly due to high single-thread overhead [11, 59].
Design
This section describes a novel STM called LarkTM. LarkTM uses instrumentation at reads and writes that adds low overhead compared to prior work. Furthermore, its design naturally supports strong progress guarantees and strong atomicity semantics.
LarkTM's concurrency control uses biased locks that make non-conflicting accesses fast, but incur significant costs for conflicting accesses. Section 3.6 describes a version of LarkTM that adaptively uses alternative concurrency control for high-conflict objects.
Biased Reader-Writer Locks
Existing STMs, whether they use lazy or eager concurrency control for writes, have generally avoided the high cost of eager concurrency control for reads (Section 2.2). Acquiring a reader lock requires an atomic operation that triggers extraneous remote cache misses at read-shared accesses. In contrast, LarkTM uses eager concurrency control for both reads and writes, by using so-called biased locks that avoid synchronization operations as much as possible [6, 33, 44, 46, 54]. LarkTM's biased reader-writer locks, which are based on prior work called Octet [6], support concurrent readers efficiently, enabling multiple concurrent readers to an object without synchronization. Furthermore, the locks naturally support conflict resolution that allows either thread to abort.
Existing STMs typically have not employed biased locking. An exception is Hindman and Grossman's STM that uses biased locks for concurrency control [32] . However, its locks do not support concurrent readers, and its conflict resolution does not support either transaction aborting.
LarkTM assigns a biased reader-writer lock to each object (e.g., the lock can be a word added to the object's header). Unlike traditional locks, each biased lock is always "acquired" for reading or writing by one or more threads. Each lock has one of the following states at any given time: WrEx T (write exclusive for thread T), RdEx T (read exclusive for T), or RdSh (read shared). A newly allocated object's lock starts in WrEx T state (T is the allocating thread).
Instrumentation before each memory access performs a lock acquire operation to ensure the accessed object's lock is in a suitable state. Table 1 shows all possible state transitions for acquiring a lock, based on the access and the current state. In the common case, the lock's state does not need to change (e.g., a read or write by T to an object locked in WrEx T state). In other cases, the acquire operation upgrades the lock's state (e.g., from RdEx T1 to RdSh at a read by T2), using an atomic operation to avoid racing with another thread changing the state.
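The three kinds of lock acquire described above (no state change, upgrade via an atomic operation, and conflict) can be sketched as a small state machine. This Java sketch is an illustration only: Table 1 is not reproduced in this text, so the transitions below follow the prose, and the names BiasedLockSketch, Kind, Outcome, and the integer thread ids are assumptions. A conflicting acquire is merely reported here; the coordination it requires is described in Section 3.2.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Classifies a lock acquire as same-state, upgrading, or conflicting (hypothetical names). */
final class BiasedLockSketch {
    enum Kind { WR_EX, RD_EX, RD_SH }
    enum Outcome { SAME_STATE, UPGRADED, NEEDS_COORDINATION }

    static final int NO_OWNER = -1; // RD_SH has no single owning thread

    /** Immutable lock state, swapped atomically as a whole. */
    static final class State {
        final Kind kind;
        final int owner;
        State(Kind kind, int owner) { this.kind = kind; this.owner = owner; }
    }

    final AtomicReference<State> state;

    BiasedLockSketch(Kind kind, int owner) {
        state = new AtomicReference<>(new State(kind, owner));
    }

    /** A newly allocated object's lock starts in WrEx state for the allocating thread. */
    static BiasedLockSketch newlyAllocatedBy(int thread) {
        return new BiasedLockSketch(Kind.WR_EX, thread);
    }

    Outcome acquire(int thread, boolean isWrite) {
        while (true) {
            State s = state.get();
            boolean mine = s.owner == thread;
            // Common case: the current state already permits the access; no atomic operation.
            if (mine && s.kind == Kind.WR_EX) return Outcome.SAME_STATE;
            if (!isWrite && (s.kind == Kind.RD_SH || (mine && s.kind == Kind.RD_EX))) {
                return Outcome.SAME_STATE;
            }
            // Upgrades: RdEx T -> WrEx T at T's write; RdEx T1 -> RdSh at T2's read.
            State next;
            if (isWrite && mine && s.kind == Kind.RD_EX) {
                next = new State(Kind.WR_EX, thread);
            } else if (!isWrite && s.kind == Kind.RD_EX) {
                next = new State(Kind.RD_SH, NO_OWNER);
            } else {
                return Outcome.NEEDS_COORDINATION; // conflicting acquire
            }
            // Atomic operation so that a racing state change is not overwritten.
            if (state.compareAndSet(s, next)) return Outcome.UPGRADED;
        }
    }
}
```

For example, a read by thread 2 of an object newly allocated by thread 1 returns NEEDS_COORDINATION, whereas repeated accesses by thread 1 return SAME_STATE without any atomic operation.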
Otherwise, the lock's state conflicts with the pending access. Consider a thread T2 that performs a conflicting read to an object o initially locked in WrEx T1 state.
T2 cannot simply change the lock's state to RdEx T2 because of the possibility that T1 will simultaneously and racily write to o. Among other issues, this race could lead to the transaction committing potentially unserializable results. Instead, each conflicting lock acquire must coordinate with thread(s) that hold the lock, to ensure they do not continue accessing the object racily. Coordination, described next, provides a natural opportunity to perform transactional conflict detection and conflict resolution.
Handling Lock Conflicts with Coordination
This section describes the coordination protocol that LarkTM uses to change a lock's state prior to a conflicting access. LarkTM extends prior work's coordination protocol [6] to perform conflict detection and resolution.
[Figure 1(a). Explicit protocol. Step (1): respT accessed an object o at some prior time.]
Before a thread, called the requesting thread, reqT, can perform a conflicting lock acquire (last four rows of Table 1), it must first coordinate with thread(s) that might otherwise continue accessing the object under the lock's old state. The thread(s) that can access the object under the lock's current state are the responding thread(s). The following explanation supposes the current state is WrEx respT or RdEx respT and thus a single responding thread respT. If the state is RdSh, reqT coordinates separately with every other thread.
Thread reqT initiates the coordination protocol by atomically changing the lock to a special intermediate state, Int reqT , which simplifies the protocol by ensuring that only one thread at a time is trying to change the object's lock's state. (Another thread that tries to acquire the same object's lock must wait for reqT to finish coordination and change the lock's state.) Then reqT sends a request to respT, and respT responds at a safe point: a program point that does not interrupt the atomicity of a lock acquire and its corresponding access. Safe points must occur periodically; language virtual machines typically already place yield points at every method entry and loop back edge, e.g., to enable timely yielding for stop-the-world garbage collection (GC). Furthermore, to avoid deadlock, any blocking operation (e.g., waiting to start GC, acquire a lock, or finish I/O) must act as a safe point. Depending on whether respT is executing normally or performing a blocking operation, reqT coordinates with respT either explicitly or implicitly.
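The request/response handshake at a safe point can be illustrated with a small Java sketch. It assumes a per-thread request queue (as the explicit protocol uses) and models the response with a latch; the names ExplicitCoordinationSketch, Responder, safePoint, and requestAndWait are hypothetical, and the sketch leaves out the intermediate lock state, the implicit protocol, and the "blocked" state that a waiting requester enters.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Request/response handshake in which respT answers only at safe points (hypothetical names). */
final class ExplicitCoordinationSketch {
    static final class Request {
        final CountDownLatch responded = new CountDownLatch(1);
    }

    static final class Responder {
        final ConcurrentLinkedQueue<Request> requests = new ConcurrentLinkedQueue<>();
        int handled = 0; // written only by the responding thread

        /** Called by respT at every safe point (e.g., method entry, loop back edge). */
        void safePoint() {
            Request r;
            while ((r = requests.poll()) != null) {
                handled++;               // conflict detection and resolution would run here
                r.responded.countDown(); // respond to the requesting thread
            }
        }
    }

    /** Called by reqT: enqueue a request on respT and wait for its response. */
    static void requestAndWait(Responder respT) {
        Request r = new Request();
        respT.requests.add(r);
        try {
            r.responded.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Runs one handshake between a worker (respT) and the caller (reqT). */
    static int demo() {
        Responder respT = new Responder();
        AtomicBoolean done = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            while (!done.get()) {   // ordinary execution punctuated by safe points
                respT.safePoint();
                Thread.yield();
            }
        });
        worker.start();
        requestAndWait(respT);      // returns only after respT reached a safe point
        done.set(true);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return respT.handled;
    }
}
```

Because the responder drains its queue only inside safePoint(), the requester never observes it between a lock acquire and the corresponding access, which is the property that safe points are meant to preserve.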
Explicit protocol. If respT is not at a blocking safe point, reqT performs the explicit protocol as shown in Figure 1(a). reqT requests a response from respT by adding itself to respT's request queue. respT handles the request at a safe point, by performing conflict detection and resolution (Sections 3.3-3.4) before responding to reqT. Once reqT receives the response, it ensures that respT will "see" that the object's lock's state has changed. During the explicit protocol, while reqT waits for a response, it enters a "blocked" state so that it can act as a responding thread for other threads performing the implicit protocol, thus avoiding deadlock.
Figure 2. A conflicting access is a necessary but insufficient condition for a transactional conflict. Solid boxes are transactions; dashed boxes could be either transactional or non-transactional.
Implicit protocol. If respT is at a blocking safe point, reqT performs the implicit protocol as shown in Figure 1(b). reqT atomically "places a hold" on respT by putting it in a "blocked and held" state. Multiple threads can place a hold on respT, so the held state includes a counter. After reqT performs conflict detection and resolution (Sections 3.3-3.4), it removes the hold by decrementing respT's held counter. If respT finishes its blocking operation, it will wait for the held counter to reach zero before continuing execution, allowing reqT to read and potentially modify respT's state safely.
After either protocol completes, reqT changes the lock's state to the new state (WrEx reqT or RdEx reqT), unless reqT aborts, in which case the protocol reverts the lock to its old state (Section 3.4).
Active and passive threads. Note that depending on the protocol, either the requesting or responding thread performs transactional conflict detection and resolution. We refer to this thread as the active thread. The other thread is the passive thread.
                     Active thread        Passive thread
Explicit protocol    Responding thread    Requesting thread
Implicit protocol    Requesting thread    Responding thread
These assignments make sense as follows. In the explicit protocol, the requesting thread is stopped while the responding thread responds, so the responding thread can safely act on both threads. In the implicit protocol, the responding thread is blocked, so the requesting thread must do all of the work.
Detecting Transactional Conflicts
Figure 2 shows how a conflicting access (a) may or (b) may not indicate a transactional conflict, depending on whether the responding thread's current transaction (if any) has accessed the object. To detect whether the responding thread has accessed the object, LarkTM maintains read/write sets. For an object locked in WrEx T or RdEx T state, LarkTM maintains the last transaction of T to access the object. For an object locked in RdSh state, LarkTM tracks whether each thread's current transaction has read the object.
When the active thread detects transactional conflicts, the coordination protocol's design ensures that the passive thread is stopped, so the active thread can safely read the passive thread's state. For each responding thread respT, the active thread detects transactional conflicts by using the read/write sets to identify the last transaction (if any) of respT to access the conflicting object. If this transaction is the same as respT's current transaction (if any), the active thread has identified a transactional conflict, so it triggers conflict resolution.
Detecting conflicts at WrEx→RdEx. It is challenging to detect conflicts precisely at a read by reqT to an object whose lock is in WrEx respT state when respT's current transaction has read but not written the object (Figure 3). If reqT later writes the object, as in Figure 3(b), upgrading the lock's state to WrEx reqT, conflict detection should report a conflict with respT. It is hard to detect this conflict at reqT's write, since o's prior access information has been lost (replaced by reqT). The same challenge exists regardless of whether reqT executes its read and write in or out of transactions.
One way to handle this case precisely is to transition a lock to RdSh in cases like reqT's read in Figures 3(a) and 3(b) , when respT's transaction has read but not written the object. This precise policy triggers a RdSh→WrEx reqT transition at reqT's write in Figure 3(b) , detecting the transactional conflict.
However, the precise policy can hurt performance by leading to more RdSh→WrEx transitions. LarkTM thus uses an imprecise policy: for a conflicting read (i.e., a read to an object locked in another thread's WrEx state), the active thread checks whether respT's transaction has performed writes or reads. Thus, in Figures 3(a) and 3(b), LarkTM detects a transactional conflict at reqT's conflicting read. We find that LarkTM's imprecise policy impacts transactional aborts insignificantly compared to the precise policy, except for the STAMP benchmark kmeans, for which the imprecise policy triggers 30% fewer aborts; however, kmeans has a low abort rate to begin with, so its performance is unchanged. Overall, the precise policy hurts performance by leading to more RdSh→WrEx transitions.
We emphasize that LarkTM's imprecise policy for handling conflicting reads does not in general lead to concurrent reads generating false transactional conflicts. Rather, false conflicts occur only in cases like Figure 3(a) , where o's lock is in WrEx respT state because respT has previously written o, but respT's current transaction has only read, not written, o.
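The detection rule and the difference between the precise and imprecise policies can be summarized in a few lines of Java. This is a sketch under stated assumptions: the per-object fields lastTx and lastTxWrote, the NO_TX sentinel, and the precise flag are hypothetical names, and only the case of an object locked in a single thread's WrEx state is modeled.

```java
/** Conflict check for an object locked in WrEx respT state (hypothetical names). */
final class ConflictDetectionSketch {
    static final long NO_TX = -1; // "not in a transaction" / "never accessed in a transaction"

    /** Per-object metadata about the owning thread's most recent transactional access. */
    static final class ObjMeta {
        long lastTx = NO_TX;  // last transaction of respT to access the object
        boolean lastTxWrote;  // whether that transaction wrote (rather than only read) it
    }

    /**
     * Returns true if reqT's conflicting access is a transactional conflict with respT.
     * A conflicting access alone is not enough: respT's current transaction must have
     * accessed the object.
     */
    static boolean isTransactionalConflict(ObjMeta o, long respCurrentTx,
                                           boolean reqIsWrite, boolean precise) {
        if (respCurrentTx == NO_TX || o.lastTx != respCurrentTx) {
            return false;     // respT's current transaction (if any) never touched the object
        }
        if (reqIsWrite) {
            return true;      // a write conflicts with respT's earlier read or write
        }
        // Conflicting read: the precise policy reports a conflict only if respT's
        // transaction wrote the object; the imprecise policy reports one either way.
        return !precise || o.lastTxWrote;
    }
}
```

Under the imprecise policy, the read in the false-conflict scenario described above (respT's transaction has only read the object) is reported as a conflict, whereas the precise policy would let it proceed.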
Resolving Transactional Conflicts
If an active thread detects a transactional conflict, it triggers conflict resolution, which resolves the conflict by aborting a transaction or retrying a non-transactional access. A key feature of LarkTM is that, by piggybacking on coordination, it can abort either conflicting thread, enabling flexible conflict resolution.
Contention management. When resolving a conflict, the active thread can abort either thread, providing flexibility for using various contention management policies [50] . LarkTM uses an agebased contention management policy [30] that chooses to abort whichever transaction or non-transactional access started more recently. This policy provides not only livelock freedom but also starvation freedom: each thread's transaction will eventually commit (a repeatedly aborting transaction will eventually be the oldest) [50] .
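The age-based choice of which side to abort can be sketched as follows. The names AgeBasedPolicy, Access, and chooseVictim are hypothetical. The sketch also assumes, consistent with the remark that a repeatedly aborting transaction eventually becomes the oldest, that a retried transaction keeps its original start time.

```java
/** Age-based contention management: abort whichever side started more recently. */
final class AgeBasedPolicy {
    /** A transaction or a non-transactional access, tagged with its start time. */
    static final class Access {
        final String name;
        final long startTime; // assumed to be kept across retries of the same transaction
        Access(String name, long startTime) {
            this.name = name;
            this.startTime = startTime;
        }
    }

    /** Returns the side to abort: the younger (more recently started) of the two. */
    static Access chooseVictim(Access a, Access b) {
        return a.startTime > b.startTime ? a : b;
    }
}
```

Because the older side always wins, a transaction that keeps its start time can lose only to transactions that started before it, and those eventually commit.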
Aborting a thread. The aborting thread abortingT chosen by contention management may be executing a transaction or a nontransactional access's lock acquire. "Aborting" a non-transactional access means retrying its preceding lock acquire.
To ensure that only one thread at a time tries to roll back abortingT's stores, the active thread first acquires a lock for abortingT. Note that another thread otherT can initiate implicit coordination with abortingT while abortingT's stores are being rolled back. If otherT triggers coordination in order to access an object that is part of abortingT's speculative state, otherT will find the object locked in WrEx abortingT state, triggering conflict resolution, which will wait on abortingT's lock until rollback finishes.
In work tangentially related to piggybacking conflict resolution on coordination, Harris and Fraser present a technique that allows a thread to revoke a second thread's lock without blocking [26] .
Retrying transactions and non-transactional accesses. After the active thread rolls back the aborting thread's speculative stores, and the lock state change completes or reverts, both threads may continue. The aborting thread sees that it should abort, and it retries its current transaction or non-transactional access.
LarkTM's Instrumentation
LarkTM adds instrumentation to every memory access to acquire a per-object reader-writer lock and perform other STM operations. The instrumentation begins with a fast-path check, which corresponds to the first three rows in Table 1. In a transaction, the instrumentation then adds the object access to the transaction's read/write set. For an object locked in WrEx or RdEx, each object keeps track of its last accessing transaction; for an object locked in RdSh, each thread tracks the objects it has read (Section 3.3). Then, for transactional writes only, the instrumentation records the memory location's old value in an undo log. Finally, the access proceeds.
LarkTM naturally provides strong atomicity by acquiring its locks at non-transactional as well as transactional accesses. While one could implement weakly atomic LarkTM by eliding non-transactional instrumentation, the semantics would be weaker than SLA (Section 2.4), e.g., the resulting STM would not be privatization or publication safe.
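The sequence of steps at a write (fast-path check, write-set update, undo logging, then the access itself) can be sketched in Java as below. This is an illustration rather than LarkTM's actual barrier: the names WriteBarrierSketch, Obj, TxThread, and slowPath are assumptions, the lock is reduced to an owner id plus a WrEx flag, and the slow path simply takes ownership where a real implementation would upgrade the lock or coordinate.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Write instrumentation with an undo log (hypothetical names; single-threaded model). */
final class WriteBarrierSketch {
    static final long NO_TX = -1;

    static final class Obj {
        int f;                // the field being written
        boolean wrEx = true;  // simplified lock: WrEx for `owner` when true
        int owner;
        long lastTx = NO_TX;  // last transaction of the owner to access this object
        Obj(int owner, int f) {
            this.owner = owner;
            this.f = f;
        }
    }

    static final class TxThread {
        final int id;
        long currentTx = NO_TX;                              // NO_TX when not in a transaction
        final Deque<Runnable> undoLog = new ArrayDeque<>();  // most recent store first

        TxThread(int id) { this.id = id; }

        void begin(long txId) { currentTx = txId; }

        void write(Obj o, int value) {
            if (!(o.wrEx && o.owner == id)) {  // fast-path check
                slowPath(o);                   // stands in for upgrade or coordination
            }
            if (currentTx != NO_TX) {          // transactional writes only
                o.lastTx = currentTx;          // record the access for conflict detection
                final int old = o.f;
                undoLog.push(() -> o.f = old); // remember the old value
            }
            o.f = value;                       // the program's write
        }

        void slowPath(Obj o) {
            o.wrEx = true;
            o.owner = id;
        }

        /** Rolls back speculative stores in reverse order. */
        void abort() {
            while (!undoLog.isEmpty()) undoLog.pop().run();
            currentTx = NO_TX;
        }

        void commit() {
            undoLog.clear();
            currentTx = NO_TX;
        }
    }
}
```

A non-transactional write takes the same fast-path check but skips the write set and undo log, which is how the same lock acquire serves both transactional and non-transactional accesses.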
Redundant instrumentation. LarkTM can avoid statically redundant instrumentation to the same object in the same transaction, which can be identified by intraprocedural compile-time dataflow analysis [6]. Instrumentation at a memory access is redundant if it is definitely preceded by a memory access that is at least as "strong" (a write is stronger than a read). Outside of transactions, LarkTM can avoid instrumenting redundant lock acquires in regions bounded by safe points, since safe points interrupt atomicity [6].
Scaling with High-Conflict Workloads
As described so far, LarkTM minimizes overhead by making non-conflicting lock acquires as fast as possible. However, conflicting lock acquires, which can significantly outnumber actual transactional conflicts, require expensive coordination. To address this challenge, we introduce LarkTM-S, which targets better scalability. We call the "pure" configuration described so far LarkTM-O since it minimizes overhead.
A contended lock state. To support LarkTM-S, we add a new contended lock state to LarkTM's existing WrEx T, RdEx T, and RdSh states. Our current design uses IntelSTM's concurrency control [49] (Section 2.2) for the contended state. IntelSTM and LarkTM are fairly compatible because they both use eager concurrency control for writes. Following IntelSTM, LarkTM-S uses unbiased locks for writes to objects in the contended state, incurring an atomic operation for every non-transactional write and every transaction's first write to an object, but never requiring coordination. For reads to an object locked in the contended state, LarkTM-S uses lazy validation of the object's version, which is updated each time an object's write lock is acquired.
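The combination of an unbiased write lock and lazily validated reads can be sketched with a single versioned lock word per object. The encoding below (even value = unlocked version, odd value = write-locked) and the names ContendedStateSketch, tryLockForWrite, unlock, and ReadLog are assumptions made for illustration; they are not taken from IntelSTM or LarkTM-S.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Unbiased write lock plus version-validated reads (hypothetical names and encoding). */
final class ContendedStateSketch {
    static final class Obj {
        final AtomicLong word = new AtomicLong(0); // even: unlocked version; odd: write-locked
        volatile int f;
    }

    /** Eager write lock: one atomic operation, no coordination. Changes the version. */
    static boolean tryLockForWrite(Obj o) {
        long w = o.word.get();
        return (w & 1) == 0 && o.word.compareAndSet(w, w + 1);
    }

    static void unlock(Obj o) {
        o.word.incrementAndGet(); // odd -> next even version
    }

    /** Per-transaction log of the versions observed at reads. */
    static final class ReadLog {
        final List<Obj> objs = new ArrayList<>();
        final List<Long> versions = new ArrayList<>();

        int read(Obj o) {
            long v = o.word.get(); // no atomic read-modify-write at a read
            int value = o.f;
            objs.add(o);
            versions.add(v);
            return value;
        }

        /** Lazy validation: fails if any read object was write-locked then or since. */
        boolean validate() {
            for (int i = 0; i < objs.size(); i++) {
                long seen = versions.get(i);
                if ((seen & 1) != 0 || objs.get(i).word.get() != seen) return false;
            }
            return true;
        }
    }
}
```

Reads stay cheap because they only load the lock word, but a transaction learns about an intervening writer only when it validates, which is why this state admits the zombie transactions discussed under "Semantics and progress".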
Our current design supports changing an object's lock to the contended state at allocation time or as the result of a conflicting lock acquire. It is safe to change a lock to contended state in the middle of a transaction because coordination resolves any conflict, guaranteeing all transactions are consistent up to that point.
Profile-guided policy. LarkTM-S decides which objects' locks to change to the contended state based on profiling lock state changes. It uses two profile-based policies. The first policy is object based: if an object's lock triggers "enough" conflicting lock acquires, the policy puts the lock into the contended state. This policy counts each lock's conflicts at run time; if a count exceeds a threshold, the lock changes to contended state. (We would rather compute an object's ratio of conflicts to all accesses, but counting all accesses at run time would be expensive.)
The object-based policy works well except when many objects trigger few conflicts each. The second, type-based policy addresses this case by identifying object types that contribute to many conflicts. The type-based policy decides whether all objects of a given type (i.e., Java class) should have their locks put in the contended state at allocation time. For each type, the policy decides to put its locks into the contended state if, across all accesses to objects of the type, the ratio of conflicting to all accesses exceeds a threshold. Our implementation uses offline profiling; a production-quality implementation could make use of online profiling via dynamic recompilation. Grouping by type enables allocating objects locked in contended state, but the grouping may be too coarse grained, conflating distinct object behaviors.
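Both profile-based policies reduce to simple threshold tests, sketched below in Java. The threshold values, the class name ContentionProfileSketch, and its methods are hypothetical; the text does not give the thresholds used, and the type-based statistics would come from an offline profiling run rather than being gathered in production.

```java
import java.util.HashMap;
import java.util.Map;

/** Object-based and type-based contended-state policies (hypothetical names and thresholds). */
final class ContentionProfileSketch {
    static final int OBJECT_CONFLICT_THRESHOLD = 4;  // illustrative value only
    static final double TYPE_CONFLICT_RATIO = 0.10;  // illustrative value only

    /** Object-based policy: count conflicting acquires per lock at run time. */
    static final class ObjectLock {
        int conflicts;
        boolean contended;

        void onConflictingAcquire() {
            if (++conflicts > OBJECT_CONFLICT_THRESHOLD) contended = true;
        }
    }

    /** Type-based policy: per-type totals gathered by profiling. */
    static final class TypeStats {
        long accesses;
        long conflicts;
    }

    final Map<String, TypeStats> byType = new HashMap<>();

    void recordAccess(String type, boolean conflicting) {
        TypeStats s = byType.computeIfAbsent(type, k -> new TypeStats());
        s.accesses++;
        if (conflicting) s.conflicts++;
    }

    /** Should objects of this type be allocated with their locks in the contended state? */
    boolean allocateContended(String type) {
        TypeStats s = byType.get(type);
        return s != null && s.accesses > 0
                && (double) s.conflicts / s.accesses > TYPE_CONFLICT_RATIO;
    }
}
```

The object-based counter needs no per-access bookkeeping, whereas the type-based ratio counts every access, which fits the observation above that counting all accesses at run time would be expensive.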
Prior work has also adaptively used different kinds of locking for high-conflict objects, based on profiling [9, 53].
Semantics and progress. Since LarkTM-S validates reads lazily, it permits so-called zombie transactions [27]. Zombie transactions can throw runtime exceptions or get stuck in infinite loops that would be impossible in any serializable execution. To handle such erroneous behavior, each transaction must validate its reads before throwing any exception, as well as periodically in loops.
Since our design targets managed languages that provide memory and type safety, zombie transactions cannot cause memory corruption or other arbitrary behaviors [13, 18, 36]. A design for unmanaged languages (e.g., C/C++) would need to check for unserializable behavior more aggressively [13].
Like IntelSTM and other mixed-mode STMs, LarkTM-S can suffer livelock, since any transaction that fails read validation must abort (Section 2.3). Standard techniques such as exponential backoff [30, 50] help to alleviate this problem. We note that LarkTM-S can in fact guarantee livelock and starvation freedom by forcing a repeatedly aborting transaction to fall back to using entirely eager mechanisms (as though it were executed by LarkTM-O). We have not yet incorporated this feature into our design or implementation.
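The combination of exponential backoff with an eventual fall back to eager mechanisms might look like the following sketch. Since the text notes that this fallback is not yet part of the design or implementation, everything here is hypothetical: the class `RetryPolicy`, the attempt limit, the base delay, and the two callbacks standing in for a lazily validated attempt and a fully eager execution.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

final class RetryPolicy {
    static final int MAX_LAZY_ATTEMPTS = 8;       // hypothetical limit
    static final long BASE_BACKOFF_NANOS = 1_000; // hypothetical base delay

    // Upper bound on the randomized wait after failed attempt number
    // `attempt` (1-based); doubles with each attempt, capped at 2^16 * base.
    static long backoffBound(int attempt) {
        int shift = Math.min(attempt - 1, 16);
        return BASE_BACKOFF_NANOS << shift;
    }

    // Runs lazyAttempt (true = committed, false = failed read validation)
    // until it commits or the limit is reached, then runs eagerFallback.
    // Returns the attempt number that completed the transaction.
    static int run(BooleanSupplier lazyAttempt, Runnable eagerFallback) {
        for (int attempt = 1; attempt <= MAX_LAZY_ATTEMPTS; attempt++) {
            if (lazyAttempt.getAsBoolean()) {
                return attempt;
            }
            long bound = backoffBound(attempt);
            LockSupport.parkNanos(
                    ThreadLocalRandom.current().nextLong(1, bound + 1));
        }
        // Repeatedly aborting: execute with fully eager concurrency control.
        eagerFallback.run();
        return MAX_LAZY_ATTEMPTS + 1;
    }
}
```

Backoff alone only makes repeated conflicts less likely; the bounded attempt count followed by the eager fallback is what would turn this into a progress guarantee.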
Comparing STMs
To enhance our evaluation, we implement and compare against two STMs from prior work: NOrec [15] and IntelSTM (the strongly atomic version of McRT-STM) [45, 49] (Section 2.2). NOrec is generally considered to be a state-of-the-art STM (e.g., recent work compares quantitatively against NOrec [8, 29, 55]) that provides relatively low single-thread overhead and (for many workloads) good scalability. Although not considered to be one of the best-performing STMs, IntelSTM is perhaps the highest-performance STM from prior work that supports strong atomicity.
Table 2 compares features and properties of our STMs and prior work's STMs. LarkTM uses biased reader-writer locks for concurrency control to achieve low overhead. NOrec and IntelSTM use lazy validation for reads in order to avoid the overhead of locking at reads, but as a result they incur other overheads such as logging reads (both), looking up reads in the write set (NOrec), and validating reads (IntelSTM).
IntelSTM, LarkTM-O, and LarkTM-S can avoid redundant concurrency control instrumentation (Section 3.5) because they use object-level locks and/or version validation. NOrec must instrument all reads fully since it validates reads using values; NOrec performs only logging (no concurrency control) at writes. None of the STMs can avoid logging at redundant writes because we have implemented an object-granularity dataflow analysis (Section 4).
NOrec provides livelock freedom (i.e., some thread's transaction eventually commits), and IntelSTM makes no progress guarantees. LarkTM-O provides starvation freedom (every transaction eventually commits) by resolving conflicts eagerly and supporting aborting either transaction. LarkTM-S can provide starvation freedom if it uses (LarkTM-O's) fully eager concurrency control for a repeatedly aborting transaction.
NOrec provides weak atomicity (SLA; Section 2.4); a strongly atomic version would need to acquire a global lock at every non-transactional store. The other STMs provide strong atomicity by instrumenting each non-transactional access like a tiny transaction.
Implementation
We have implemented LarkTM-O and LarkTM-S, and NOrec and IntelSTM, in Jikes RVM 3.1.3, a high-performance Java virtual machine [4] . Our implementations are available on the Jikes RVM Research Archive (http://jikesrvm.org/Research+Archive).
Our implementations share features as much as possible, e.g., LarkTM-S uses our IntelSTM code to handle the contended state. Our LarkTM-O and LarkTM-S implementations extend the per-object biased reader-writer locks from the publicly available Octet implementation [6].
Programming model. While our design assumes the programmer only needs to add atomic {} blocks, our implementation requires manual transformation of atomic blocks to support retry and to back up and restore local variables. These transformations are straightforward, and a compiler could perform them automatically.
Instrumentation. Jikes RVM's dynamic compilers insert LarkTM's instrumentation at all accesses in application and Java library methods. A call site invokes a different compiled version of a method depending on whether it is called from a transactional or non-transactional context. The compilers thus compile two versions of each method called from both contexts.
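The manual transformation of an atomic block mentioned under the programming model (support for retry, plus backup and restore of local variables) can be illustrated with a small sketch. The names `RetryException`, `txBegin`, and `txCommit` are stand-ins invented for this example and are not the implementation's actual API; the stub `txCommit` simply forces a configurable number of aborts so that the retry path is exercised.

```java
// Hypothetical exception used to unwind an aborted transaction.
final class RetryException extends RuntimeException {
    RetryException() {
        super("transaction aborted", null, false, false); // no stack trace
    }
}

final class AtomicBlockDemo {
    static int forcedAborts = 1; // demo only: how many commits should fail
    static int attempts = 0;     // demo only: counts transaction attempts

    // Stand-ins for the STM runtime's begin and commit operations.
    static void txBegin() {
        attempts++;
    }

    static void txCommit() {
        if (forcedAborts > 0) {
            forcedAborts--;
            throw new RetryException();
        }
    }

    // Source form:  int sum = start; atomic { for (v : data) sum += v; }
    static int sumAtomically(int start, int[] data) {
        int sum = start;
        final int savedSum = sum;    // back up locals the block may modify
        while (true) {
            try {
                txBegin();
                for (int v : data) { // body of the atomic block
                    sum += v;
                }
                txCommit();
                return sum;
            } catch (RetryException e) {
                sum = savedSum;      // restore locals, then retry
            }
        }
    }
}
```

Without the restore step, the second attempt would start from the partially updated `sum` and return a wrong result, which is why the transformation must cover local variables as well as heap state.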
We modify Jikes RVM's dynamic optimizing compiler, which optimizes hot methods, to perform intraprocedural, flow-sensitive dataflow analysis that identifies redundant accesses to the same object (Section 3.5). This analysis is at the object (not field or array element) granularity, so it cannot eliminate the instrumentation at writes that updates the undo log (T.undoLog.add(&o.f) in Section 3.5). IntelSTM, LarkTM-O, and LarkTM-S use this analysis to identify and eliminate redundant instrumentation in transactions.
In non-transactional code, LarkTM-O eliminates redundant instrumentation within regions free of safe points (e.g., method calls, loop headers, and object allocations), since LarkTM's per-object biased locks ensure atomicity interrupted only at safe points. Since any lock acquire can act as a safe point, LarkTM-O adds instrumentation in non-transactional code that executes after a lock state change and reacquires any lock(s) already acquired in the current safe-point-free region, as identified by the redundant instrumentation analysis. Eliminating redundant instrumentation in non-transactional code would not guarantee soundness for IntelSTM since it does not guarantee atomicity between safe points. However, recent work shows that statically bounded regions can be transformed to be idempotent with modest overhead [16, 48], suggesting an efficient route for eliminating redundant instrumentation. In an effort to make the comparison fair, IntelSTM eliminates instrumentation that is redundant within safe-point-free regions. LarkTM-O and IntelSTM thus use the same redundant instrumentation analysis, as does the hybrid of these two STMs, LarkTM-S.
NOrec. The original NOrec design adds instrumentation after every read, which performs read validation if the global sequence lock has changed since the last snapshot [15]. This check is needed for unmanaged languages in order to avoid violating memory and type safety. Our implementation of NOrec targets managed languages, so it safely avoids this check, improving scalability (we have found) by avoiding unnecessary read validation. Our NOrec implementation can thus execute zombie transactions.
Zombie transactions. Our implementations of NOrec, IntelSTM, and LarkTM-S can execute zombie transactions because they validate reads lazily (Section 3.6). The implementations must perform read validation prior to commit in a few cases. (NOrec only ever needs to perform read validation if the global sequence lock has changed since the last snapshot [15].) The implementations perform read validation before throwing any runtime exception from a transaction. The implementations mostly avoid periodic validation since infinite loops in zombie transactions mostly do not occur, except that NOrec has transactions that get stuck in infinite loops for three out of eight STAMP benchmarks. (NOrec presumably has more zombie behavior than IntelSTM since NOrec uses lazy concurrency control for both reads and writes.) For these three benchmarks only, we use a configuration of NOrec that validates reads (only if the global sequence lock has been updated) every 131,072 reads, which adds minimal overhead.
Conflict resolution. An aborting transaction retries using the VM's existing runtime exception mechanism. Since retrying from a safe point could leave the VM in an inconsistent state, the implementation defers retry until the next access or attempt to commit.
Contention management. To implement LarkTM's age-based contention management, we use IA-32's cycle counter (TSC) for timestamps. Timestamps thus do not reflect exact global ordering (providing exact global ordering could be a scalability bottleneck), but they are sufficient for ensuring progress.
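The periodic validation used for the three NOrec benchmarks that hit infinite loops (a check every 131,072 reads, and only when the global sequence lock has changed) can be sketched as below. The class `PeriodicValidator`, the `AtomicLong` standing in for the global sequence lock, and the `validateReadSet` callback are assumptions for illustration; a real NOrec implementation also has to wait for an in-progress committer before taking a new snapshot, which this sketch omits.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical per-transaction helper that bounds how long a zombie
// transaction can run before its reads are validated.
final class PeriodicValidator {
    static final int VALIDATION_PERIOD = 131_072; // reads between checks

    private final AtomicLong globalSequence;       // stand-in for the global sequence lock
    private final BooleanSupplier validateReadSet; // stand-in for value-based validation
    private long snapshot;
    private int readsSinceCheck;
    int validationsRun;                            // exposed for demonstration

    PeriodicValidator(AtomicLong globalSequence, BooleanSupplier validateReadSet) {
        this.globalSequence = globalSequence;
        this.validateReadSet = validateReadSet;
        this.snapshot = globalSequence.get();
    }

    // Called after each transactional read. Returns false if the
    // transaction has been found inconsistent and must abort.
    boolean afterRead() {
        if (++readsSinceCheck < VALIDATION_PERIOD) {
            return true;             // common case: one increment and compare
        }
        readsSinceCheck = 0;
        long now = globalSequence.get();
        if (now == snapshot) {
            return true;             // nothing committed since the snapshot
        }
        validationsRun++;
        if (!validateReadSet.getAsBoolean()) {
            return false;            // zombie detected
        }
        snapshot = now;
        return true;
    }
}
```

Because the full validation runs at most once per 131,072 reads, and not at all when no other transaction has committed, the added cost on the common path is a counter increment and a comparison.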
Evaluation
This section evaluates the run-time overhead and scalability of LarkTM-O and LarkTM-S, compared with IntelSTM and NOrec.
Methodology
Benchmarks. To evaluate STM overhead and scalability, we use the transactional STAMP benchmarks [10]. Designed to be more representative of real-world behavior and more inclusive of diverse execution scenarios than microbenchmarks, STAMP continues to be used in recent work (e.g., [8, 15, 20, 29]). We use a version of STAMP ported to Java by other researchers [17, 34]. We omit a few ported STAMP benchmarks because they run incorrectly, even when running single-threaded without STM on a commercial JVM. Six benchmarks run correctly, including two with both low- and high-contention workloads, for a total of eight benchmarks. Our experiments run the large workload size for all benchmarks, with the following exceptions. We run kmeans with twice the standard large workload size, since otherwise load balancing issues thwart scaling significantly. We use a workload size between the medium and large sizes for labyrinth3d and ssca2 since the large workload exhausts virtual memory on our 32-bit implementation (Jikes RVM currently targets IA-32 but not x86-64).
Although the C version of STAMP includes hand-instrumented transactional loads and stores, the STMs do not use this information. They instead instrument all transactional and non-transactional accesses, except those that are statically redundant or to a few known immutable types (e.g., String).
Deuce. For comparison purposes, we evaluate the publicly available Deuce implementation [34] of the high-performance TL2 algorithm [18]. Deuce's concurrency control is at field and array element granularity, which avoids false object-level conflicts but can add instrumentation overhead. We execute Deuce with the OpenJDK JVM since Jikes RVM does not execute Deuce correctly. Evaluating Deuce helps to determine whether overhead and scalability issues are specific to our STM implementations in Jikes RVM.
Platform and scalability. Experiments execute on an AMD Opteron 6272 system running Linux 2.6.32. It has eight 8-core processors (64 cores total) that communicate via a NUMA interconnect. Performance shows little or no improvement beyond 8 threads, and it often degrades (anti-scales). This limitation is not unique to LarkTM or even Jikes RVM: IntelSTM and NOrec, as well as Deuce executed by OpenJDK JVM, experience the same effect. The poor scalability above 8 threads is therefore due to some combination of the benchmarks and platform. The scalability of the STAMP benchmarks is limited [60], e.g., by load imbalance and communication costs. Communication between threads executing on different 8-core processors is more expensive than intraprocessor communication. Figure 4 shows the scalability of two representative programs for 1-64 threads. The STM configurations generally anti-scale for 16-64 threads for kmeans low (which is representative of kmeans high, ssca2, labyrinth3d, and intruder). For vacation low (representative of vacation high and genome), scalability is fairly flat for 16-64 threads, with some anti-scaling.
Across all STMs we evaluate, performance is not enhanced significantly by using more than 8 threads, so our evaluation focuses on 1-8 threads (with execution limited to one 8-core processor).
Appendix A repeats our experiments on an Intel Xeon platform.
Experimental setup. We build a high-performance configuration of Jikes RVM that adaptively optimizes the application as it runs. Each performance result is the median of 30 trials, to minimize the effects of any machine noise. We also show the mean, as the center of 95% confidence intervals.
Optimizations. All of our implemented STMs except NOrec perform concurrency control at object granularity, which can trigger false conflicts, particularly for large arrays divided up among threads. We refactor some STAMP benchmarks to divide large arrays into multiple smaller arrays; a production implementation could instead provide flexible metadata granularity. In addition, Jikes RVM's optimizing compiler does not aggressively perform optimizations (such as common subexpression elimination and loop unrolling and peeling) that help identify redundant LarkTM instrumentation, so we refactor four programs by applying these optimizations manually. For a fair evaluation, all STMs and the non-STM single-thread baseline execute the refactored programs.
Profile-guided decisions. LarkTM-S decides whether to change objects' locks to the contended state based on profiling (Section 3.6). In our experiments, LarkTM-S changes an object's lock to contended state after it performs 256 conflicting accesses. Sensitivity is low: varying the threshold from 1 to 1024 has little impact, except for kmeans, which performs worse for thresholds ≤128.
LarkTM-S uses offline profiling to select types (Java classes) whose instances should be locked into contended state at allocation time. The policy selects types whose ratio of conflicting to nonconflicting accesses is greater than 0.01, excluding common types such as int arrays and Object. It limits the selected types so that at most 25% of the execution's accesses are to contended objects, since otherwise the execution might as well use IntelSTM instead of LarkTM-S. Since profiling and performance runs use the same inputs, they represent a best case for online profiling.
Execution Characteristics
Table 3 reports instrumented accesses executed by the four implemented STMs during single-thread execution. (Each statistic reported in the paper is the arithmetic mean of 15 trials.) The table shows that while reads outnumber writes, writes are not uncommon. Several programs spend almost all of their time in transactions, while a few spend significant time executing non-transactional accesses. NOrec instruments more transactional accesses than the other STMs because it cannot exclude instrumentation from redundant accesses (Section 3.5). The transactional writes count does not include the undo log instrumentation that IntelSTM, LarkTM-O, and LarkTM-S add at every transactional write (Section 4).
Table 4 reports lock state transitions for LarkTM-O and LarkTM-S running STAMP with 8 application threads. The Same state column reports how many instrumented accesses require no lock state change, meaning they take the fast path. For LarkTM-O, more than 90% of accesses fall into this category for every program. Conflicting lock acquires require coordination with other thread(s) in order to change the lock's state. Although LarkTM-O achieves a relatively low fraction of lock acquires that are conflicting (always less than 5%), coordination costs affect scalability significantly.
LarkTM-S successfully avoids many conflicting transitions by using the contended state, often reducing conflicting lock acquires by an order of magnitude or more. At the same time, many same-state accesses become contended-state accesses. More than 10% of accesses are to contended objects in four programs (intruder, genome, vacation low, and vacation high).
Table 5 counts transactions committed and aborted for the four STMs implemented in Jikes RVM, running STAMP with 8 threads. Different conflict resolution and contention management policies lead to different abort rates for the STMs. Several programs have a trivial abort rate; others abort roughly 10% of their transactions. LarkTM-O and LarkTM-S have different abort rates because LarkTM-S uses IntelSTM's conflict resolution and contention management for contended accesses. Although we might expect IntelSTM's suboptimal contention management to lead to more aborts, the implementations are not comparable: LarkTM always resolves conflicts by aborting a thread, while IntelSTM waits for some time (rather than aborting immediately) for a contended lock to become available. NOrec often has the lowest abort rate, mainly (we believe) because it performs conflict detection at field and array element granularity, so its transactions do not abort due to false sharing. In contrast, the other STMs detect conflicts at object granularity. As our performance results show, abort rates alone do not predict scalability, which is influenced strongly by other factors such as LarkTM's coordination protocol and NOrec's global lock.
Performance Results
This section compares the performance of the STMs with each other and with uninstrumented, single-thread execution.
Single-thread overhead. Transactional programs execute multiple parallel threads in order to achieve high performance. Nonetheless, single-thread overhead is important because it is the starting point for scaling performance with more threads. Existing STMs have struggled to achieve good performance largely because of high instrumentation overhead (Section 2.2) [11, 59]. Figure 5 shows the single-thread overhead (i.e., instrumentation overhead) of the five STMs, compared to single-thread performance on Jikes RVM without STM, except for Deuce, which is normalized to single-thread performance on OpenJDK JVM. Deuce slows programs by almost 6X on average relative to baseline OpenJDK JVM, which we find is 33% faster than Jikes RVM on average. Our NOrec and IntelSTM implementations slow single-thread execution significantly (by 2.9X and 3.3X on average) despite targeting low overhead. NOrec in particular aims for low overhead and reports being one of the lowest-overhead STMs [15]. IntelSTM targets low overhead by combining eager concurrency control for writes with lazy read validation [49]. Yet they still incur significant costs: NOrec buffers each write, and it looks up each read in the write set and (if not found) logs the read in the read validation log. IntelSTM performs atomic operations at many writes, and it logs and later validates reads. LarkTM-O yields the lowest instrumentation overhead (1.40X on average), since it minimizes instrumentation complexity at non-conflicting accesses. LarkTM-S's single-thread slowdown is 1.73X; its instrumentation uses atomic operations and read validation for accesses to objects with locks in contended state. In single-thread execution, LarkTM-S puts objects into contended state based on offline type-based profiling only.
Table 5. Transactions committed and aborted at least once for four STMs.
An outlier is ssca2, for which NOrec performs the best, since a high fraction of its accesses are non-transactional (Table 3). While kmeans low and kmeans high also have many non-transactional accesses, the overhead of their transactional accesses, which execute in relatively short transactions, is dominant.
IntelSTM's very high overhead on labyrinth3d is related to its long transactions, which lead to large read and write sets. IntelSTM's algorithm has to validate some read set entries by linearly searching the (duplicate-free) write sets, adding substantial overhead for labyrinth3d because its write sets are often large. IntelSTM could avoid this linear search by incurring more overhead in the common case, as in a related design [28]. If we remove the validation check, IntelSTM still slows labyrinth3d's single-thread execution by 4X.
NOrec also adds high overhead for labyrinth3d. We find that whenever the instrumentation at a read looks up the value in the write set, the average write set size is about 64,000 elements. In contrast, the average write set size is at most 16 elements for any other program. Although our NOrec implementation uses a hash table for the write set, it is plausible that larger sizes lead to more expensive lookups (e.g., more operations and cache pressure).
Scalability. Figure 6 shows speedups for the STMs over non-STM single-thread execution for 1-8 threads. Each single-thread speedup is simply the inverse of the overhead from Figure 5 .
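As a worked instance of this inverse relationship, the average single-thread overheads quoted earlier translate into the speedups below. This is an illustration only: it assumes the reported averages are geometric means, for which the inverse of the mean equals the mean of the inverses; for arithmetic means the values would be approximate. Per benchmark, the relationship holds exactly.

```latex
\[
\text{speedup}_{1\text{ thread}} = \frac{1}{\text{overhead}_{1\text{ thread}}}
\]
\[
\text{LarkTM-O: } \frac{1}{1.40} \approx 0.71, \qquad
\text{LarkTM-S: } \frac{1}{1.73} \approx 0.58, \qquad
\text{NOrec: } \frac{1}{2.9} \approx 0.34, \qquad
\text{IntelSTM: } \frac{1}{3.3} \approx 0.30
\]
```

Every STM therefore starts below 1.0 at one thread, and must recover that deficit through scaling before it can outperform non-STM single-thread execution.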
Deuce, NOrec, and IntelSTM scale reasonably well overall, but they start from high single-thread overhead, limiting their overall best performance (usually at 8 threads). LarkTM-O has the lowest single-thread overhead on average, yet it scales poorly for several programs that have a high fraction of accesses that trigger conflicting transitions, particularly genome and intruder. Execution time increases for vacation low and vacation high from 1 to 2 threads because of the cost of coordination caused by conflicting lock acquires, then decreases after adding more threads and gaining the benefits of parallelism. LarkTM-S achieves scalability approaching IntelSTM's scalability because LarkTM-S effectively eliminates most conflicting lock acquires. Starting at two threads, LarkTM-S provides the best average performance by avoiding most of LarkTM-O's coordination costs while retaining most of its low-cost instrumentation benefits. Just as prior STMs have struggled to outperform single-thread execution [2, 11, 20, 59], Deuce, NOrec, and IntelSTM are unable, on average, to outperform non-STM single-thread execution. In contrast, LarkTM-O and LarkTM-S are 1.07X and 1.69X faster, respectively, than (non-STM) single-thread execution.
Summary. Across all programs, LarkTM-O provides the lowest single-thread overhead, NOrec and IntelSTM typically scale best, and LarkTM-S does well at both.
Conclusion
LarkTM's novel design provides low overhead, progress guarantees, and strong semantics. LarkTM-O provides the lowest overhead, and the best performance for low-contention workloads. LarkTM-S uses mixed concurrency control, yielding the best overall performance, outperforming existing high-performance STMs.
Figure 7 shows speedups for each STAMP benchmark and the geomean. Single-thread overhead and scalability are similar across both platforms. As on the AMD platform, NOrec, IntelSTM, and LarkTM-O have similar performance on average on the Intel platform, although LarkTM-O performs slightly worse in comparison on the Intel platform. On both platforms, LarkTM-S significantly outperforms the other STMs on average. Figure 8 shows scalability for 1-32 threads for the same two representative STAMP benchmarks as Figure 4. Although on vacation low the STMs may seem to scale better on the Intel machine, we note that Figure 8 evaluates only 1-32 threads.
