Hybrid Transactional Memory (TM) uses available hardware TM resources to execute language-level transactions, and falls back to a software TM implementation for those transactions that cannot complete in hardware. Ideally, a hybrid TM would allow hardware and software transactions to run concurrently, but would not waste hardware TM resources on coordination between the two classes of transactions. In addition, it should scale well, incur little latency, offer strong safety guarantees, and provide some degree of fairness.
Introduction
Since Hybrid Transactional Memory (TM) was first proposed [6], hardware TM (HTM) support has become available in microprocessors from IBM [11, 21] and Intel [10]. These HTM systems are "best effort", meaning that they do not guarantee that any transaction attempt will commit successfully. Failures may arise for many reasons, including conflicts with other transactions, memory footprints that exceed the HTM capacity, system calls, and timer interrupts. The goal of Hybrid TM (HyTM) is to exploit best-effort HTM whenever possible, and fall back to software TM (STM) when a transaction cannot complete in hardware [6]. This approach promises to scale well and incur low latency when most transactions complete in hardware, with worst-case overhead and scalability comparable to the underlying STM.
The traditional approach to implementing HyTM is to begin with an STM, and try to accelerate it using HTM. Early STM algorithms required interaction with per-location metadata, and hybrid versions of these algorithms wasted limited hardware capacity on this metadata [6, 12, 16]. Worse yet, false sharing of cache lines that held metadata could result in additional HTM aborts, and increased fallback to the STM path. The use of NOrec STM [5] as a baseline enabled HyTM algorithms to avoid per-access overheads. In NOrec-based HyTM [4, 14, 16], a sequence lock serializes the commit of the STM, and all conflicts are detected by comparing the values read by transactions. However, NOrec-based HyTM algorithms suffer from a scalability bottleneck, since hardware transactions must read, and often write, the sequence lock. Aborts from these accesses could be avoided if the hardware allowed nontransactional accesses [4], but the accesses themselves are necessary. Furthermore, if these accesses are delayed until the end of the transaction [2, 3], the TM ceases to provide the minimum safety requirement of opacity [9], and it can admit erroneous behavior [7]. Conversely, "eager subscription" to the metadata for coordinating hardware and software transactions causes all hardware transactions to abort on any software commit.
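To make the NOrec-style scheme described above concrete, the following is a minimal, single-threaded sketch of a sequence lock combined with value-based validation. It is an illustrative model only, not the implementation from [5] or from any HyTM cited here: the class name `NOrecSketch`, its methods, and the dictionary used as "memory" are all hypothetical, and transactions are interleaved by hand rather than run on real threads or real HTM.

```python
class NOrecSketch:
    """Toy model of a sequence lock plus value-based validation (assumed names)."""

    def __init__(self, memory):
        self.memory = dict(memory)  # shared locations: name -> value
        self.seqlock = 0            # bumped twice per writer commit (odd while writing)

    def begin(self):
        # Each transaction remembers the sequence number it last validated against.
        return {"snapshot": self.seqlock, "reads": {}, "writes": {}}

    def _validate(self, tx):
        # Value-based validation: every value read so far must still be in memory.
        return all(self.memory[key] == seen for key, seen in tx["reads"].items())

    def read(self, tx, key):
        if key in tx["writes"]:
            return tx["writes"][key]  # read-your-own-write from the buffer
        if self.seqlock != tx["snapshot"]:
            # A writer committed since our snapshot, so re-check earlier reads.
            if not self._validate(tx):
                raise RuntimeError("abort: a value read earlier has changed")
            tx["snapshot"] = self.seqlock
        value = self.memory[key]
        tx["reads"][key] = value
        return value

    def write(self, tx, key, value):
        tx["writes"][key] = value  # buffered until commit

    def commit(self, tx):
        if not tx["writes"]:
            return True  # read-only transactions never touch the lock
        if self.seqlock != tx["snapshot"] and not self._validate(tx):
            return False  # conflict detected by comparing values
        self.seqlock += 1  # "acquire": writer commits are serialized here
        self.memory.update(tx["writes"])
        self.seqlock += 1  # "release"
        return True
```

In this model, every writer commit modifies `seqlock`. A hardware transaction that reads the same word early in its execution would therefore be aborted by any such commit, which is the eager-subscription cost noted above.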
The most recent innovation in HyTM is to add hardware acceleration to the STM path, as in Reduced Hardware NOrec (RHNOrec) [14]. The resulting "reduced transaction" technique transforms certain software transactions into hardware transactions, thereby avoiding fallback to a slow STM. The current state of the art achieves performance comparable to Hybrid NOrec, but does not require nontransactional loads.
A common assumption among HyTM algorithms is that STM and HTM transactions should coexist at any time, with neither favored over the other. In contrast, PhaseTM [13] required all transactions to use the same mode, whether HTM, STM, or serialized on a single lock. Mode switches were expensive, but in return the HTM mode had no overhead for interacting with STM. The most popular HyTM in practice today is a PhaseTM that switches between HTM mode and a single global lock [24]. If we accept that HTM capacities are more likely to increase than to decrease, then we may assume that STM fallback will grow increasingly rare. However, as core counts rise, fallback to a single lock becomes increasingly untenable. These observations motivate our approach. We seek to make the common case (HTM) as fast as possible, by avoiding interaction with (unlikely) concurrent software transactions. When a software transaction is needed, we want it to finish as quickly as possible, to limit its impact on current and future hardware transactions. We also require the HyTM to be opaque.
The innovation we propose is to prioritize software transactions while they are running, by augmenting the Cohorts algorithm [18]. In Cohorts, transactions block at their commit point, until such time as all threads are either (a) ready to commit a transaction, or (b) not executing a transaction. This allows software transactions to avoid any high-latency global metadata accesses during execution. In Hybrid Cohorts (HyCo), we prevent hardware transactions from committing when software transactions are in-flight. We also apply the reduced transaction technique to the Cohorts commit phase, which prevents blocking and eliminates a bottleneck from Cohorts STM. The net effect is an opaque HyTM that scales well and avoids bottlenecks for hardware transactions.
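The commit rule described above can be illustrated with a small counter-based model. This is a hedged sketch under simplifying assumptions, not the HyCo state machine itself: the class `CohortGateSketch` and its method names are hypothetical, the counters are plain integers rather than atomic shared variables, and the sketch does not model how new transactions are admitted while a cohort is committing.

```python
class CohortGateSketch:
    """Toy model of the Cohorts commit gate with a hardware check (assumed names)."""

    def __init__(self):
        self.running = 0  # software transactions still executing
        self.ready = 0    # software transactions waiting at their commit point

    def sw_begin(self):
        self.running += 1

    def sw_reach_commit_point(self):
        # The transaction stops executing and waits for the rest of its cohort.
        self.running -= 1
        self.ready += 1

    def sw_may_commit(self):
        # Every thread is either ready to commit or not executing a transaction.
        return self.ready > 0 and self.running == 0

    def sw_finish_commit(self):
        self.ready -= 1

    def hw_may_commit(self):
        # Hardware transactions may not commit while any software
        # transaction is in flight (executing or waiting to commit).
        return self.running == 0 and self.ready == 0


gate = CohortGateSketch()
gate.sw_begin()
gate.sw_begin()
gate.sw_reach_commit_point()
print(gate.sw_may_commit())  # False: one software transaction is still running
gate.sw_reach_commit_point()
print(gate.sw_may_commit())  # True: the whole cohort is at its commit point
```

In this model a hardware transaction pays only for one check at its commit point, and only while software transactions exist, which matches the stated goal of keeping the common (HTM) case free of coordination overhead.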
The remainder of this paper is organized as follows. In Section 2, we discuss the overall approach of the Hybrid Cohorts algorithm, with a focus on the state machine that governs transaction behavior. Section 3 presents the pseudocode for one implementation of the state machine, which aims to limit the impact on transactions that use HTM resources throughout their execution. In Section 4, we
