Implementation Tradeoffs in the Design of Flexible Transactional Memory Support by Arrvindh Shriraman et al.
Implementation Tradeoffs in the Design of
Flexible Transactional Memory Support
Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott
Department of Computer Science, University of Rochester
{ashriram,sandhya,scott}@cs.rochester.edu
Abstract
We present FlexTM (FLEXible Transactional Memory), a high performance TM framework that
allows software to determine when (eagerly, lazily, or in a mixed fashion) and how to manage con-
ﬂicts, while employing hardware to manage transactional state and to track conﬂicts. FlexTM co-
ordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-
thread access sets; per-thread conﬂict summary tables (CSTs), which identify the processors with
which conﬂicts have occurred; Programmable Data Isolation, which buffers speculative updates
in the local cache and uses an overﬂow table to handle unbounded updates; and Alert-On-Update,
which notiﬁes a thread immediately when a speciﬁed location is written by another processor. The
CSTs enable an STM-inspired commit protocol that manages conﬂicts in a decentralized manner
(no global arbitration) and allows parallel commits.
We explore the implementation tradeoffs associated with FlexTM’s versioning and conflict
detection mechanisms. Our results demonstrate that FlexTM exhibits 5× speedup over high-quality
software TMs, and 1.8× speedup over hybrid TMs (those with software always in the loop), with
no loss in policy ﬂexibility. We ﬁnd that the distributed commit protocol improves performance
by 2–14% over an aggressive centralized-arbiter mechanism (similar to BulkTM [7]) that also al-
lows parallel commits. Finally, we compare the use of an aggressive hardware controller (as used
in the base FlexTM design) to manage and to access any speculative transaction state overﬂowed
from the cache, to a hardware-software approach dubbed FlexTM-S (FlexTM-Streamlined), where
software manages the overﬂow region but uses a metadata cache to accelerate speculative data re-
placements and their subsequent accesses. We demonstrate that FlexTM-S’s performance is within
10% of FlexTM’s despite its substantially simpler virtualization mechanism.
1. Introduction
Transactional Memory (TM) addresses one of the key challenges of programming multi-core sys-
tems: the complexity of lock-based synchronization. At a high level, the programmer or compiler
labels sections of the code in a single thread as atomic. The underlying system is expected to
execute this code in an all-or-nothing manner, and in isolation from other transactions, while ex-
ploiting as much concurrency as possible.
Preprint submitted to JPDC Special Issue on Transactional Memory, February 28, 2010
Most TM systems execute transactions speculatively, and must thus be prepared for data
conflicts, when concurrent transactions access the same location and at least one of the accesses is
a write. A conflict detection mechanism is needed to identify such conflicts so that the system
can ensure that transactions don’t perform erroneous externally visible actions as a result of an
inconsistent view. The conﬂict resolution time decides when the detected conﬂicts (if they still
persist) are managed. To resolve conﬂicts, a conﬂict manager is responsible for the policy used to
arbitrate among conﬂicting transactions and decide which should abort. Most TM systems blend
detection and resolution: pessimistic (eager) systems perform both as soon as possible; optimistic
(lazy) systems delay conﬂict resolution until commit time (although they may detect conﬂicts ear-
lier). TM systems must also perform version management, either buffering new values in private
locations (a redo log) and making them visible at commit time, or buffering old values (an undo
log) and restoring them on aborts. In the taxonomy of Moore et al. [31], undo logs are considered
an orthogonal form of eagerness (they put updates in the “right” location optimistically); redo logs
are considered lazy.
The mechanisms required for conﬂict detection, conﬂict resolution and management, and ver-
sion management can be implemented in hardware (HTM) [1, 19, 21, 31, 32], software (STM) [14,
15, 20, 27, 33], or some hybrid of the two in a hardware-accelerated TM (HaTM) [13, 23, 30, 39].
Full hardware systems are typically inﬂexible in policy, with ﬁxed choices for eagerness of conﬂict
resolution, strategies for conﬂict arbitration and back-off, and eagerness of versioning.
Software-only systems are typically slow by comparison, at least in the common case. Several
systems [7, 39, 46] have advocated decoupling the hardware components of TM, giving each a
well-deﬁned API that allows them to be implemented and invoked independently. Hill et al. [22]
argue that decoupling makes it easier to reﬁne an architecture incrementally. At ISCA’07 [39], we
argued that decoupling helps to separate policy from mechanism, allowing software to choose a
policy dynamically. Both groups suggest that decoupling may allow TM components to be used
for non-transactional purposes [22][39, TR version].
Several papers have identiﬁed performance pathologies with certain policy choices (eagerness
of conﬂict resolution; management policy and back-off strategy) in certain applications [5, 37,
39, 42]. RTM [39] promotes policy ﬂexibility by decoupling version management from conﬂict
detection and management—speciﬁcally, by separating data and metadata, and performing con-
ﬂict detection only on the latter. While RTM’s conﬂict detection mechanism enforces immediate
conﬂict resolution, software can choose (by controlling the timing of metadata inspection and
updates) when conﬂicts are resolved. Unfortunately, metadata management imposes noticeable
performance overheads and complicates the programming interface [39].
The FlexTM Approach We propose FlexTM (FLEXible Transactional Memory) [40], a TM
design that separates conﬂict detection from resolution and management, and leaves software in
charge of the latter. Simply put, hardware always detects conﬂicts eagerly during the execution
of a transaction and records them, but software chooses when to notice and what to do about it.
Unlike proposed eager systems, FlexTM allows conflicting transactions to execute concurrently to
uncover potential parallelism. Unlike proposed lazy systems, FlexTM doesn’t postpone detection
to the commit stage, which permits a lightweight commit. FlexTM employs a set of decoupled
(separable) hardware primitives to support version management and conﬂict detection without the
need for sophisticated software metadata, which improves performance.
Speciﬁcally, FlexTM deploys four hardware mechanisms: (1) Bloom ﬁlter signatures (as in
Bulk [7] and LogTM-SE [46]) to track and summarize a transaction’s read and write sets; (2)
Conﬂict Summary Tables (CSTs) to concisely capture conﬂicts between transactions; (3) the ver-
sioning system of RTM (programmable data isolation—PDI), augmented with a hardware con-
troller to maintain the cache overﬂows in a pre-allocated region; and (4) RTM’s Alert-On-Update
mechanism to help transactions respond to changes in their status. The hardware structures are
fully visible to software and under its control, which enables FlexTM to handle context switching
and paging.
A key contribution of FlexTM is a commit protocol that arbitrates between transactions in a
distributed fashion, allows parallel commits of an arbitrary number of transactions, and imposes
performance penalties proportional to the number of transaction conﬂicts. The protocol enables
lazy conﬂict resolution without commit tokens [19], broadcast of write sets [7, 19], or ticket-based
serialization [9]. To our knowledge, FlexTM is the ﬁrst hardware TM in which the decision to
commit or abort can be an entirely local operation, even when performed lazily by an arbitrary
number of threads in parallel.
Simplifying the Overﬂow Mechanism In FlexTM [40], a notable source of complexity is that
a hardware controller is used to manage transactional state evicted from the cache. Hardware is
expected to maintain the overﬂowed cache blocks in an overﬂow table and has to perform all the
management tasks (i.e., insert, remove, lookup, commit).
We also investigate a hardware-software approach to handling overﬂowed state. Speciﬁcally,
we develop light-weight ﬁne-grain address remapping support. On an overﬂow, software sets up
the buffer space and provides hardware fast access to the metadata (which provides the address of
the buffer space allocated for the original location) via a metadata cache (SM-cache). Hardware
uses this information to write back and access the block from the buffer space region. The extra
hardware needed for overﬂow management is limited to the SM-cache, installed as a lookaside on
the L1 miss path. We evaluate this simpliﬁed overﬂow mechanism in an alternative TM system,
FlexTM-S (FlexTM-Streamlined).
Simplifying Conﬂict Resolution In order to support Lazy conﬂict resolution, the conﬂict de-
tection and versioning mechanisms must handle potentially multiple (as many as there are sharers)
copies of transactional data. In conjunction with FlexTM-S, we explore mixed conﬂict resolu-
tion, which resolves write-write conﬂicts eagerly while allowing lazy read-write conﬂict resolu-
tion [38, 41]. Mixed resolution enables FlexTM-S to employ a simpler versioning mechanism
(only one speculative data version like Eager detection) and to precisely identify writer conﬂicts
(only one writer like Eager detection).
We have implemented FlexTM on a 16-core (1 thread/core) CMP prototype on the Sim-
ics/GEMS simulation framework. We interface the private L1s with the shared L2 using a
directory-based protocol. We investigate performance using the STAMP [30] and STMBench7 [18]
workload suites. Our results suggest that FlexTM’s performance is comparable to that of
fixed-policy HTMs, and 1.8× and 5× better than that of hybrid TMs and plain STMs, respectively. We
demonstrate that the CST-based commit process in FlexTM can avoid significant latency and
serialization penalties associated with globally-arbitrated commit mechanisms in other lazy systems.
Finally, comparing with various virtualization techniques, we demonstrate that the more
complexity-effective FlexTM-S suffers modest performance loss compared to FlexTM (<10%) and performs
much better (≈2×) compared to other previously proposed virtualization designs [30, 11].
2. Related Work
Larus and Rajwar [24] provide an excellent summary of transactional memory research up to Fall
2006. We ﬁrst categorize the design space for both versioning and conﬂict detection mechanisms,
indicating the relation of FlexTM to other proposed protocols.
Most HTM designs have similar approaches to small transactions, exploiting coherence for con-
ﬂict detection and (possibly) the cache hardware for versioning. However, overﬂowed state intro-
duces virtualization challenges, and proposed implementations vary signiﬁcantly. HTMs need to
address two requirements to support overﬂowed state: (1) a conﬂict detection mechanism to track
concurrent accesses and conﬂicts for locations evicted out of the caches and coherence framework
and (2) a versioning mechanism to maintain new and old values of data.
Conﬂict Detection TM systems need a mechanism to track the locations read and written by
a transaction and to ensure that there is no overlap of a transaction’s accesses with the write set of
a concurrent transaction. The implementation choices can be broadly classiﬁed as:
• Software Instrumentation: The TM runtime system can be implemented as a set of software
barriers on load and store instructions. These barriers gather information about the accesses,
maintain the data structures (e.g., transaction read and write sets), and query the data struc-
tures to perform conﬂict detection. They verify that an accessed location hasn’t changed,
and ensure that an aborted transaction doesn’t perform any externally visible actions prior to
discovering its status. In addition to imposing the instrumentation overhead that limits gains
from concurrency, the barriers add to cache pressure since they access additional metadata
on each memory access.
• Hardware Acceleration: Hardware support can remove the bottlenecks associated with soft-
ware instrumentation by using the cache to track accesses and piggybacking on coherence to
detect conﬂicts. There are tradeoffs associated with different hardware options: Bloom-ﬁlter
based signatures [7, 46] are simple to design but are prone to false-positive based perfor-
mance problems. Per-location metadata [6] is precise but requires support at all levels in the
memory hierarchy and modiﬁcations to the coherence protocol. Cache tag bits either require
a software algorithm [13] or a complex hardware controller [1] to handle evictions.
• Virtual Memory: OS page tables include protection bits to implement process isolation.
TM systems can exploit these protection bits to set up read-only and read/write permissions
at a page granularity and trap concurrent accesses to detect conﬂicts [10, 11]. The major
performance overheads involved are the TLB shootdowns and OS intervention required.
Versioning The conﬂict resolution policy governs the choice of versioning mechanism. Lazy
allows concurrent transactions to read or write a shared location thus necessitating a redo log
in order to avoid irreversible actions while Eager detects conﬂicts prior to the access, thereby
accommodating both forms of logging. The undo log is used to restore values if a transaction
aborts while the redo log is used to copy-update the original locations on commit; these actions
need to occur in an atomic manner for all the locations in the log. Most importantly, since a redo
log buffers new values, it needs to intervene on all other accesses to ensure that a thread reads its
own writes; this dictates the data structure used to maintain the new values (typically a hash table).
An undo log approach can make do with a simpler data structure (e.g., a dynamically resizable
array or vector) and typically doesn’t need to optimize the access cost since it is traversed only on
an abort.
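As an illustration of the data-structure difference described above, the following Python sketch contrasts the two logging styles. It is a minimal software model under our own assumptions, not the versioning mechanism of any system in Table 1: the class names RedoLogTx and UndoLogTx are hypothetical, memory is modeled as a dictionary, and the atomicity of commit and abort is not modeled.

```python
class RedoLogTx:
    """Lazy versioning sketch: new values stay in a private hash table until commit."""

    def __init__(self, memory):
        self.memory = memory      # shared state, modeled as {address: value}
        self.redo = {}            # address -> speculative value

    def read(self, addr):
        # Every read must consult the log so the thread sees its own writes.
        if addr in self.redo:
            return self.redo[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.redo[addr] = value   # original location is untouched

    def commit(self):
        self.memory.update(self.redo)   # copy-update the original locations
        self.redo.clear()

    def abort(self):
        self.redo.clear()         # nothing to restore


class UndoLogTx:
    """Eager versioning sketch: writes go in place, old values go to an append-only log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo = []            # (address, old value) pairs in write order

    def read(self, addr):
        return self.memory[addr]  # no log lookup on the read path

    def write(self, addr, value):
        self.undo.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def commit(self):
        self.undo.clear()         # updates are already in place

    def abort(self):
        # The log is traversed only here, newest entry first.
        for addr, old in reversed(self.undo):
            self.memory[addr] = old
        self.undo.clear()
```

In this sketch the redo log pays a lookup on every read, which is why it uses a hash table, whereas the undo log is a plain list that is walked only on abort.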
Like conﬂict detection, versioning can be implemented either with software handlers, hardware
acceleration, or virtual memory (i.e., translation information in the page tables) with performance
and complexity tradeoffs similar to conﬂict detection. The software approach adds barriers to all
writes (to set up the log data structures) and possibly to all reads (to return values buffered in a redo
log) and leads to signiﬁcant degradation in performance. The hardware approach adds signiﬁcant
complexity, including new state machines that interact in a non-trivial manner with the existing
memory hierarchy. The virtual memory approach reuses existing hardware and OS support, but
suffers the performance overheads of having to perform page granularity cloning and buffering. An
important difference between the mechanisms that implement versioning and conﬂict detection is
that versioning deals with data values (no false positives or negatives) and cannot trade precision
for complexity-effectiveness as conﬂict detection can (e.g., using signatures).
Table 1: Virtualization in TM
System             Conflict Resolution   Conflict Detection   Versioning
UTM [1]            Eager                 H (controller)       H (undo log)
VTM [32]           Eager                 H (microcode)        S (redo log)
XTM [11]           Eager/Lazy            VM                   VM (redo log)
PTM-Select [10]    Eager                 H (controller)       VM (undo log)
LogTM-SE [46]      Eager                 H (signature)        H (undo log)
TokenTM [6]        Eager                 H (ECC)              H (undo log)
Hybrid Systems
SigTM [30]         Eager/Lazy            H (signature)        S (redo log)
UFO TM [2]         Eager                 H (ECC)              S (undo log)
HyTM [13]          Eager                 S                    S (undo log)
RTM [39]           Eager/Lazy            S                    S (redo log)
FlexTM [40]        Eager/Lazy            H (signature)        H/S (redo log)
H = Hardware Acceleration; S = Software Instrumentation; VM = Virtual Memory
Hybrids (other than SigTM) use a best-effort HTM for small transactions.
Table 1 speciﬁes the mechanism used by various extant TM systems. UTM [1] and VTM [32]
both implement overﬂow support for redo logs. On a cache miss in UTM, a hardware controller
walks an uncacheable in-memory data structure that speciﬁes access permissions. VTM employs
tables maintained in software and uses software routines to walk the table only on cache misses
if an overﬂow signature indicates that the block has been speculatively modiﬁed. VTM and UTM
support only eager resolution of conﬂicts. XTM [11] and PTM [10] use virtual memory, and accept
the costs of coarse granularity and OS overhead.
LogTM-SE [46] integrates the undo log mechanism of LogTM [31] with Bulk-style signa-
tures [7]. It supports efﬁcient virtualization (i.e., context switches and paging), but this is closely
tied to eager versioning (undo logs), which cannot support Lazy systems. Since LogTM-SE does
not allow transactions to abort one another, it is possible for running transactions to “convoy”
behind a suspended transaction. Like LogTM-SE, TokenTM [6] uses undo logs to implement
versioning, but implements conﬂict detection using a hardware token scheme.
Hybrid TMs [13, 23] allow hardware to handle common-case bounded transactions and fall
back to software for transactions that overﬂow time and space resources. To allow hardware and
software transactions to co-exist, Hybrid TMs must maintain metadata compatible with the fall-
back STM and use policies compatible with the underlying HTM. SigTM [30] employs hardware
signatures for conflict detection but uses an always-on, TL2-style [14] software redo log for
versioning. Like hybrid systems, it suffers from per-access metadata bookkeeping overheads. It restricts
conﬂict management policy (speciﬁcally, only self aborts) and requires expensive commit-time
arbitration on every speculatively written location.
RTM [39] explored hardware acceleration of STM. Speciﬁcally, it introduced (1) Alert-On-
Update (AOU), which triggers a software handler when pre-speciﬁed lines are modiﬁed remotely,
and (2) Programmable Data Isolation (PDI), which buffers speculative writes in (temporarily inco-
herent) local caches. Unfortunately, to decouple version management from conﬂict detection and
management, RTM software had to segregate data and metadata, retaining much of the bookkeep-
ing cost of all-software TM systems.
3. FlexTM Architecture
FlexTM provides hardware mechanisms for access tracking, conﬂict tracking, versioning, and ex-
plicit aborts. We describe these separately and then discuss how they work together. Figure 2
shows all the hardware components we add to the system.
3.1. Access Tracking: Signatures
Like Bulk [7], LogTM-SE [46], and SigTM [30], FlexTM uses Bloom ﬁlter signatures [3] to
summarize the read and write sets of transactions in a concise but conservative fashion (i.e., false
positives but no false negatives). Signatures decouple conﬂict detection from critical L1 tag arrays
and enable remote requests to test for conﬂicts using local processor state without walking in-
memory structures, as might be required in UTM [1] and VTM [32] in the case of overﬂow. Every
FlexTM processor maintains a read signature (Rsig) and a write signature (Wsig) for the current
transaction. The signatures are updated by the processor on transactional loads and stores. They
allow the controller to detect conﬂicts when it receives a remote coherence request.
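The conservative behavior of a signature (false positives but no false negatives) can be seen in a small software model. The sketch below is illustrative only: the class name Signature, the 2048-bit width, the four hash functions, and the use of SHA-256 are our assumptions for the example, and a hardware signature would use far cheaper hash functions.

```python
import hashlib


class Signature:
    """Bloom-filter summary of a set of cache-line addresses (illustrative model)."""

    def __init__(self, bits=2048, hashes=4):
        self.bits = bits          # assumed width, not taken from the paper
        self.hashes = hashes      # assumed number of hash functions
        self.vector = 0           # bit vector stored as a Python integer

    def _positions(self, line_addr):
        # Derive one bit position per hash function; SHA-256 is a stand-in hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{line_addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.bits

    def insert(self, line_addr):
        """Record a transactional access to this line."""
        for p in self._positions(line_addr):
            self.vector |= 1 << p

    def might_contain(self, line_addr):
        """True for every inserted line; may also be True for lines never inserted."""
        return all((self.vector >> p) & 1 for p in self._positions(line_addr))

    def clear(self):
        """Flash-clear at transaction commit or abort."""
        self.vector = 0
```

A remote request for a line would be tested with might_contain on the local Rsig and Wsig models; a True result is treated as a conflict even if it is a false positive.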
3.2. Conﬂict Tracking: CSTs
Existing proposals for both Eager [1, 31] and Lazy [30, 7, 19] systems track conﬂicts on a cache-
line-by-cache-line basis. FlexTM, by contrast, tracks conﬂicts on a processor-by-processor basis
(virtualized to thread-by-thread). Speciﬁcally, each processor has three Conﬂict Summary Tables
(CSTs), each of which contains one bit for every other processor in the system. Named R-W, W-R,
and W-W, the CSTs indicate that a local read (R) or write (W) has conﬂicted with a read or write
(as suggested by the name) on the corresponding remote processor. The W-R and W-W lists at a
processor P represent the transactions that might need to be aborted when the transaction at P wants
to commit. The R-W list helps disambiguate abort triggers; if an abort is initiated by a processor
not marked in the CST, the transaction can safely avoid the abort. On each coherence request, the
controller reads the local Wsig and Rsig, sets the local CSTs accordingly, and includes information
in its response that allows the requestor to set its own CSTs to match. While CSTs can be read and
written independently, they do require interfacing with a mechanism to detect conﬂicts when they
occur. In FlexTM, we use signatures to detect conﬂicts, but the CSTs could be adapted to interface
with any of the hardware metadata schemes discussed in Section 2.
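To make the per-processor bookkeeping concrete, the following Python sketch models the three CSTs as bit vectors and records a conflict on both sides of a coherence exchange. It is a simplified model under our own assumptions: the names CSTs, record_conflict, and commit_victims are hypothetical, processors are identified by small integers, and the accesses are assumed to have already been classified as reads or writes by the signature test.

```python
from dataclasses import dataclass


@dataclass
class CSTs:
    """Three conflict summary tables, one bit per remote processor."""
    r_w: int = 0   # local read conflicted with a remote write
    w_r: int = 0   # local write conflicted with a remote read
    w_w: int = 0   # local write conflicted with a remote write


_TABLE = {("R", "W"): "r_w", ("W", "R"): "w_r", ("W", "W"): "w_w"}


def _set(cst, local_op, remote_op, remote_id):
    name = _TABLE[(local_op, remote_op)]
    setattr(cst, name, getattr(cst, name) | (1 << remote_id))


def record_conflict(procs, requestor, req_op, responder, resp_op):
    """Set matching bits at both ends; ops are 'R' or 'W'."""
    if req_op == "R" and resp_op == "R":
        return                                  # two reads never conflict
    _set(procs[requestor], req_op, resp_op, responder)
    _set(procs[responder], resp_op, req_op, requestor)


def commit_victims(cst):
    """Processors whose transactions might need to abort before this one commits."""
    mask = cst.w_r | cst.w_w
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, a transactional store by processor 0 that hits the read set of processor 2 sets bit 2 of processor 0's W-R table and bit 0 of processor 2's R-W table.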
3.3. Versioning Support: PDI
RTM [39] proposed programmable data isolation (PDI) that allowed software to exploit incoher-
ence (when desired). It proposed a Transactional-MESI (TMESI) snooping protocol that supported
multiple speculative writers and exploited the inherent buffering capabilities of private caches to
isolate the potentially multiple speculative copies. Programs use explicit TLoad and TStore in-
structions to inform the hardware of transactional memory operations: TStore requests isolation of
a speculative write, whose value will not propagate to other processors until commit time. TLoad
allows local caching of (non-speculative versions of) remotely TStored lines. When speculatively
modiﬁed state ﬁts in the private cache, PDI avoids the latency and bandwidth penalties of logging.
FlexTM adapts TMESI to a directory protocol and simpliﬁes the management of speculative
reads, adding only two new stable states to the base MESI protocol, rather than the ﬁve employed
in RTM. The TMESI protocol is derived from the SGI ORIGIN 2000 [25] with support for silent
evictions. Directory information is maintained at the L2. Details appear in Figure 1.
Local L1 controllers respond to both the requestor and the directory (to indicate whether the
cache line has been dropped or retained). Requestors issue a GETS on a read (Load/TLoad) miss
in order to get a copy of the data, a GETX on a normal write (Store) miss/upgrade in order to gain
exclusive access and an updated copy (in case of a miss), and a TGETX on a transactional store
(TStore) miss/upgrade.
A TStore results in a transition to the TMI state in the L1 cache (encoded by setting both the
T bit and the MESI dirty bit—Figure 2). A TMI line reverts to M on commit (propagating the
speculative modiﬁcations) and to I on abort (discarding speculative values). On the ﬁrst TStore to
a line in M, TMESI writes back the modiﬁed line to L2 to ensure subsequent Loads get the latest
non-speculative version. To the directory, the local TMI state is analogous to the conventional E
state. The directory realizes that the processor can transition to M (silent upgrade) or I (silent
eviction), and any data request needs to be forwarded to the processor to detect the latest state. The
only modiﬁcation required at the directory is the ability to support multiple speculative writers. We
do this by extending the existing support for multiple sharers and use the modiﬁed bit to distinguish
between the possibility of multiple readers and multiple writers.
We add requestors to the sharer list when they issue a TGETX request and ping all of them on
other requests. On remote requests for a TMI line, the L1 controller sends a Threatened response.
In addition to transitioning the cache line to TMI, a TStore also updates the Wsig. TLoad likewise
updates the Rsig. TLoads when threatened move to the TI state, encoded by setting the T bit
when in the I (invalid) state. TI lines must revert to I on commit or abort, because if a remote
processor commits its speculative TMI block, the local copy could go stale. The TI state appears
as a conventional sharer to the directory. (Note that a TLoad from E or S can never be threatened;
the remote transition to TMI would have moved the line to I. Unlike RTM, which keeps track of
transactional read sets within the cache state, FlexTM is able to eliminate RTM’s extra states by
using separate read signatures.)
[Figure 1 state diagram: MESI states M, E, S, I and PDI states TMI, TI, with COMMIT and ABORT
transitions out of the PDI states.]
State Encoding
State   M bit   V bit   T bit
M       1       0       0
E       1       1       0
S       0       1       0
I       0       0       0
TMI     1       0       1
TI      0       0       1
Responses to requests that hit in Wsig or Rsig
Request Msg   Hit in Wsig   Hit in Rsig
GETX          Threatened    Invalidated
TGETX         Threatened    Exposed-Read
GETS          Threatened    Shared
Requestor CST set on coherence message
Local op.   Response: Threatened   Response: Exposed-Read
TLoad       R-W                    -
TStore      W-W                    W-R
Figure 1: Dashed boxes enclose the MESI and PDI subsets of the state space. Notation on transitions is
conventional: the part before the slash is the triggering message; after is the resulting action (‘–’ means
none). GETS indicates a request for a valid sharable copy; GETX for an exclusive copy; TGETX for a copy
that can be speculatively updated with TStore. X stands for the set {GETX, TGETX}. “Flush” indicates a
data block response to the requestor and directory. S indicates a Shared message; T a Threatened message.
Plain, they indicate a response by the local processor to the remote requestor; parenthesized, they indicate
the message that accompanies the response to a request. An overbar means logically “not signaled”.
FlexTM enforces the single-writer or multiple-reader invariant for non-transactional lines. For
transactional lines, it ensures that (1) TStores can only update lines in TMI state, and (2) TLoads
that are threatened can cache the block only in TI state. Software is expected to ensure that at most
one of the conﬂicting transactions commits. It can restore coherence to the system by triggering an
Abort on the remote transaction’s cache, without having to re-acquire exclusive access to store sets.
Previous lazy protocols [7, 19] forward invalidation messages to the sharers of the store set and
enforce coherence invariants at commit time. In contrast, TMESI forwards invalidation messages
at the time of the individual TStores, and arranges for concurrent transactional readers (writers) to
use the TI (TMI) state; software can then control when (and if) invalidation takes place, without
the need for bulk coherence messages.
Transaction commit is requested with a special variant of the CAS (compare-and-swap) instruc-
tion. Like a normal CAS, CAS-Commit fails if it does not ﬁnd an expected value in memory. It
also fails if the caller’s W-W or W-R CST is nonzero. As a side effect of success, it simultane-
ously reverts all local TMI and TI lines to M and I, respectively (achieved by ﬂash clearing the T
bits). On failure, CAS-Commit leaves transactional state intact in the cache. Software can clean up
transactional state by issuing an Abort to the controller, which then reverts all TMI and TI lines to
I (achieved by conditionally clearing the M bits based on the T bits and then flash clearing the T
bits).
[Figure 2 diagram: processor core (user registers; Rsig and Wsig signatures; R-W, W-R, and W-W CSTs;
AOU control; PDI control; CMPC; AbortPC), private L1 cache controller (L1 D$ with tag, state, A, T,
and data fields; overflow table controller with Thread Id, Osig, Over. Count, Comm./Spec., V. Base,
P. Base, # Sets, # Ways), and shared L2 cache controller (L2$ with tag, state, sharer list, and data
fields; context switch support with RSsig, WSsig, and Cores Summary).]
Figure 2: FlexTM architecture overview (dark lines surround FlexTM-specific state).
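The effect of commit and abort on the cache state bits can be traced with a short Python sketch based on the state encoding in Figure 1. It is a software model under stated assumptions rather than the hardware implementation: the function names flash_commit, flash_abort, and cas_commit are ours, cache lines are modeled as (M, V, T) bit tuples, and the status values "active" and "committed" are placeholders for whatever the runtime stores in the word targeted by CAS-Commit.

```python
# (M bit, V bit, T bit) per state, copied from the State Encoding table in Figure 1.
ENCODING = {
    "M": (1, 0, 0), "E": (1, 1, 0), "S": (0, 1, 0),
    "I": (0, 0, 0), "TMI": (1, 0, 1), "TI": (0, 0, 1),
}
DECODING = {bits: name for name, bits in ENCODING.items()}


def flash_commit(lines):
    """Flash-clear every T bit: TMI becomes M, TI becomes I."""
    return [(m, v, 0) for (m, v, t) in lines]


def flash_abort(lines):
    """Clear M where T is set, then flash-clear T: TMI and TI both become I."""
    return [(0 if t else m, v, 0) for (m, v, t) in lines]


def cas_commit(memory, addr, expected, new, w_w, w_r, lines):
    """Model of CAS-Commit: returns (success, resulting cache lines)."""
    if memory[addr] != expected or w_w != 0 or w_r != 0:
        return False, lines               # failure leaves transactional state intact
    memory[addr] = new
    return True, flash_commit(lines)
```

Because both operations reduce to clearing bits across the whole cache, neither needs to walk the write set line by line in this model.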
Conﬂict Detection On forwarded L1 requests from the directory, the local cache controller
tests its read and write signatures and appends an appropriate message type to its response, as
shown in the table in Figure 1. Threatened indicates a write conflict (hit in the Wsig), Exposed-
Read indicates a read conflict (hit in the Rsig), and Shared or Invalidated indicates no conflict. On
a miss in the Wsig, the result from testing the Rsig is used; on a miss in both, the L1 cache responds
as in normal MESI. The local controller also piggybacks a data response if the block is currently
in M state. When it sends a Threatened or Exposed-Read message, a responder sets the bit corre-
sponding to the requestor in its R-W, W-W, or W-R CSTs, as appropriate. The requestor likewise
sets the bit corresponding to the responder in its own CSTs, as appropriate, when it receives the
response.
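The response selection just described amounts to a two-level lookup, which the following Python sketch spells out. The tables are transcribed from Figure 1; the function name respond, the dictionary names, and the use of None to stand for an ordinary MESI response are our own conventions for the example.

```python
# Response type per request, as (on Wsig hit, on Rsig hit); transcribed from Figure 1.
RESPONSES = {
    "GETX":  ("Threatened", "Invalidated"),
    "TGETX": ("Threatened", "Exposed-Read"),
    "GETS":  ("Threatened", "Shared"),
}

# CST that the requestor sets, keyed by (local operation, response received).
REQUESTOR_CST = {
    ("TLoad", "Threatened"): "R-W",
    ("TStore", "Threatened"): "W-W",
    ("TStore", "Exposed-Read"): "W-R",
}


def respond(request, hit_wsig, hit_rsig):
    """Message type the responder appends; the Wsig test takes priority over Rsig."""
    on_wsig, on_rsig = RESPONSES[request]
    if hit_wsig:
        return on_wsig
    if hit_rsig:
        return on_rsig
    return None   # miss in both signatures: respond as in normal MESI
```

For instance, a TGETX that misses in the responder's Wsig but hits in its Rsig yields Exposed-Read, and a requestor that issued the TStore then sets its W-R table.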
3.4. Explicit Aborts: AOU
The Alert-On-Update (AOU) mechanism, borrowed from RTM [39], supports synchronous noti-
ﬁcation of conﬂicts. To use AOU, a program marks (ALoads) one or more cache lines, and the
cache controller effects a subroutine call to a user-speciﬁed handler if the marked line is invali-
dated. Alert traps require simple additions to the processor pipeline. Modern processors already
include trap signals between the Load-Store-Unit (LSU) and Trap-Logic-Unit (TLU) [48]. AOU
adds an extra message to this interface and an extra mark bit, ‘A’, to each line in the L1 cache. (An
overview of the FlexTM hardware required in the processor core, the L1 controller, and the L2
controller appears in Figure 2.) RTM used AOU to detect software-induced changes to (a) trans-
action status words (indicating an abort) and (b) the metadata associated with objects accessed in
a transaction (indicating conflicts). FlexTM requires AOU support for only one cache line (the
transaction status word; see Section 4.1) and can therefore use the simpliﬁed hardware mechanism
(avoiding the bit per cache tag) as proposed in [43]. More general AOU support might still be
useful for non-transactional purposes.
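The control flow of an alert can be sketched in software as follows. This Python model is only an analogy for the hardware mechanism: the class AlertingCache and the method on_remote_invalidate are hypothetical names, the method names set_handler, aload, and arelease mirror the instruction names of the AOU interface, and the model assumes that a line stays marked after an alert fires, which the text does not specify.

```python
class AlertingCache:
    """Toy model of an L1 cache with per-line alert marks."""

    def __init__(self):
        self.handler = None        # user-specified alert handler
        self.alert_lines = set()   # addresses of lines whose alert bit is set

    def set_handler(self, fn):
        self.handler = fn

    def aload(self, memory, addr):
        """Load a value and mark its cache line for alerts."""
        self.alert_lines.add(addr)
        return memory[addr]

    def arelease(self, addr):
        """Unmark a line; later invalidations no longer raise an alert."""
        self.alert_lines.discard(addr)

    def on_remote_invalidate(self, addr):
        """Called when coherence invalidates a line; alerts only if it is marked."""
        if addr in self.alert_lines and self.handler is not None:
            self.handler(addr, "remote write")
```

In FlexTM's usage only one line, the transaction status word, would be marked, so a remote write to that word diverts the thread to its handler without any polling.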
3.5. Extending FlexTM
Support for Multi-threading Multi-threaded cores pose two main challenges to FlexTM:
each thread’s transactional state needs to be disambiguated from other threads on the same core,
and conﬂicts need to be detected among these threads.
To disambiguate each thread’s transaction state, per-thread signatures, CSTs, AOU, and PDI
control registers must be included. Similarly, the alert bit per cache line must be replicated. Spec-
ulatively written state is more challenging—a cache block can buffer only a single thread’s specu-
lative write and a conventional L1 cache is allowed to buffer one copy of a cache block. To permit
each thread to cache speculatively modiﬁed data, we must include a thread id along with the “T”-
bit, and use it to indicate which thread speculatively wrote the speciﬁc block. We could then allow
each L1 set to buffer multiple versions of the same cache block, or alternatively, we could buffer
only a single version of the cache block in the L1 for one of the hardware threads and use the
overﬂow mechanisms discussed in Section 5 to maintain the other versions in each thread’s private
overﬂow region. Overﬂow handling must also be modiﬁed to allow per thread state.
Conﬂict detection between the hardware threads is challenging to handle because there are no
coherence accesses within a core. Fortunately, each hardware thread’s signature is maintained in
the same core and we can query the signatures of other threads to detect conﬂicts.
Snooping Protocols Accommodating broadcast snooping protocols within FlexTM is
straightforward. Consider a protocol in which the L1 cache broadcasts its requests to other L1s and
to the shared L2 cache on an ordered network. The additional state required by FlexTM remains
the same. The only change required is to the L1 cache response mechanism. Typically snooping
protocols implement response messages using a wired-OR signal that cannot identify the sharing
processors. We would need to include extra signal lines to encode the actual identities.
Nesting While FlexTM allows operations to escape transactional semantics with normal loads
and stores, it handles nested transactions via the subsumption model. The key challenge to sup-
porting true nesting is disambiguating between the hardware state (e.g., speculative lines in the
cache) of each nested level. We could include limited support for nesting using techniques like
split hardware transactions [26] but their performance overheads need further investigation.
4. Hardware/Software Interface
In this section, we discuss the interface provided by each FlexTM component, provide the pseudo-
code for the main TM runtime macros, and discuss their usage.
Tables 2 and 3 list the instructions and registers required by the AOU and PDI mechanisms.
AOU's interface includes special registers to hold the address of the user-mode handler and a
description of the current alert, and instructions to set and unset the user-mode handler and to
mark and unmark cache lines (i.e., to set and clear their alert bits). PDI's interface includes
support for speculative reads (TLoads) and writes (TStores), which are interpreted as speculative
when the hardware transaction bit (%hardware_t) is set. CAS-Commit enables the software runtime to
couple the logical commit of the transaction in software with the committing of the speculative
hardware state. The CAS-Commit instruction performs the usual function of compare-and-swap.
In addition, if the CAS succeeds, TMI lines revert to M, making their data visible to other readers
through normal coherence actions. If the CAS fails, the buffered state remains in the local cache
for software to handle appropriately. Buffered state can be eliminated and coherence restored by
issuing an Abort.

Registers
  %aou_handlerPC       address of the handler to be called on a user-space alert
  %aou_oldPC           PC immediately prior to the call to %aou_handlerPC
  %aou_alertAddress    address of the line whose status change caused the alert
  %aou_alertType       remote write, lost alert, or capacity/conflict eviction
Instructions
  set_handler %r       move %r into %aou_handlerPC
  clear_handler        clear %aou_handlerPC and flash-clear the alert bits for all cache lines
  aload [%r1], %r2     load the word at address %r1 into register %r2, and set the alert bit(s)
                       for the corresponding cache line
  arelease %r          unset the alert bit for the cache line that corresponds to the address in
                       register %r
  arelease_all         flash-clear alert bits on all cache lines
Table 2: Alert-on-Update Software Interface.

Registers
  %t_in_flight                 a bit to indicate that a transaction is currently executing
Instructions
  begin_t                      set the %t_in_flight register to indicate the start of a transaction
  tstore [%r1], %r2            write the value in register %r2 to the word at address %r1; isolate
                               the line (TMI state)
  tload [%r1], %r2             read the word at address %r1, place the value in register %r2, and
                               tag the line as transactional
  abort                        discard all isolated (TMI or TI) lines; clear all transactional tags
                               and reset the %t_in_flight register
  cas-commit [%r1], %r2, %r3   compare %r2 to the word at address %r1; if they match, commit all
                               isolated writes (TMI lines) and store %r3 to the word; otherwise
                               discard all isolated writes; in either case, clear all transactional
                               tags, discard all isolated reads (TI lines), and reset the
                               %t_in_flight register
Table 3: Programmable-Data-Isolation software interface.
CSTs and signatures are treated as registers that can be loaded from and stored to memory.
Software can set them up to operate in two modes (Eager or Lazy) for each type of conflict (W-W,
W-R, or R-W). In Eager mode, a conflicting coherence message updates the CST and triggers
the handler; in Lazy mode, a conflict updates the CST but does not trigger the handler, and software
reads the CSTs when it desires.
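The difference between the two modes can be illustrated with a small software model. This Python sketch is hypothetical: ConflictTracker, Mode, and read_and_clear are names invented for the example, and each CST is modeled as an integer bitmap with one bit per remote processor.

```python
from enum import Enum


class Mode(Enum):
    LAZY = 0
    EAGER = 1


class ConflictTracker:
    """Per-processor CSTs plus a software-chosen mode for each conflict type."""

    KINDS = ("R-W", "W-R", "W-W")

    def __init__(self, modes, handler):
        self.modes = dict(modes)                  # conflict type -> Mode
        self.cst = {k: 0 for k in self.KINDS}     # one bit per remote processor
        self.handler = handler                    # software conflict handler

    def on_conflict(self, kind, remote_proc):
        self.cst[kind] |= 1 << remote_proc        # the CST is updated in both modes
        if self.modes[kind] is Mode.EAGER:
            self.handler(kind, remote_proc)       # Eager: invoke the handler immediately

    def read_and_clear(self, kind):
        # Lazy: software inspects the CST whenever it chooses.
        bits, self.cst[kind] = self.cst[kind], 0
        return [p for p in range(bits.bit_length()) if (bits >> p) & 1]
```

In this model a Lazy conflict costs only a bit update at detection time; all policy work is deferred until software calls read_and_clear.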
4.1. Bounded Transactions
In this section, we discuss the execution of transactions with bounded access sets that ﬁt within
an OS quantum. We assume a subsumption model for nesting, with support for transactional
pause [47], which suspends a transaction in order to perform non-transactional activity.

Name             Description
TSW              active / committed / aborted
State            running / suspended
Rsig, Wsig       Signatures
R-W, W-R, W-W    Conflict Summary Tables
OT               Pointer to Overflow Table descriptor
AbortPC          Handler address for AOU on TSW
CMPC             Handler address for Eager conflicts
E/L              Eager(1)/Lazy(0) conflict resolution
Table 4: Transaction Descriptor contents. All fields except TSW and State are cached in hardware
registers for running transactions.
A FlexTM transaction is represented by a software descriptor (Table 4). This descriptor includes
a status word, space for buffering the hardware state when paused (CSTs, signatures, and overflow
control registers), pointers to the abort (AbortPC) and contention management (CMPC) handlers, and
a field to specify the conflict resolution mode of the transaction.
A transaction is delimited by BEGIN_TRANSACTION and END_TRANSACTION macros (see
Figure 3). BEGIN_TRANSACTION establishes the conﬂict and abort handlers for the transaction,
checkpoints the processor registers, conﬁgures per-transaction metadata, sets the transaction status
word (TSW) to active, and ALoads that word (for notiﬁcation of aborts). Some of these opera-
tions are not intrinsically required and can be set up for the entire lifetime of a thread (e.g., AbortPC
and CMPC). END_TRANSACTION aborts conﬂicting transactions and tries to atomically update
the status word from active to committed using CAS-Commit.
Within a transaction, the processor issues TLoads and TStores when it expects transactional
semantics, and conventional loads and stores when it wishes to bypass those semantics. TLoads
and TStores are interpreted as speculative when the hardware transaction bit (%hardware_t) is
set. This convention facilitates code sharing between transactional and non-transactional program
fragments. Ordinary loads and stores can be requested within a transaction; these could be used to
implement escape actions, update software metadata, or reduce the cost of thread-private updates
in transactions that overﬂow cache resources. In order to avoid the need for compiler generation of
the TLoads and TStores, our prototype implementation follows typical HTM practice and interprets
ordinary loads and stores as TLoads and TStores when they occur within a transaction.
Transactions of a given application can employ either Eager or Lazy conﬂict resolution. In
Eager mode, when conﬂicts appear through response messages (i.e., Threatened and Exposed-
Read), the processor effects a subroutine call to the handler speciﬁed by CMPC. The conﬂict
manager either stalls the requesting transaction or aborts one of the conﬂicting transactions. The
remote transaction can be aborted by atomically CASing its TSW from active to aborted,
thereby triggering an alert (since the TSW is always ALoaded). FlexTM supports a wide variety of
conﬂict management policies (even policies that desire the ability to synchronously abort a remote
transaction). When an Eager transaction reaches its commit point, its CSTs will be empty, since
all prior conﬂicts will have been resolved. It attempts to commit by executing a CAS-Commit on its
TSW. If the CAS-Commit succeeds (replacing active with committed), the hardware flash-commits
all locally buffered (TMI) state. The CAS-Commit will fail, leaving the buffered state local,
if the CAS does not find the expected value (a remote transaction managed to abort the committing
transaction before the CAS-Commit could complete).
In Lazy mode, transactions are not alerted into the conﬂict manager. The hardware simply
updates requestor and responder CSTs. To ensure serialization, a Lazy transaction must, prior to
committing, abort every concurrent transaction that conﬂicts with its write-set. It does so using the
END_TRANSACTION() routine shown in Figure 3.

BEGIN_TRANSACTION()
1. clear Rsig and Wsig
2. set AbortPC
3. set CMPC
4. TSW[my_id] = active
5. ALoad(TSW[my_id])
6. begin_t

END_TRANSACTION() /* Non-blocking, pre-emptible */
1. if (TSW[my_id] != active) goto AbortPC
2. copy-and-clear W-R and W-W registers
3. foreach i set in W-R or W-W
4.     abort_id = manage_conflict(my_id, i)
5.     if (abort_id != NULL) // not resolved by waiting
6.         CAS(TSW[abort_id], active, aborted)
7. CAS-Commit(TSW[my_id], active, committed)
8. if (TSW[my_id] == active) // failed due to nonzero CST
9.     goto 1

Figure 3: Pseudocode of BEGIN_TRANSACTION and END_TRANSACTION (for Lazy transactions).
All of the work for the END_TRANSACTION() routine occurs in software, with no need for
global arbitration [7, 9, 19], blocking of other transactions [19], or special hardware states. The
routine begins by using a copy and clear instruction (e.g., clruw on the SPARC) to atomically
access its own W-R and W-W. In lines 3–6 of Figure 3, for each of the bits that was set, transaction
T aborts the corresponding transaction R by atomically changing R’s TSW from active to
aborted. Transaction R, of course, could try to CAS-Commit its TSW and race with T, but
since both operations occur on R’s TSW, conventional cache coherence guarantees serialization.
After T has successfully aborted all conﬂicting peers, it performs a CAS-Commit on its own status
word. If the CAS-Commit fails and the failure can be attributed to a non-zero W-R or W-W (i.e., new
conflicts), the END_TRANSACTION() routine is restarted. In the case of an R-W conflict, no action
is needed since T is the reader and is about to serialize before the writer (i.e., the two transactions
can commit concurrently). Software mechanisms can be used to disambiguate conﬂicts and avoid
spurious aborts when the writer commits.
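The commit loop can be mimicked in a few lines of Python. This is a software model under stated assumptions, not the hardware protocol: a lock stands in for the atomicity that cache coherence provides, CSTs are integer bitmaps, and the names Runtime, cas_commit, and end_transaction are hypothetical. In the model, cas_commit fails when the status word is no longer active or when new W-R or W-W bits have appeared since the last copy-and-clear.

```python
import threading

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"


class Runtime:
    def __init__(self, n):
        self.tsw = [ACTIVE] * n          # transaction status words
        self.w_r = [0] * n               # W-R CST bitmaps, one per transaction
        self.w_w = [0] * n               # W-W CST bitmaps
        self._lock = threading.Lock()    # stands in for hardware atomicity

    def cas(self, i, expected, new):
        with self._lock:
            if self.tsw[i] == expected:
                self.tsw[i] = new
                return True
            return False

    def copy_and_clear(self, me):
        with self._lock:
            bits = self.w_r[me] | self.w_w[me]
            self.w_r[me] = self.w_w[me] = 0
            return bits

    def cas_commit(self, me):
        # Modeled failure cases: we were aborted, or new write conflicts arrived.
        with self._lock:
            if self.tsw[me] != ACTIVE or self.w_r[me] or self.w_w[me]:
                return False
            self.tsw[me] = COMMITTED
            return True


def end_transaction(rt, me, manage_conflict):
    while True:
        if rt.tsw[me] != ACTIVE:
            return False                         # aborted: caller jumps to AbortPC
        bits = rt.copy_and_clear(me)
        for i in range(bits.bit_length()):
            if (bits >> i) & 1:
                victim = manage_conflict(me, i)  # None means resolved by waiting
                if victim is not None:
                    rt.cas(victim, ACTIVE, ABORTED)
        if rt.cas_commit(me):
            return True
        # Otherwise retry: either we were aborted or new conflicts were recorded.
```

Note that the loop touches only the status words of transactions named in the committer's own CSTs, which is what allows non-conflicting transactions to commit in parallel without a central arbiter.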
The contention management policy (line 4) in the commit process is responsible for providing
various progress and performance guarantees. The TM system can choose to plug in an
application-specific policy. For example, a Timestamp manager [36] would ensure livelock
freedom. More recently, EazyHTM [45] has exploited CST-like bitmaps to accelerate a pure-HTM
commit, but does not allow pluggable policies. FlexTM's commit operation is entirely in
software and its latency is proportional to the number of conﬂicting transactions — in the absence
of conﬂicts there is no overhead. Even in the presence of conﬂicts, aborting each conﬂicting
transaction consumes only the latency of a single CAS operation (at most a coherence operation).
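As an illustration of a pluggable policy, the following Python sketch shows a deliberately simplified timestamp-ordered manager: the transaction that started later is the one chosen for abort. It is not the full Timestamp manager of [36]; the global counter, the start_time table, and the function names are assumptions made for this example.

```python
import itertools

_clock = itertools.count()   # hypothetical global logical clock
start_time = {}              # transaction id -> logical start time


def begin(tx_id):
    """Record the logical start time of a transaction."""
    start_time[tx_id] = next(_clock)


def manage_conflict(my_id, other_id):
    """Return the id of the transaction to abort: the younger of the two."""
    if start_time[my_id] < start_time[other_id]:
        return other_id
    return my_id
```

Under this rule the oldest active transaction is never selected as the victim, which is the intuition behind the livelock-freedom claim for timestamp-based policies.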
4.2. Mixed Conflict Resolution
While Lazy conﬂict resolution generally provides the best performance with its ability to exploit
concurrency and ensure progress [41], it does introduce certain challenges. In particular, it requires
a multiple-writer and/or multiple reader protocol that makes notable additions to a basic MESI
protocol. Multiple L1’s need to be able to concurrently cache a block and read and write it (quite
different from the basic “S” and “M” states). This is a source of additional complexity over an
Eager system and could prove to be a barrier to adoption.
Furthermore, write-write conﬂicts need to be conservatively treated as dueling read-write and
write-read conﬂicts since conﬂicts are detected using coherence actions and a transaction that ob-
tains permissions to write a block can also read it (the read will not result in a coherence action).
It is therefore not possible to allow both transactions to concurrently commit (one of them has to
abort). While commit-time conﬂict resolution in Lazy mode does try to ensure forward progress by
ensuring that the winning transaction is one that is already ready to commit, for some workloads,
it could also lead to signiﬁcant levels of wasted work due to delayed aborts (see the results for
STMBench7 in our work from ICS’09 [41]).
To avoid the wasted work and to simplify the design, we extend FlexTM to support Mixed-mode
conflict resolution [38, 41]. In Mixed mode, when write-write conflicts appear (a TStore
operation receives a threatened response), the processor effects a call to the contention manager.
On read-write or write-read conﬂicts, the hardware records the conﬂict in the CSTs and allows the
transaction to proceed. When the transaction reaches its commit point, it needs to take care of only
W-R conﬂicts (using an algorithm similar to Figure 3), as its W-W CST will be empty. Mixed mode
tries to save wasted work on write-write conﬂicts and to exploit the parallelism present in W-R and
R-W conﬂicts. However, it is also possible that since Mixed resolves write-write conﬂicts eagerly,
the transaction that wins the conﬂict and progresses will subsequently abort, thereby wasting work.
Mixed mode has more modest versioning requirements compared to Lazy mode. A system that
supports only Mixed mode and Eager mode can simplify the coherence protocol and overﬂow
mechanisms. Brieﬂy, Mixed maintains the single writer and/or multiple reader invariant: it al-
lows only one writer for a cache block (unlike Lazy mode) although the writer can co-exist with
concurrent readers (unlike Eager mode). At any given instant, there is only one speculative copy
accessed by the single writer and/or a non-speculative version accessed by the concurrent readers.
This simpliﬁes the design of the TMI state in the TMESI protocol. Only one of the L1 caches in
the system can have the line in TMI (not unlike the “M” state in MESI).
The more streamlined version of FlexTM (FlexTM-S) that we evaluate supports only Mixed
and Eager modes. In Section 5.3 (under FlexTM-S) we demonstrate another advantage of supporting
only Mixed and Eager modes: since they restrict the number of speculative writers to at most one,
writer conflicts may be precisely identified.
4.3. Strong Isolation
As implied in Figure 1, transactional and ordinary loads and stores to the same location can occur
concurrently. While we are disinclined to require strong isolation [4] as part of the user pro-
gramming model (it’s hard to implement on legacy hardware, and is of questionable value to the
programmer [12]), it can be supported at essentially no cost in HTM systems (FlexTM among
them), and we see no harm in providing it. If the GETX request resulting from a non-transactional
write miss hits in the responder's Rsig or Wsig, it aborts the responder's transaction, so the non-
transactional write appears to serialize before the (retried) transaction. A non-transactional read,
likewise, serializes before any concurrent transactions, because transactional writes remain invisi-
ble to remote processors until commit time (in order to enforce coherence, the corresponding cache
line, which is threatened in the response, is uncached).
5. Unbounded Space Support
For common case transactions that do not overﬂow the cache, signatures, CSTs, and PDI avoid
the need for logging or other per-access software overhead. To provide the illusion of unbounded
space, however, FlexTM must provide mechanisms to handle transactional state evicted from the
L1 cache. Cache evictions must be handled carefully. First, signatures rely on forwarded re-
quests from the directory to trigger lookups and provide conservative conﬂict hints (Threatened
and Exposed-Read messages). Second, TMI lines holding speculative values need to be buffered
and cannot be merged into the shared level of the cache. We ﬁrst describe our approach to han-
dling coherence-based conﬂict detection for evicted lines, followed by two alternative schemes for
versioning of evicted TMI lines.
5.1. Eviction of Transactionally Read Lines
Conventional MESI performs silent eviction of E and S lines to avoid the bandwidth overhead of
notifying the directory. In FlexTM, silent evictions of E, S, and TI lines also serve to ensure that
a processor continues to receive the coherence requests it needs to detect conﬂicts. (Directory
information is updated only in the wake of L1 responses to L2 requests, at which point any conﬂict
is sure to have been noticed.) When evicting a cache block in M, FlexTM updates the L2 copy but
does not remove the processor from the sharer list if there is a hit in the local signature. Processor
sharer information can, however, be lost due to L2 evictions. To preserve the access conﬂict
tracking mechanism, L2 misses result in querying all L1 signatures in order to recreate the sharer
list. This scheme is much like the sticky bits used in LogTM [31].
5.2. Overﬂow table (OT) Controller Design
FlexTM employs a per-thread overﬂow table (OT) to buffer evicted TMI lines. The OT is organized
as a hash table in virtual memory. It is accessed both by software and by an OT controller that sits
on the L1 miss path. The latter implements (1) fast lookups on cache misses, allowing software to
be oblivious to the overﬂowed status of a cache line, and (2) fast cleanup and atomic commit of
overﬂowed state.
The controller registers required for OT support appear in Figure 2. They include a thread
identiﬁer, a signature (Osig) for the overﬂowed cache lines, a count of the number of such lines, a
committed/speculative ﬂag, and parameters (virtual and physical base address, number of sets and
ways) used to index into the table.
On the ﬁrst overﬂow of a TMI cache line, the processor traps to a software handler, which al-
locates an OT, ﬁlls the registers in the OT controller, and returns control to the transaction. To
minimize the state required for lookups, the OT controller requires the OS to ensure that OTs of
active transactions lie in physically contiguous memory. If an active transaction’s OT is swapped
out, then the OS invalidates the Base-Address register in the controller. If subsequent activity
requires the OT, the hardware traps to a software routine that re-establishes a mapping. The hardware
needs to ensure that new TMI lines aren’t evicted during OT set-up; the L1 cache controller could
support this by ensuring that at least one entry in the set is free for non-TMI lines.
On a subsequent TMI eviction, the OT controller calculates the set index using the physical
address of the line, accesses the set tags of the OT region to ﬁnd an empty way, and writes the data
block back to the OT instead of the L2. The controller tags the line with both its physical address
(used for associative lookup) and its virtual address (used to accommodate page-in at commit
time; see below). The controller also adds the physical address to the overﬂow signature (Osig) and
increments the overﬂow count.
The Osig summarizes the entries in the OT and provides quick lookaside for L1 misses. All L1
misses check the Osig and hits trigger an OT lookup in parallel with the L2 access. If the block is
found in the OT, the hardware fetches it and overrides the L2 response. Osig is also needed to ensure
atomic write-back of data buffered in the OT. When a transaction commits, it exposes the Osig to
forwarded coherence requests. Coherence requests that hit in the Osig indicate that buffered data
in the OT are possibly being copied back to the original location. The response to the request can
be dealt with in two ways: the controller could either perform lookup in the OT and respond with
the data block or it could NACK the request until copyback completes; our current implementation
does the latter.
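The eviction and lookup paths of the OT can be modeled in software as follows. This Python sketch is illustrative only: the table geometry and 64-byte line size are assumed, an exact Python set stands in for the Bloom-filter Osig, and the class and method names are hypothetical.

```python
class OverflowTable:
    """Set-associative table of evicted speculative (TMI) lines."""

    LINE = 64  # bytes per cache line (assumed)

    def __init__(self, sets=4, ways=2):
        self.sets, self.ways = sets, ways
        self.table = [[None] * ways for _ in range(sets)]
        self.osig = set()    # exact set standing in for the Osig signature
        self.count = 0       # number of overflowed lines

    def _index(self, paddr):
        # The set index is computed from the physical address of the line.
        return (paddr // self.LINE) % self.sets

    def evict_tmi(self, paddr, vaddr, data):
        row = self.table[self._index(paddr)]
        for w, entry in enumerate(row):
            if entry is not None and entry[0] == paddr:
                row[w] = (paddr, vaddr, data)      # line re-evicted: update in place
                return
        for w, entry in enumerate(row):
            if entry is None:
                row[w] = (paddr, vaddr, data)      # tag with physical and virtual address
                self.osig.add(paddr)
                self.count += 1
                return
        raise RuntimeError("OT set full: trap to the OS to expand the table")

    def lookup_on_miss(self, paddr, l2_data):
        if paddr not in self.osig:                 # Osig miss: no OT access needed
            return l2_data
        for entry in self.table[self._index(paddr)]:
            if entry is not None and entry[0] == paddr:
                return entry[2]                    # OT data overrides the L2 response
        return l2_data                             # signature false positive
```

With a real signature in place of the set, the final return statement is the path taken on a false positive: the OT is searched in vain and the L2 response is used unchanged.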
In addition to functions previously described, the CAS-Commit operation sets the Committed
bit in the controller’s OT state. This indicates that the OT content should be visible, activating
NACKs or lookups. At the same time, the controller initiates a microcoded copyback operation.
To accommodate page evictions of the original locations, OT tags include the virtual addresses
of cache blocks. These addresses are used during copyback to ensure automatic page-in of any
nonresident pages.
There are no constraints on the order in which lines from the OT are copied back to their natural
locations. This stands in contrast to time-based logs [31], which must proceed in reverse order of
insertion. Remote requests need to check only committed OTs (since speculative lines are private)
and for only a brief span of time (during OT copy-back). On aborts, the OT is reclaimed, to be
cleaned up for use by another transaction. The next overﬂowed transaction allocates a new OT.
When an OT overﬂows a way, the hardware generates a trap to the OS, which expands the OT
appropriately.
Although we require that OTs be physically contiguous for simplicity, they can themselves be
paged. A more ambitious FlexTM design could allow physically non-contiguous OTs, with con-
troller access mediated by more complex mapping information. With the addition of the OT con-
troller, software is involved only for the allocation and deallocation of the OT structure. Indirection
to the OT on misses, while unavoidable, is performed in hardware rather than in software, thereby
reducing the resulting overheads. Furthermore, FlexTM’s copyback is performed by the controller
and occurs in parallel with other useful work on the processor.
5.3. Software Metadata Cache (SM-Cache) Approach
The OT controller mechanism just described consists of a hardware state machine that maintains
a write buffer (organized as a hash table) in a software-allocated region. There is implementation
complexity associated with the state machine that searches (writes back and reloads) and accesses
data blocks without any help from software.
In this section, we propose a more streamlined mechanism used in the FlexTM-S design. We
move the actions of maintaining the data structure and performing the redo on commit to software,
replacing the hash table with buffer pages and introducing a metadata cache that enables hardware
to access the buffer pages without software intervention. Figure 4 shows the per-page software
metadata, which speciﬁes the buffer-page address and for each cache block, the writer transaction
id (Tx id) and a “V/I” bit to indicate if the buffer block is buffering valid data. To convey the
metadata information to hardware and accelerate repeated block accesses, we install a metadata
cache (SM-cache) on the L1 miss path (see Figure 5).
[Figure 4 (diagram): page-granularity metadata consisting of the data virtual address (Data V. addr),
the buffer virtual address (Buffer V. addr), and one (Tx Id, V/I) pair, giving the writer transaction
and valid bit, for each cache line in the page (1st through Nth).]
Data V. addr is the virtual page number of the original location. Buffer V. addr is the virtual
page number of the buffer page. Each (Tx id, V/I) pair denotes the writer transaction's id and
the validity of the buffered cache block; an array of such pairs is associated with the page.
When accessed by transaction T, a pair has the following meaning:
(don't care, 0): the buffer-page cache block is empty;
(X, 1) with X ≠ T: T conflicts with writer transaction X;
(T, 1): T has speculatively written the block and evicted it.
Figure 4: Metadata for pages that have overflowed state.
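The three cases in the caption of Figure 4 can be written out as a small Python sketch. The record layout below is an assumption for illustration (64 blocks per page, i.e., a 4 KB page with 64-byte blocks), and the names BlockMeta, PageMeta, and classify are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCKS_PER_PAGE = 64  # assumed: 4 KB page, 64-byte cache blocks


@dataclass
class BlockMeta:
    tx_id: Optional[int] = None   # writer transaction id ("Tx id")
    valid: bool = False           # the "V/I" bit


@dataclass
class PageMeta:
    data_vpage: int               # virtual page number of the original location
    buffer_vpage: int             # virtual page number of the buffer page
    blocks: List[BlockMeta] = field(
        default_factory=lambda: [BlockMeta() for _ in range(BLOCKS_PER_PAGE)]
    )


def classify(page, block, t):
    """Interpret the (Tx id, V/I) pair of one block as seen by transaction t."""
    m = page.blocks[block]
    if not m.valid:
        return "empty"            # (don't care, 0)
    if m.tx_id == t:
        return "own"              # (T, 1): t wrote the block and evicted it
    return "conflict"             # (X, 1), X != T
```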
[Figure 5 (diagram): a processor core with its private L1 cache (fields Tag, State, A, T, Data). On the
L1 miss/writeback path sit the overflow Wsig and the SM-cache, whose entries hold Tag1 (metadata
P. addr), the metadata, and Tag2 (data V. addr). Entries are inserted and removed, and the SM-cache
participates in coherence.]
Figure 5: Simplified overflow support with SM-cache. Dashed lines surround the new extension that
replaces the OT controller (see Figure 2).
When a speculatively written cache line is evicted, the cache controller looks up the SM-cache
for the metadata and uses the buffer page address to index into the TLB (for the buffer page's
physical address²) for writeback redirection. Multiple transactions that are possibly writing different
cache blocks on the same page can share the same buffer page. A miss in the SM-cache triggers
a software handler that allocates the buffer page metadata and reloads the SM-cache. To provide
the commit handler with the virtual address of the cache block to be written back, every SM-cache
entry includes this information and is virtually indexed (note that the data cache is still physically
indexed). While the entire buffer page is allocated when a single cache block in the original page
is evicted, the individual buffer page cache blocks are used only as and when further evictions
occur. This ensures that the overflow mechanism adds overhead proportional to the number of cache
blocks that are evicted (similar to the OT controller mechanism). In contrast to this design, other
page-based overflow mechanisms (e.g., XTM [11] and PTM [10]) clone the entire page if at least
a single cache block on the page is evicted.
With data buffered, L1 misses now need to ensure that data is obtained from the appropriate
location (buffer page or original). As in the OT controller design, we use an overflow signature
(Osig) to summarize addresses of evicted blocks and elide metadata checks. L1 misses check the
Osig, and signature hits require a metadata check. If the metadata indicates that transaction T
accessing the location had written the block (i.e., V/I bit is 1 and Tx id=T), then hardware fetches
the buffer block and overrides the L2 response. It also unsets the V/I bit to indicate that the buffer
block is no longer valid (block is present in the cache). Otherwise, the coherence response message
dictates the action. On eviction of a speculatively written cache line that another transaction has
written and overflowed as well (i.e., V/I bit is 1 and Tx id=X, X ≠ T), a handler is invoked that either
allocates a new buffer page and refills the SM-cache or resolves the conflict immediately. The
former design supports multiple writers to the same location (and enables Lazy conflict resolution),
while the latter forces eager write-write conflict resolution, but enables a simpler design. The Tx id
field supports precise detection of writer conflicts (see the FlexTM-S design below).
When a transaction commits, it copy-updates the original locations using software routines. To
ensure atomicity, the transaction updates its status word to inform concurrent accesses to hold off
until the copy-back completes. It then iterates through the metadata of the various buffer pages in
the working set and copies back the cache blocks that it has written.
SM-Cache The SM-cache stores metadata that hardware can use to accelerate block access
and cache evictions without software intervention. It resides on the L1 miss path. On an Osig hit
the SM-cache is looked up in parallel with the L2 lookup (see Figure 5). SM-cache misses are
handled entirely by software handlers that index into it using the virtual page address. The L1
controller also uses a similar technique to obtain metadata for redirecting evictions and reloads.
The metadata may be concurrently updated if different speculative cache blocks in the page are
evicted at multiple processor sites. To ensure metadata consistency, the SM-cache participates in
coherence using the physical address of the metadata. This physical address tag is inaccessible
to software and is automatically filled by the hardware when an entry is allocated. The dual-tagging
of the SM-cache introduces the possibility that the two tags (virtual address of page and
physical address of metadata) might not map to the same set index. We solve this with tag array
pointers [17].
² Virtual page synonyms are cases where multiple virtual pages point to the same physical frame and a
thread can access the same location with different virtual addresses. To resolve these, since software
knows about the pages that are synonyms, it ensures that the SM-cache is loaded with the same metadata
for all the virtual synonym pages.
FlexTM-S To evaluate the performance of the SM-cache approach, we developed FlexTM-S.
For bounded transactions, it leverages the hardware presented in Section 3, but it omits support for
Lazy conﬂict resolution.
Compared to FlexTM, FlexTM-S (1) simpliﬁes hardware support for the versioning mechanism
by trading in FlexTM’s overﬂow hardware controller for an SM-cache (software metadata cache)
and (2) allows precise detection of conﬂicting writers. By restricting support to Mixed and Eager
modes, i.e., allowing only one speculative writer, the coherence protocol is also simpliﬁed.
To ensure low overhead for detecting conﬂicting readers, FlexTM-S uses the Rsig for both over-
ﬂowed and cached state. To identify writer transactions, it uses a two-level scheme: if the specu-
lative state resides in the cache, the response message from the conﬂicting processor identiﬁes the
transaction (the CST bits will identify the conﬂicter’s id). If the speculative state has been evicted
then the Osig membership tests will indicate the possibility of a conﬂict. This type of conﬂict is also
encoded in the response message. If an Osig conﬂict is indicated, the requester checks the metadata
for precise disambiguation, thereby eliminating false positives. Since a block can be written by
only one transaction (Mixed/Eager invariant), the Tx id in the metadata precisely identiﬁes the
writer. If the metadata indicates no conﬂict, software loads the SM-cache instructing hardware to
ignore the Osig response and allows the transaction to proceed. Thus the metadata for versioning
helps to disambiguate writer transactions which (1) helps identify the conﬂicting writer precisely
and (2) allows progress of non-conﬂicting transactions, which would have otherwise required con-
tention management (in Eager mode) due to signature false-positives.
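The two-level writer identification just described reduces to a short decision procedure. The Python function below is a hypothetical sketch: its parameters abstract the coherence response (cached_writer), the remote Osig test (osig_hit), and the software metadata lookup (meta_writer), none of which are named this way in the design.

```python
def identify_writer(requester, cached_writer, osig_hit, meta_writer):
    """Return the conflicting writer's transaction id, or None if there is none.

    cached_writer: id reported in the coherence response when the speculative
                   state is still cached remotely, else None
    osig_hit:      True if the response indicates an Osig membership hit
    meta_writer:   Tx id from the metadata when the V/I bit is 1, else None
    """
    if cached_writer is not None:
        return cached_writer        # level 1: the response names the writer
    if not osig_hit:
        return None                 # no overflowed writer either
    if meta_writer is None or meta_writer == requester:
        return None                 # Osig false positive: let the requester proceed
    return meta_writer              # level 2: metadata names the writer precisely
```

The third branch is where the metadata pays off: without it, every Osig false positive would have to be handed to the contention manager.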
5.4. Handling OS Page Evictions
The two challenges left to consider are (1) eviction of a page from physical memory and reuse of
its frame for a different page in the application, and (2) re-mapping of a page to a different
frame. Since signatures are built using physical addresses, (1) can lead to false positives, which
can cause spurious aborts but not correctness issues. In a more ambitious design, we could address
these challenges with virtual address-based conﬂict detection for non-resident pages.
For (2) we adapt a solution ﬁrst proposed in LogTM-SE [46]. At the time of the unmap, active
transactions are interrupted both for TLB entry shootdown (already required) and to ﬂush TMI
lines to the OT. When the page is assigned to a new frame, the OS interrupts all the threads that
mapped the page and tests each thread's Rsig, Wsig, and Osig for the old address of each block. If
the block is present, the new address is inserted into the signatures. Fortunately, since a typical
page (4/8KB) contains only about 64–128 cache lines, this doesn’t impose signiﬁcant overhead
compared to the cost of page eviction. Finally, there are differences in the support required from
the paging mechanism for the OT controller approach and the SM-Cache approach. The former
indexes into the overﬂow table using the physical address and requires the paging mechanism to
update the tags in the table entries with the new physical address. The latter needs no additional
support since it uses the virtual address of the buffer page and at the time of writeback indexes into
the TLB to obtain the current physical address.
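The signature update performed when a page moves to a new frame can be sketched as follows. The Python model assumes 4 KB pages and 64-byte lines, so the loop visits 4096 / 64 = 64 blocks, and it uses exact sets as stand-ins for the Bloom-filter signatures; remap_signatures is a hypothetical name.

```python
PAGE, LINE = 4096, 64   # assumed page and cache-line sizes in bytes


def remap_signatures(signatures, old_frame, new_frame):
    """Insert new-frame addresses for every block found under its old address.

    signatures: iterable of sets standing in for a thread's Rsig, Wsig, and Osig.
    Returns the number of insertions performed.
    """
    inserted = 0
    for off in range(0, PAGE, LINE):
        old, new = old_frame + off, new_frame + off
        for sig in signatures:
            if old in sig:
                sig.add(new)    # old entry stays: Bloom filters cannot delete
                inserted += 1
    return inserted
```

The work is bounded by the number of blocks per page times the number of signatures tested, which is why the cost is small relative to the page eviction itself.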
6. Context Switch Support
STMs provide effective virtualization support because they maintain conflict detection and
versioning state in virtualizable locations and use software routines to manipulate them. For common
case transactions, FlexTM uses scalable hardware support to bookkeep the state associated with
access permissions, conﬂicts, and versioning while controlling policy in software. In the presence
of context switches, FlexTM detaches the transactional state of suspended threads from the hard-
ware and manages it using software routines. This enables support for transactions to extend across
context switches (i.e., to be unbounded in time [1]).
Ideally, only threads whose accesses overlap with the read and write set of suspended trans-
actions should bear the software routine overhead. Both FlexTM and FlexTM-S handle context
switches in a similar manner. To remember the accesses of descheduled threads, FlexTM main-
tains two summary signatures, RSsig and WSsig, at the directory of the system. When suspending a
thread in the middle of a transaction, the OS unions (i.e., ORs) the signatures (Rsig and Wsig) of the
suspended thread into the current RSsig and WSsig installed at the directory.³
Once the RSsig and WSsig are up to date, the OS invokes hardware routines to merge the current
transaction’s hardware state into virtual memory. This hardware state consists of (1) the TMI lines
in the local cache, (2) the overflow hardware registers, (3) the current Rsig and Wsig, and (4) the
CSTs. After saving this state (in the order listed), the OS issues an abort instruction, causing
the cache controller to revert all TMI and TI lines to I, and to clear the signatures, CSTs, and
overﬂow controller registers. This ensures that any subsequent conﬂicting access will miss in the
L1 cache and generate a directory request. In other words, for any given location, the ﬁrst conﬂict
between the running thread and a local descheduled thread always results in an L1 miss. The
L2 controller consults the summary signatures on each such miss, and traps to software when
a conﬂict is detected. A TStore to a line in M state generates a write-back (see Figure 1) that
also tests the RSsig and WSsig for conﬂicts. This resolves the corner case in which a suspended
transaction TLoaded a line in M state and a new transaction on the same processor TStores it.
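The check performed on an L1 miss can be sketched as follows. This is an illustration only: the summary signatures are modeled as exact Python sets rather than Bloom filters, and the rule that a load conflicts only with WSsig while a store conflicts with either summary is an assumption of the sketch, not something stated in the text above.

```python
from enum import Enum


class Access(Enum):
    LOAD = "load"
    STORE = "store"


def must_trap(access, line_addr, rs_summary, ws_summary):
    """Return True if a miss on line_addr should trap to the software handler.

    rs_summary and ws_summary stand in for RSsig and WSsig (sets here,
    so there are no false positives in this model).
    """
    if line_addr in ws_summary:
        return True  # any access conflicts with a suspended writer
    # Assumption: only stores conflict with suspended readers.
    return access is Access.STORE and line_addr in rs_summary
```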
On summary signature hits, a software handler mimics hardware operations on a per-thread ba-
sis, testing signature membership and updating the CSTs of suspended transactions. When using
the SM-cache design, the software metadata from versioning can be used to precisely identify the
writer conﬂict. No special instructions are required, since the CSTs and signatures of descheduled
threads are all visible in virtual memory. Nevertheless, updates need to be performed atomically to
ensure consistency when multiple active transactions conﬂict with a common descheduled transac-
tion and update the CSTs concurrently. The OS helps the handler distinguish among transactions
running on different processors. It maintains a global conﬂict management table (CMT), indexed
by processor id, with the following invariant: if transaction T is active, and has executed on pro-
cessor P, irrespective of the state of the thread (suspended/running), the transaction descriptor
will be included in P’s portion of the CMT. The handler uses the processor ids in its CST to in-
dex into the CMT and to iterate through transaction descriptors, testing the saved signatures for
conﬂicts, updating the saved CSTs (if running in lazy mode), or invoking conﬂict management
(if running in eager mode). Similar perusal of the CMT occurs at commit time if running in lazy
mode. As always, we abort a transaction by writing its TSW. If the remote transaction is running,
an alert is triggered since it would have previously ALoaded its TSW. Otherwise, the OS virtualizes
the AOU operation by causing the transaction to wake up in a software handler that checks
and re-ALoads the TSW.
Footnote 3: FlexTM updates RSsig and WSsig using a Sig message that uses the L1 coherence request
network to write the uncached memory-mapped registers. The directory updates the summary
signatures and returns an ACK on the forwarding network. This avoids races between the ACK and
remote requests that were forwarded to the suspending thread/processor before the summary
signatures were updated.
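The software handler's walk over the CMT can be sketched as below. The sketch makes several assumptions: saved signatures are modeled as exact sets, the descriptor layout (`TxDescriptor`) and the `manage_conflict` callback are hypothetical, and the choice of which CST (R-W, W-R, or W-W) records the conflict on the suspended side is illustrative. A per-descriptor lock stands in for the atomic update the text calls for.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class TxDescriptor:
    tx_id: int
    saved_reads: set    # stand-in for the saved Rsig
    saved_writes: set   # stand-in for the saved Wsig
    saved_cst: dict = field(
        default_factory=lambda: {"R-W": set(), "W-R": set(), "W-W": set()})
    lock: threading.Lock = field(default_factory=threading.Lock)


def handle_summary_hit(cmt, candidate_procs, me, line_addr, is_store,
                       eager, manage_conflict=None):
    """Walk the CMT entries of candidate_procs and record or resolve conflicts.

    cmt maps a processor id to the descriptors of transactions that have
    executed there; me is the processor running the handler.
    Returns the ids of the transactions found to conflict.
    """
    conflicting = []
    for proc in candidate_procs:
        for tx in cmt.get(proc, []):
            if line_addr in tx.saved_writes:
                kind = "W-W" if is_store else "W-R"
            elif is_store and line_addr in tx.saved_reads:
                kind = "R-W"
            else:
                continue  # signature test failed: no conflict with this tx
            conflicting.append(tx.tx_id)
            if eager:
                manage_conflict(me, tx)      # resolve immediately
            else:
                with tx.lock:                # atomic update of the saved CST
                    tx.saved_cst[kind].add(me)
    return conflicting
```

In lazy mode the handler only records the conflict, leaving resolution to commit time; in eager mode it hands the pair to the conflict manager at once.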
The directory needs to ensure that sticky bits are retained when a transaction is suspended.
Along with RSsig and WSsig, the directory maintains a bitmap indicating the processors on which
transactions are currently descheduled (the “Cores Summary” register in Figure 2). When the
directory would normally remove a processor from the sharers list (because a response to a coher-
ence request indicates that the line is no longer cached), the directory refrains from doing so if the
processor is in the Cores Summary list and the line hits in RSsig or WSsig. This ensures that the
L1 continues to receive coherence messages for lines accessed by descheduled transactions. It will
need these messages if the thread is switched back in, even if it never reloads the line.
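The sharer-retention rule amounts to a single predicate. A minimal sketch follows, with the Cores Summary register and the summary signatures modeled as Python sets; the function name is hypothetical.

```python
def should_remove_sharer(proc, line_addr, cores_summary, rs_summary, ws_summary):
    """Decide whether the directory may drop proc from a line's sharer list.

    cores_summary: processors with descheduled transactions.
    rs_summary, ws_summary: stand-ins for RSsig and WSsig (exact sets here).
    """
    sticky = proc in cores_summary and (
        line_addr in rs_summary or line_addr in ws_summary)
    return not sticky
```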
When re-scheduling a thread, if the thread is being scheduled back to the same processor from
which it was switched out, the thread's Rsig, Wsig, CST, and OT registers are restored on the
processor. The OS then re-calculates the summary signatures that correspond to the currently switched
out threads with active transactions and re-installs them at the directory. Thread migration is a
little more complex, since FlexTM performs write buffering and does not re-acquire ownership of
previously written cache lines. To avoid the inherent complexity, FlexTM adopts a simple policy
for migration: abort and restart.
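A minimal sketch of the rescheduling decision, assuming signatures are plain bit vectors and using a hypothetical `SavedTx` record. The summaries are recomputed from scratch because bits in a Bloom filter may be shared among threads and so cannot simply be cleared when one thread resumes.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class SavedTx:
    thread_id: int
    last_proc: int  # processor the thread was switched out from
    rsig: int       # saved read signature (bit vector)
    wsig: int       # saved write signature (bit vector)


def reschedule(tx, target_proc, descheduled):
    """Return (action, new RSsig, new WSsig) for bringing tx back in.

    descheduled lists all currently suspended transactions, including tx.
    """
    remaining = [t for t in descheduled if t.thread_id != tx.thread_id]
    rs_summary = reduce(lambda acc, t: acc | t.rsig, remaining, 0)
    ws_summary = reduce(lambda acc, t: acc | t.wsig, remaining, 0)
    # Same processor: restore registers. Migration: abort and restart.
    action = "restore" if target_proc == tx.last_proc else "abort_and_restart"
    return action, rs_summary, ws_summary
```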
Unlike LogTM-SE [46], FlexTM is able to place the summary signature at the directory rather
than on the path of every L1 access. This avoids the need for interprocessor interrupts to install
summary signatures. Since speculative state is ﬂushed from the local cache when descheduling a
transaction, the ﬁrst access to a conﬂicting line after rescheduling is guaranteed to miss, and the
conﬂict will be caught by the summary signature at the directory. Because it is able to abort remote
transactions using AOU, FlexTM also avoids the problem of potential convoying behind suspended
transactions.
7. Area Analysis
In this section, we brieﬂy summarize the area overheads of FlexTM. Further details can be found in
a technical report [40]. Area estimates appear in Table 5. We consider processors from a uniform
(65nm) technology generation to better understand microarchitectural tradeoffs. Processor com-
ponent sizes were estimated using published die images. FlexTM component areas were estimated
using CACTI 6.
Only for the 8-way multithreaded Niagara-2 do the Rsig and Wsig have a noticeable area impact:
2.2%; on Merom and Power6 they add only 0.1%. CACTI indicates that the signatures should
be readable and writable in less than the L1 access latency. These results appear to be consistent
with those of Sanchez et al. [35]. The CSTs for their part are full-map bit-vector registers (as wide
as the number of processors), and we need only three per hardware context. We do not expect the
extra state bits in the L1 to affect the access latency because (a) they have minimal impact on the
cache area and (b) the state array is typically accessed in parallel with the higher latency data array.
Finally, we compare the OT controller to the metadata cache (SM-cache) approach. While the
SM-cache is signiﬁcantly more area hungry than the controller, it is a regular memory structure
rather than a state machine. The SM-cache needs a separate hardware cache to store the metadata
while the OT controller's metadata (i.e., hash-table index entries) contend with regular data for
L2 cache space. Overall, the OT controller adds less than 0.5% to core area. Its state machine
is similar to Niagara-2’s Translation-Storage-Buffer walker [48]. Niagara-2 with its 16-byte data
cache line presents a worst-case design point for the SM-cache. The small cache line leads to high
overhead in page-level metadata, since there are more cache blocks per page (4× more than Merom
or Power6) and per-cache line metadata, since the per-cache line entry (17 bits) is a signiﬁcant
fraction of cache line size (16 bytes). Straightforward optimizations that would save area include
organizing the metadata to represent a larger than cache line region.
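The two overheads named above can be checked with simple arithmetic. The sketch assumes a 4KB page and, for the comparison point, 64-byte cache lines; the 17-bit entry and the 16-byte Niagara-2 line come from the text.

```python
def sm_cache_overheads(page_bytes, line_bytes, entry_bits=17):
    """Return (cache lines per page, per-line metadata as a fraction of line size)."""
    lines_per_page = page_bytes // line_bytes
    per_line_fraction = entry_bits / (line_bytes * 8)
    return lines_per_page, per_line_fraction


niagara = sm_cache_overheads(4096, 16)  # (256, 0.1328125)
wide = sm_cache_overheads(4096, 64)     # (64, 0.033203125)
```

With 16-byte lines a page holds 256 blocks instead of 64, and each 17-bit entry costs about 13% of the line it describes rather than about 3%.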
Overall, with either FlexTM (which includes the OT controller) or FlexTM-S (which includes
the SM-cache) the overheads imposed on out-of-order CMP cores (Merom and Power6) are well
under 1-2%. In the case of Niagara-2 (high core multithreading and small cache lines), FlexTM
add-ons require a 2.6% area increase while FlexTM-S’s add-ons require a 10% area increase.
Table 5: Area Estimation.
Processor                  Merom [34]   Power6 [16]   Niagara-2 [48]
Actual Die
  SMT (threads)            1            2             8
  Core (mm2)               31.5         53            11.7
  L1 D (mm2)               1.8          2.6           0.4
CACTI Prediction
  Rsig + Wsig (mm2)        0.033        0.066         0.26
  RSsig + WSsig (mm2)      0.033        0.033         0.033
  CSTs (registers)         3            6             24
  Extra state bits         2 (TA)       3 (TA,ID)     5 (TAID)
  % Core increase          0.6%         0.59%         2.6%
  % L1 Dcache increase     0.35%        0.29%         3.9%
  OT controller (mm2)      0.16         0.24          0.035
32 entry SM-Cache (mm2) 0.27 0.96
ID — SMT context of ‘TMI’ line
8. FlexTM Evaluation
8.1. Evaluation Framework
We evaluate FlexTM through full system simulation of a 16-way chip multiprocessor (CMP) with
private L1 caches and a shared L2 (see Table 6(a)), on the GEMS/Simics infrastructure [29]. We
added support for the FlexTM instructions using the standard Simics “magic instruction” interface.
Our base protocol is an adaptation of the SGI ORIGIN 2000 [25] for a CMP, extended to support
FlexTM’s requirements: signatures, CSTs, PDI, and AOU. Software routines (setjmp) were
used to checkpoint registers.
Simics allows us to run an unmodiﬁed Solaris 9 kernel. Simics also provides a “user-mode-
change” and “exception-handler” interface, which we use to trap user-kernel mode crossings.
On crossings, we suspend the current transaction mode and allow the OS to handle TLB misses,
register-window overﬂow, and other kernel activities required by an active user context in the midst
of a transaction. On transfer back from the kernel, we deliver any alert signals received during the
kernel routine, triggering the alert handler if needed.
We evaluate FlexTM using the benchmarks listed in Table 6(b). Workload set 1 is a
set of microbenchmarks obtained from the RSTM package [49], and Workload set 2 consists of
applications from STAMP [30] (see footnote 4) and STMBench7 [18]. Kmeans and Labyrinth spend
60–65% of their time in transactions; all other applications spend over 98% of their time in
transactions. In the microbenchmark tests, we execute a fixed number of transactions in a single
thread to warm up the structure, then fork off threads to perform the timed transactions. For the
STAMP workloads and STMBench7 we use the input setup described in Table 7. In Bayes and
Labyrinth we added padding to a few data structures to eliminate frequent false conflicts.

Table 6: Experimental setup.
(a) Target System Parameters.
16-way CMP, Private L1, Shared L2
  Processor Cores     16 1.2GHz in-order, single issue; non-memory IPC=1
  L1 Cache            32KB 2-way split, 64-byte blocks, 1 cycle,
                      32 entry victim buffer, 2Kbit signature [7, S14]
  L2 Cache            8MB, 8-way, 4 banks, 64-byte blocks, 20 cycle
  Memory              2GB, 250 cycle latency
  Interconnect        4-ary tree, 1 cycle, 64-byte links
Central Arbiter (Section 8.3)
  Arbiter Lat.        30 cycles [8]
  Commit Msg. Lat.    16 cycles/link
Commit messages also use the 4-ary tree.

(b) Workload Description.
Benchmark        Inst/tx   Wrset   Rdset   CST conflicts   Avg. per-tx   Avg. per-tx
                                           per-tx          W-W           R-W
Workload Set 1
  HashTable      110       2       5       0               0             0
  RBTree         1000      3       25      1               0             1.8
  LFUCache       125       1       2       6               0.8           0.8
  RandomGraph    11K       9       60      5               0.6           3
Workload Set 2
  Bayes          70K       150     225     3               0             1.7
  Delaunay       12K       20      83      1               0.10          1.1
  Genome         1.8K      9       49      0               0             0
  Intruder       410       14      41      2               0             1.4
  Kmeans         130       4       19      0               0             0
  Labyrinth      180K      190     160     3               0             2
  Vacation       5.5K      12      89      1               0             1.6
  STMBench7      105K      110     490     1               0.1           1.1
Setup: 16 threads with lazy conflict resolution.
Inst/tx: Instructions per transaction (K = kilo).
Wrset (Rdset): Number of written (read) cache lines.
CST conflicts per-tx: Number of CST bits set; median number of conflicting transactions
encountered.
Avg. per-tx W-W (R-W): Average number of common locations between pair-wise conflicting
transactions.

Table 7: Workload Inputs.
Benchmark        Inputs
Workload Set 1
  HashTable      1/3rd lookup, 1/3rd insert, 1/3rd delete
  RBTree         1/3rd lookup, 1/3rd insert, 1/3rd delete
  LFUCache       100% insert operation
  RandomGraph    1/3rd lookup, 1/3rd insert, 1/3rd delete
Workload Set 2
  Bayes          -v32 -r1024 -n2 -p20 -s0 -i2 -e2
  Delaunay       -a20 -i inputs/633.2
  Genome         -g256 -s16 -n16384
  Intruder       -a10 -l16 -n4096 -s1
  Kmeans         -m10 -n10 -t0.05 -i inputs/random2048-d16-c16.txt
  Labyrinth      -i random-x48-y48-z3-n64
  Vacation       -n4 -q45 -u90 -r1048576 -t4194304
  STMBench7      Reads 60%, Writes 40%; Short Traversals 40%, Long Traversals 5%, Ops. 45%, Mods. 10%
Footnote 4: We left out SSCA since it did not exercise the TM components. It has small
transactions and small working sets and is highly data parallel.
As Table 6(b) shows, the workloads we evaluate have varied dynamic characteristics. Delaunay
and Genome perform a large amount of work per memory access and represent workloads in
which time spent in the TM runtime is small compared to overall transaction latency. Kmeans
is essentially data parallel and along with the HashTable microbenchmark represents workloads
that are highly scalable with no noticeable level of conﬂicts. Intruder also has small transactions
but there is a high level of conﬂicts due to the presence of dueling write-write conﬂicts. The
short transactions in HashTable, KMeans, and Intruder suggest that TM runtime overheads (if
any) may become a signiﬁcant fraction of total transaction latency. LFUCache and Randomgraph
have a large number of conﬂicts and do not scale; any pathologies introduced by the TM runtime
itself [6] are likely to be exposed. Bayes, Labyrinth, and Vacation have moderate working set sizes
and signiﬁcant levels of read-write conﬂicts due to the use of tree-like data structures. RBTree
is a microbenchmark version of Vacation. STMBench7 is the most sophisticated application in
our suite. It has a varied mix of large and small transactions with varying types and levels of
conﬂicts [18].
Evaluation Dimensions We have designed the experiments to address the following questions:
• How does FlexTM perform relative to hybrid TMs, hardware-accelerated STMs, and STMs?
• How does FlexTM's CST-based parallel commit compare to a centralized hardware arbiter
design?
• How do the virtualization mechanisms deployed in FlexTM and FlexTM-S compare to
previously proposed software instrumentation (SigTM [30]) and virtual memory-based
implementations [11]?
8.2. FlexTM vs. Hybrid TMs and STMs
Result 1: Separable hardware support for conﬂict detection, conﬂict tracking, and versioning can
provide signiﬁcant acceleration for software controlled TMs; eliminating software bookkeeping
from the common case critical path is essential to realizing the full beneﬁts of hardware accelera-
tion.
Runtime systems We evaluate FlexTM and compare it against two different sets of Hybrid
TMs and STMs with two different sets of workloads.
Workload set 1 (WS1) interfaces with three TM systems: (1) FlexTM; (2) RTM-F [39], a hard-
ware accelerated STM system; and (3) RSTM [28], a non-blocking STM for legacy hardware
(conﬁgured to use invisible readers, with self validation for conﬂict detection). Workload set 2
(WS2), which uses a different API, interfaces with (1) FlexTM, (2) TL2, a blocking STM for
legacy hardware [14], and (3) SigTM [30], a hybrid TM derived from TL2 that uses hardware to
accelerate conﬂict detection. FlexTM, the hybrids (SigTM and RTM-F), and the STMs (RSTM
and TL2) have all been set up to perform Lazy conﬂict resolution.
We use the “Polka” conﬂict manager [36] in FlexTM, RTM-F, SigTM, and RSTM. TL2 limits
the choice of contention manager and uses a timestamp manager with backoff. While all runtime
systems execute on our simulated hardware, RSTM and TL2 make no use of FlexTM’s extensions.
RTM-F uses only PDI and AOU and SigTM uses only the signatures (Rsig and Wsig). FlexTM uses
all the presented mechanisms. Average speedups reported are geometric means.
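For reference, a geometric mean over per-benchmark speedups can be computed as in the sketch below. The input values in the usage line are hypothetical and are not results from this paper.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive speedup ratios."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive values")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


example = geometric_mean([2.0, 8.0])  # hypothetical inputs
```

The geometric mean is the usual choice for averaging ratios, since a 2× gain on one benchmark and a 2× loss on another cancel out.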
Results Figure 6 shows the performance (transactions/sec) normalized to sequential thread
performance for 1 thread runs. This demonstrates that the overheads of FlexTM are minimal. For
small transactions (e.g., HashTable) there is some overhead (≈15%) for the checkpointing of
processor registers, which FlexTM performs in software; it could take advantage of checkpointing
hardware if it exists.
We study scaling and performance with 16 thread runs (Figure 7). To illustrate the usefulness of
CSTs (see the table in Figure 7), we also report the number of conﬂicts encountered and resolved
by an average transaction—the number of bits set in the W-R and W-W CST registers.
[Figure 6, bar charts; y-axis: Normalized Throughput.
(a) Workload Set 1 (HashTable, RBTree, LFUCache, RandomGraph): FlexTM, RTM-F, RSTM.
(b) Workload Set 2 (Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, Vacation, STMBench7):
FlexTM, SigTM, TL2.]
Figure 6: Throughput (transactions/10^6 cycles), normalized to sequential thread. All performance
bars use 1 thread.
The performance of both STMs suffers from the bookkeeping required to track data versions,
detect conflicts, and guarantee a consistent view of memory (validation). RTM-F exploits AOU and
PDI to eliminate validation and copying overhead, but still incurs bookkeeping that accounts for
40–50% of execution time. SigTM uses signatures for conflict detection but performs versioning
entirely in software. On average, the overhead of software-based versioning is smaller than that of
software-based conflict detection, but it still accounts for as much as 30% of execution time on
some workloads (e.g., STMBench7). Because it supports only lazy conflict detection, SigTM has
simpler software metadata than RTM-F. RTM-F tracks conflicts for each individual transactional
location and could vary the eagerness on a per-location basis.
[Figure 7, bar charts; y-axis: Normalized Throughput.
(a) Workload Set 1 (HashTable, RBTree, LFUCache, RandomGraph): FlexTM, RTM-F, RSTM.
(b) Workload Set 2 (Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, Vacation, STMBench7):
FlexTM, SigTM, TL2.]
Figure 7: Throughput (transactions/10^6 cycles), normalized to sequential thread. All performance
bars use 16 threads.
FlexTM’s hardware tracks conﬂicts, buffers speculative state, and ensures consistency in a man-
ner transparent to software, resulting in single thread performance close to that of sequential thread
performance. FlexTM’s main overhead, register checkpointing, involves spilling of local registers
into the stack and is nearly constant across thread levels. Eliminating per-access software over-
heads (metadata tracking, validation, and copying) allows FlexTM to realize the full potential of
hardware acceleration, with an average speedup of 2× over RTM-F and 5.5× over RSTM on WS1.
On WS2, FlexTM has an average speedup of 1.7× over SigTM and 4.5× over TL2.
HashTable and RBTree both scale well and have significant speedup over sequential thread
performance, 10.3× and 8.3× respectively. In RSTM, validation and copying account for 22%
of execution time in HashTable and 50% in RBTree; metadata management accounts for 40%
and 30%, respectively. RTM-F manages to eliminate the validation cost and copying cost but
unfortunately the metadata management hinders performance improvement. FlexTM streamlines
transaction execution and provides 2.8× and 8.3× speedup over RTM-F and RSTM respectively.
LFUCache and RandomGraph do not scale (no performance improvement compared to sequen-
tial thread performance). In LFUCache, conﬂict for popular keys in the Zipf distribution forces
transactions to serialize. Stalled writers lead to extra aborts with larger numbers of threads, but per-
formance eventually stabilizes for all TM systems. In RandomGraph, larger numbers of conﬂicts
between transactions updating the same region in the graph cause all TM systems to experience
significant levels of wasted work. The average RandomGraph transaction reads 60 cache lines and
writes 9 cache lines. In RSTM, read-set validation accounts for 80% of execution time. RTM-F
eliminates this overhead, after which per-access bookkeeping accounts for 60% of execution time.
FlexTM eliminates this overhead as well, to achieve 2.7× the performance of RTM-F.
In applications with large access set sizes (i.e., Vacation, Bayes, Labyrinth, and STMBench7),
TL2 suffers from the bookkeeping required prior to the first read (i.e., for checking write sets),
after each read, and at commit time (for validation) [14]. This instrumentation accounts for ≈40%
of transaction execution time. SigTM uses signature-based conflict detection to eliminate this
overhead. Unfortunately, both TL2 and SigTM suffer from another source of overhead: given lazy
conflict resolution, reads need to search the redo log to see previous writes by their own transaction.
Furthermore, the software commit protocol needs to lock the metadata, perform the copyback,
and then release the locks. FlexTM eliminates the cost of versioning and conflict detection and
improves performance significantly, averaging 2.1× speedup over SigTM and 4.8× over TL2.
Genome and Delaunay are workloads with a large ratio between the transaction size and the
number of accesses. TL2's instrumentation on the reads does add significant overhead and affects
its scalability: only 3.4× and 2.1× speedup (at 16 threads) over sequential thread performance
for Genome and Delaunay respectively. SigTM eliminates the conflict detection overhead and
significantly improves performance, an average of 2.4× improvement over TL2. FlexTM, in spite
of the additional hardware support, improves performance by only 22%, since the versioning
overheads account for a smaller fraction of overall transactional execution.
Finally, Kmeans and Intruder have unusually small transactions. Software handlers add
significant overhead in TL2. In Kmeans, SigTM eliminates conflict detection overhead to improve
performance by 2.7× over TL2. Since the write sets are small, eliminating the versioning
overheads in FlexTM only improves performance a further 24%. Intruder has a high level of conflicts,
and doesn't scale well, with a 1.6× speedup for FlexTM over sequential thread performance (at
16 threads). Both SigTM and FlexTM eliminate the conflict detection handlers and streamline
the transactions, which leads to a change in the conflict pattern (fewer conflicts). This improves
performance significantly: 3.3× and 4.2× over TL2 for SigTM and FlexTM respectively. As in
Kmeans, the versioning overheads are smaller and FlexTM's improvement over SigTM is restricted
to 23%.
8.3. FlexTM vs. Central-Arbiter Lazy HTMs
Result 2: CSTs are useful: transactions don't often conflict and even when they do the number of
conflicts per transaction is less than the total number of active transactions. FlexTM's distributed
commit demonstrates better performance than a centralized arbiter.
As shown in Table 6(b), the number of conﬂicts encountered by a transaction is small compared
to the total number of concurrent transactions in the system. Even in workloads that have a large
number of conﬂicts (LFUCache and RandomGraph) a transaction typically encounters conﬂicts
only about 30% of the time. Scalable workloads (e.g., Vacation, Kmeans) encounter essentially no
conﬂicts. This clearly suggests that global arbitration and serialized commits will not only waste
bandwidth but also restrict concurrency. CSTs enable local arbitration and the distributed commit
protocol allows parallel commits thereby unlocking the full concurrency potential of the applica-
tion. Also, a transaction’s commit overhead in FlexTM is not a constant, but rather proportional to
the number of conﬂicting transactions encountered.
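The shape of a CST-based lazy commit can be sketched as follows. This is an illustration under simplifying assumptions: the CSTs are modeled as sets of processor ids, compare-and-swap on a transaction status word (TSW) is emulated with a lock, and the conflict manager is omitted so that the committer always tries to abort its enemies. The names `StatusWord` and `lazy_commit` are hypothetical.

```python
import threading

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"


class StatusWord:
    """Transaction status word with an emulated compare-and-swap."""

    def __init__(self):
        self.state = ACTIVE
        self._lock = threading.Lock()

    def cas(self, expected, new):
        with self._lock:
            if self.state == expected:
                self.state = new
                return True
            return False


def lazy_commit(me, tsw, w_r, w_w):
    """Try to commit the transaction on processor me.

    tsw maps processor id to StatusWord; w_r and w_w are the committer's
    W-R and W-W CSTs as sets of processor ids.
    Returns (committed?, number of conflicting transactions visited).
    """
    enemies = (w_r | w_w) - {me}
    for proc in sorted(enemies):
        tsw[proc].cas(ACTIVE, ABORTED)  # abort each conflicting transaction
    committed = tsw[me].cas(ACTIVE, COMMITTED)
    return committed, len(enemies)
```

With empty CSTs the loop does nothing and the commit reduces to a single operation on the committer's own status word, so the cost grows only with the number of conflicts recorded.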
In this set of experiments, we compare FlexTM’s distributed commit against two schemes with
centralized hardware arbiters: Central-Serial and Central-Parallel. In both schemes, instead of
using CSTs and requiring each transaction to ALoad its TSW, transactions forward their Rsig and
Wsig to a central hardware arbiter at commit time. The arbiter orders each commit request, and
broadcasts the Wsig to other processors. Every recipient uses the forwarded Wsig to check for
conflicts and, if it finds one, aborts its active transaction; it also sends an ACK as a response to the arbiter. The arbiter
collects all the ACKs and then allows the committing processor to complete. This process adds
97 cycles to a transaction, assuming unloaded links and arbiter (latencies are listed in Table 6(a)).
The Serial version services only one commit request at a time (queuing up any others); the Par-
allel services all non-conﬂicting transactions in parallel (assuming inﬁnite buffers in the arbiter).
Central arbiters are similar in spirit to BulkSC [8], but serve only to order commits; they do not
interact with the L2 directory.
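The difference between the two arbiters lies in when a commit request must wait. A minimal sketch follows, assuming signatures are bit vectors and that two commits conflict when either one's write signature intersects the other's read or write signature; this conflict rule and the `Arbiter` class are assumptions of the sketch, and broadcast and ACK collection are abstracted into `complete`.

```python
from collections import deque


class Arbiter:
    """Toy model of the Central-Serial and Central-Parallel arbiters."""

    def __init__(self, parallel):
        self.parallel = parallel
        self.in_flight = {}   # proc -> (rsig, wsig) of commits being serviced
        self.queue = deque()  # waiting requests, unbounded as in the text

    @staticmethod
    def _conflict(a, b):
        (ra, wa), (rb, wb) = a, b
        return bool(wa & (rb | wb)) or bool(wb & (ra | wa))

    def request(self, proc, rsig, wsig):
        """Return True if the commit may proceed now, False if it is queued."""
        sigs = (rsig, wsig)
        if self.parallel:
            blocked = any(self._conflict(sigs, other)
                          for other in self.in_flight.values())
        else:
            blocked = bool(self.in_flight)  # serial: one commit at a time
        if blocked:
            self.queue.append((proc, sigs))
            return False
        self.in_flight[proc] = sigs
        return True

    def complete(self, proc):
        """All ACKs collected for proc: retire it and retry queued requests."""
        del self.in_flight[proc]
        pending, self.queue = self.queue, deque()
        return [p for p, (r, w) in pending if self.request(p, r, w)]
```

In serial mode even non-conflicting commits queue behind the one in service, which is the serialization overhead discussed in this section.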
We present results (see Figure 8) for all our workloads and enumerate the general trends below:
• Arbitration latency for the Central commit scheme is on the critical path of transactions.
This gives rise to noticeable overhead in the case of short transactions (e.g., HashTable, RBTree,
LFUCache, Kmeans, and Intruder). CSTs simplify the commit process: in the absence of conflicts,
a commit requires only a single memory operation on a transaction's cached status word. On these
workloads, CSTs improve performance by an average of 25% even over the aggressive
Central-Parallel, which only serializes a transaction commit if it conflicts with an already in-flight commit.
• Workloads that exhibit inherent parallelism with Lazy conflict resolution (all except
LFUCache and RandomGraph) suffer from serialization of commits in Central-Serial. Central-Serial
essentially queues up transaction commits and introduces the commit latency of even other
non-conflicting transactions onto the critical path. The serialization of commits could also change the
conflict pattern. In some workloads (e.g., Intruder, STMBench7), in the presence of reader-writer
conflicts, as the reader transaction waits for predecessors to release the arbiter resource, the reader
could be aborted by the conflicting writer. In a system that allows parallel commits the reader could
finish earlier and elide the conflict entirely. CST-based commit provides an average of ≈50% and
a maximum of 112% (HashTable) improvement over Central-Serial. Central-Parallel removes the
serialization overhead, but still suffers from commit arbitration latency.
• In benchmarks with high conflict levels (e.g., LFUCache and RandomGraph) that don't
inherently scale, Central's conflict management strategy avoids performance degradation. The
transaction being serviced by the arbiter always commits successfully, ensuring progress and livelock
freedom. The current distributed protocol allows the possibility of livelock. However, the CSTs
streamline the commit process, narrow the vulnerability window (to essentially the interprocessor
message latency), and eliminate the problem as effectively as Central. Lazy conflict resolution
inherently eliminates livelocks as well [41, 44].
At low conﬂict levels, a CST-based commit requires mostly local operations and its performance
should be comparable to an ideal Central-Parallel (i.e., zero message and arbitration latency).
At high conﬂict levels, the penalties of Central are lower compared to the overhead of aborts
and workload inherent serialization. Finally, the inﬂuence of commit latency on performance is
dependent on transaction latency (e.g., reducing commit latency helps Central-Parallel approach
FlexTM’s throughput in HashTable but has negligible impact on RandomGraph’s throughput).
[Figure 8, bar chart; y-axis: Normalized Throughput (1 thread = 1); series: FlexTM,
Central-Parallel, Central-Serial; benchmarks: Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth,
Vacation, STMBench7, HashTable, RBTree, LFUCache, RandomGraph.]
Figure 8: FlexTM vs. Centralized hardware arbiters.
8.4. FlexTM-S vs. Other Virtualization Mechanisms
To study TM virtualization mechanisms, we downgrade our private L1 caches to 32KB 2-way. This
ensures that in spite of the moderate write set sizes in our workloads they experience overﬂows due
to associativity constraints. Every L1 has access to a 64 entry SM-cache. Each metadata entry is
136 bytes.
We use ﬁve benchmarks in our study: Bayes, Delaunay, Labyrinth, and Vacation from the
STAMP suite, and STMBench7. As Table 6(b) shows, these benchmarks have the largest write
sets and are most likely to generate L1 cache overﬂows, enabling us to highlight tradeoffs among
the various virtualization mechanisms. The fraction of total transactions that experience overﬂows
in Bayes, Delaunay, Labyrinth, Vacation and STMBench7 is 11%, 8%, 25%, 9% and 32% respec-
tively.
We compare FlexTM-S's performance against the following Lazy TM systems: (1) FlexTM,
which employs a hardware controller for overflowed state and signatures for conflict detection;
(2) XTM [11], which uses virtual memory to implement all TM operations; (3) XTM-e, which
employs virtual memory support for versioning but performs conflict detection using cache-line
granularity tag bits; and (4) SigTM [30], which uses hardware signatures for conflict detection
and software instrumentation for word-granularity versioning. All systems employ the Polka [37]
contention manager.
Result 3: A software maintained metadata cache is sufficient to provide virtualization support
with negligible overhead.
As shown in Figure 9, FlexTM-S imposes a modest performance penalty (10%) compared to
FlexTM. This is encouraging since it is vastly simpler to implement the SM-cache than the
controller in FlexTM. The SM-cache miss and copyback handlers are the main contributors to the
overhead. Unlike FlexTM and FlexTM-S, which version only the overflowed cache lines, XTM
and XTM-e suffer from the overhead of page-granularity versioning. XTM's page-granularity
conflict detection also leads to excessive aborts. XTM and XTM-e both rely on heavyweight OS
mechanisms; by contrast, FlexTM-S requires only user-level interrupt handlers. Finally, SigTM
incurs significant overhead due to software lookaside checks to determine if an accessed location
is being buffered.
[Figure 9, bar charts; y-axis: Normalized Throughput; series: FlexTM, FlexTM-S, XTM-e, SigTM,
XTM; panels: (a) Bayes, (b) Delaunay, (c) Labyrinth, (d) Vacation, (e) STMBench7.]
Figure 9: Throughput at 16 threads for FlexTM-S vs. other TMs, normalized to FlexTM.
We also analyzed the inﬂuence of signature false positives. In FlexTM-S, write signature false
positives can lead to increased handler invocation for loading the SM-cache, but the software
metadata can be used to disambiguate and avoid abort penalty. In FlexTM, signature responses are
treated as true conflicts, and cause contention manager invocations that could lead to excessive
aborts. We set the Wsig and Osig to 32 bits (see Figure 10) to investigate the performance penalties
of small write signatures.
Result 4: As Figure 10 shows, FlexTM-S’s use of software metadata to disambiguate false
positives helps reduce the needed size of hardware signatures while maintaining high performance.
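The disambiguation step can be illustrated with a deliberately tiny signature. The sketch assumes a 32-bit signature with a single hypothetical hash (line number modulo 32) and models the software metadata as an exact set of overflowed lines; the class and method names are illustrative.

```python
SIG_BITS = 32  # small signature, as in the experiment described above


def sig_bit(line):
    """Hypothetical single-hash mapping from a cache-line number to one bit."""
    return 1 << (line % SIG_BITS)


class OverflowState:
    """Overflow signature backed by precise software metadata."""

    def __init__(self):
        self.osig = 0          # hardware signature (may alias)
        self.metadata = set()  # precise record kept by software

    def record(self, line):
        self.osig |= sig_bit(line)
        self.metadata.add(line)

    def classify(self, line):
        if not self.osig & sig_bit(line):
            return "miss"      # the signature filters the access in hardware
        # Signature hit: the handler consults the precise metadata.
        return "true_conflict" if line in self.metadata else "false_positive"
```

Lines 5 and 37 alias in this signature, so after `record(5)` a probe for line 37 hits in hardware but is rejected by the metadata check, at the cost of a handler invocation rather than an abort.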
[Figure 10, bar chart; y-axis: Normalized Throughput; benchmarks: Bayes, Labyrinth, Delaunay,
STMBench7; series: FlexTM (32-bit Wsig), FlexTM-S (32-bit Osig and OWsig).]
Figure 10: Signature size effect (relative to FlexTM with 2048-bit Wsig).
9. Conclusions
FlexTM introduces Conflict Summary Tables; combines them with Bloom filter signatures,
alert-on-update, and programmable data isolation; and virtualizes the combination across context
switches, overflow, and page-swaps. The resulting system provides TM support that decouples
the conflict detection mechanism from conflict resolution time and allows software to control the
latter (i.e., Eager, Lazy, or Mixed), resulting in a high performance TM substrate on which software
can dictate policy. To the best of our knowledge, it is the ﬁrst hardware TM to admit an STM-like
distributed commit protocol, allowing an unbounded number of Lazy and/or Eager transactions
to arbitrate and commit in parallel. To virtualize transaction state, we propose two alternative
designs—an aggressive hardware controller and a complexity-effective hardware-software design.
The latter was evaluated via the FlexTM-S TM system, which further simpliﬁes the versioning
mechanism by supporting a Mixed mode for conﬂict resolution.
On a variety of benchmarks, FlexTM imposes minimal TM runtime overheads (comparable to
sequential thread latency) and attains 5× the performance of STM and 1.8× the performance
of Hybrid TMs. Experiments with centralized commit schemes indicate that FlexTM’s
distributed protocol is free from the arbitration and serialization overheads of centralized hardware
managers. Finally, comparing FlexTM-S with other virtualization mechanisms, we find that it is a
complexity-effective alternative with <10% performance loss compared to the base FlexTM with
full hardware-based overflow controller support.
Though we do not elaborate on the possibility here, we have also begun to experiment with
non-TM uses of our decoupled hardware [39, 40, TR version]; we expect to extend this work by
developing more general interfaces and exploring their applications.
10. Biography
Arrvindh Shriraman is a graduate student in computer science at the University of Rochester.
He received his B.E. from the University of Madras, India, and his M.S. from the University of
Rochester. His research interests include multiprocessor system design, hardware-software inter-
face, and parallel programming models.
Sandhya Dwarkadas is a Professor of Computer Science and of Electrical and Computer
Engineering at the University of Rochester. Her research lies at the interface of hardware and
software with a particular focus on concurrency, resulting in numerous publications that cross
areas within systems. She has recently been associate editor for IEEE Computer Architecture
Letters (2006-2009), program and general chair for ISPASS’07 and ISPASS’08 respectively, and
is a past associate editor for IEEE Transactions on Parallel and Distributed Systems.
Michael Scott is a Professor and past Chair of the Computer Science Department at the Uni-
versity of Rochester. He is a Fellow of the ACM and the IEEE, a recipient of the Dijkstra
Prize in Distributed Computing, and author of the textbook Programming Language Pragmatics
(3rd edition, Morgan Kaufmann, 2009). He was recently Program Chair of TRANSACT’07 and of
PPoPP’08.
References
[1] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded Transactional Memory. In
Proc. of the 11th Intl. Symp. on High Performance Computer Architecture, pages 316-327, San Francisco, CA,
Feb. 2005.
[2] L. Baugh, N. Neelakantan, and C. Zilles. Using Hardware Memory Protection to Build a High-Performance,
Strongly Atomic Hybrid Transactional Memory. In Proc. of the 35th Intl. Symp. on Computer Architecture,
Beijing, China, June 2008.
[3] B. H. Bloom. Space/Time Trade-Off in Hash Coding with Allowable Errors. Comm. of the ACM, 13(7):422-426,
July 1970.
[4] C. Blundell, E. C. Lewis, and M. M. K. Martin. Subtleties of Transactional Memory Atomicity Semantics. IEEE
Computer Architecture Letters, 5(2), Nov. 2006.
[5] J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift, and D. A. Wood. Performance Pathologies in
Hardware Transactional Memory. In Proc. of the 34th Intl. Symp. on Computer Architecture, pages 32-41, San
Diego, CA, June 2007.
[6] J. Bobba, N. Goyal, M. D. Hill, M. M. Swift, and D. A. Wood. TokenTM: Efficient Execution of Large Transactions
with Hardware Transactional Memory. In Proc. of the 35th Intl. Symp. on Computer Architecture, Beijing,
China, June 2008.
[7] L. Ceze, J. Tuck, C. Cascaval, and J. Torrellas. Bulk Disambiguation of Speculative Threads in Multiprocessors.
In Proc. of the 33rd Intl. Symp. on Computer Architecture, Boston, MA, June 2006.
[8] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas. BulkSC: Bulk Enforcement of Sequential Consistency. In
Proc. of the 34th Intl. Symp. on Computer Architecture, San Diego, CA, June 2007.
[9] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. Cao Minh, W. Baek, C. Kozyrakis, and K. Olukotun. A
Scalable, Non-blocking Approach to Transactional Memory. In Proc. of the 13th Intl. Symp. on High Perfor-
mance Computer Architecture, Phoenix, AZ, Feb. 2007.
[10] W. Chuang, S. Narayanasamy, G. Venkatesh, J. Sampson, M. V. Biesbrouck, G. Pokam, B. Calder, and O.
Colavin. Unbounded Page-Based Transactional Memory. In Proc. of the 12th Intl. Conf. on Architectural
Support for Programming Languages and Operating Systems, pages 347-358, San Jose, CA, Oct. 2006.
[11] J. Chung, C. Cao Minh, A. McDonald, T. Skare, H. Chafi, B. D. Carlstrom, C. Kozyrakis, and K. Olukotun.
Tradeoffs in Transactional Memory Virtualization. In Proc. of the 12th Intl. Conf. on Architectural Support for
Programming Languages and Operating Systems, pages 371-381, San Jose, CA, Oct. 2006.
[12] L. Dalessandro and M. L. Scott. Strong Isolation is a Weak Idea. In 4th ACM SIGPLAN Workshop on Transac-
tional Computing, Raleigh, NC, Feb. 2009.
[13] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid Transactional Memory. In
Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, San
Jose, CA, Oct. 2006.
[14] D. Dice, O. Shalev, and N. Shavit. Transactional Locking II. In Proc. of the 20th Intl. Symp. on Distributed
Computing, pages 194-208, Stockholm, Sweden, Sept. 2006.
[15] K. Fraser and T. Harris. Concurrent Programming Without Locks. ACM Trans. on Computer Systems, 25(2):article 5,
May 2007.
[16] J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, G. Mittal, E. Chan, Y. Chan, D. Plass, S.
Chu, H. Le, L. Clark, J. Ripley, S. Taylor, J. Dilullo, and M. Lanzerotti. Design of the Power6 Microprocessor.
In Proc. of the Intl. Solid State Circuits Conf., pages 96-97, San Francisco, CA, Feb. 2007.
[17] J. R. Goodman. Coherency for Multiprocessor Virtual Address Caches. In Proc. of the 2nd Intl. Conf. on
Architectural Support for Programming Languages and Operating Systems, pages 72-81, Oct. 1987.
[18] R. Guerraoui, M. Kapałka, and J. Vitek. STMBench7: A Benchmark for Software Transactional Memory. In
Proc. of the 2nd EuroSys, Lisbon, Portugal, Mar. 2007.
[19] L. Hammond, V. Wong, M. Chen, B. Hertzberg, B. Carlstrom, M. Prabhu, H. Wijaya, C. Kozyrakis, and K.
Olukotun. Transactional Memory Coherence and Consistency. In Proc. of the 31st Intl. Symp. on Computer
Architecture, München, Germany, June 2004.
[20] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software Transactional Memory for Dynamic-sized
Data Structures. In Proc. of the 22nd ACM Symp. on Principles of Distributed Computing, pages 92-101, Boston,
MA, July 2003.
[21] M. Herlihy and J. E. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In
Proc. of the 20th Intl. Symp. on Computer Architecture, San Diego, CA, May 1993. Expanded version available
as CRL 92/07, DEC Cambridge Research Laboratory, Dec. 1992.
[22] M. D. Hill, D. Hower, K. E. Moore, M. M. Swift, H. Volos, and D. A. Wood. A Case for Deconstructing Hardware
Transactional Memory Systems. Technical Report 1594, Dept. of Computer Sciences, Univ. of Wisconsin–
Madison, June 2007.
[23] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid Transactional Memory. In Proc. of the 11th
ACM Symp. on Principles and Practice of Parallel Programming, New York, NY, Mar. 2006.
[24] J. R. Larus and R. Rajwar. Transactional Memory, Synthesis Lectures on Computer Architecture. Morgan &
Claypool, 2007.
[25] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. of the 24th Intl. Symp.
on Computer Architecture, Denver, CO, June 1997.
[26] Y. Lev and J.-W. Maessen. Split Hardware Transaction: True Nesting of Transactions Using Best-effort Hardware
Transactional Memory. In Proc. of the 13th ACM Symp. on Principles and Practice of Parallel Programming,
Salt Lake City, UT, Feb. 2008.
[27] V. J. Marathe, W. N. Scherer III, and M. L. Scott. Adaptive Software Transactional Memory. In Proc. of the 19th
Intl. Symp. on Distributed Computing, Cracow, Poland, Sept. 2005.
[28] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. N. Scherer III, and M. L. Scott. Lowering the
Overhead of Software Transactional Memory. In Proc. of the 1st ACM SIGPLAN Workshop on Transactional
Computing, Ottawa, ON, Canada, June 2006. Expanded version available as TR 893, Dept. of Computer Science,
Univ. of Rochester, Mar. 2006.
[29] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D.
Hill, and D. A. Wood. Multifacet’s General Execution-driven Multiprocessor Simulator (GEMS) Toolset. In
ACM SIGARCH Computer Architecture News, Sept. 2005.
[30] C. Cao Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun.
An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees. In Proc. of the 34th Intl.
Symp. on Computer Architecture, San Diego, CA, June 2007.
[31] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-based Transactional Memory.
In Proc. of the 12th Intl. Symp. on High Performance Computer Architecture, Austin, TX, Feb. 2006.
[32] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing Transactional Memory. In Proc. of the 32nd Intl. Symp. on
Computer Architecture, Madison, WI, June 2005.
[33] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. Cao Minh, and B. Hertzberg. McRT-STM: A High Performance
Software Transactional Memory System for a Multi-Core Runtime. In Proc. of the 11th ACM Symp. on Principles
and Practice of Parallel Programming, NY, USA, Mar. 2006.
[34] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. The Implementation of the 65nm
Dual-Core 64b Merom Processor. In Proc. of the Intl. Solid State Circuits Conf., San Francisco, CA, Feb. 2007.
[35] D. Sanchez, L. Yen, M. D. Hill, and K. Sankaralingam. Implementing Signatures for Transactional Memory. In
Proc. of the 40th Intl. Symp. on Microarchitecture, Chicago, IL, Dec. 2007.
[36] W. N. Scherer III and M. L. Scott. Advanced Contention Management for Dynamic Software Transactional
Memory. In Proc. of the 24th ACM Symp. on Principles of Distributed Computing, Las Vegas, NV, July 2005.
[37] W. N. Scherer III and M. L. Scott. Randomization in STM Contention Management (poster paper). In Proc. of
the 24th ACM Symp. on Principles of Distributed Computing, Las Vegas, NV, July 2005.
[38] M. L. Scott. Sequential Specification of Transactional Memory Semantics. In Proc. of the 1st ACM SIGPLAN
Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006.
[39] A. Shriraman, M. F. Spear, H. Hossain, S. Dwarkadas, and M. L. Scott. An Integrated Hardware-Software
Approach to Flexible Transactional Memory. In Proc. of the 34th Intl. Symp. on Computer Architecture, San
Diego, CA, June 2007. Earlier but expanded version available as TR 910, Dept. of Computer Science, Univ. of
Rochester, Dec. 2006.
[40] A. Shriraman, S. Dwarkadas, and M. L. Scott. Flexible Decoupled Transactional Memory Support. In Proc. of
the 35th Intl. Symp. on Computer Architecture, Beijing, China, June 2008. Expanded version available as TR
925, URCS, Nov. 2007.
[41] A. Shriraman and S. Dwarkadas. Refereeing Conflicts in Hardware Transactional Memory. In Proc. of the 2009
ACM Intl. Conf. on Supercomputing, NY, USA, June 2009.
[42] M. F. Spear, V. J. Marathe, W. N. Scherer III, and M. L. Scott. Conflict Detection and Validation Strategies for
Software Transactional Memory. In Proc. of the 20th Intl. Symp. on Distributed Computing, Stockholm, Sweden,
Sept. 2006.
[43] M. F. Spear, A. Shriraman, H. Hossain, S. Dwarkadas, and M. L. Scott. Alert-on-Update: A Communication Aid
for Shared Memory Multiprocessors (poster paper). In Proc. of the 12th ACM Symp. on Principles and Practice
of Parallel Programming, San Jose, CA, Mar. 2007.
[44] M. F. Spear, L. Dalessandro, V. Marathe, and M. L. Scott. A Comprehensive Strategy for Contention Manage-
ment in Software Transactional Memory. In Proc. of the 14th ACM Symp. on Principles and Practice of Parallel
Programming, Mar. 2009.
[45] S. Tomić, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal, T. Harris, and M. Valero. EazyHTM,
Eager-Lazy Hardware Transactional Memory. In Proc. of the 42nd Intl. Symp. on Microarchitecture, NY, USA,
Dec. 2009.
[46] L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M. Swift, and D. A. Wood. LogTM-SE:
Decoupling Hardware Transactional Memory from Caches. In Proc. of the 13th Intl. Symp. on High Performance
Computer Architecture, Phoenix, AZ, Feb. 2007.
[47] C. Zilles and L. Baugh. Extending Hardware Transactional Memory to Support Non-Busy Waiting and Non-
Transactional Actions. In Proc. of the 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON,
Canada, June 2006.
[48] Sun Microsystems Inc. OpenSPARC T2 Core Microarchitecture Specification. July 2005.
[49] The Rochester Software Transactional Memory Runtime. 2006. www.cs.rochester.edu/research/synchronization/rstm/.