First ACM SIGPLAN Workshop on Languages, compilers and Hardware Support for Transactional Computing by Vitek, Jan & Jagannathan, Suresh
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
2006 
First ACM SIGPLAN Workshop on Languages, compilers and 
Hardware Support for Transactional Computing 
Jan Vitek 
Purdue University, jv@cs.purdue.edu 
Suresh Jagannathan 
Purdue University, suresh@cs.purdue.edu 
Report Number: 
06-011 
Vitek, Jan and Jagannathan, Suresh, "First ACM SIGPLAN Workshop on Languages, compilers and 
Hardware Support for Transactional Computing" (2006). Department of Computer Science Technical 
Reports. Paper 1654. 
https://docs.lib.purdue.edu/cstech/1654 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
FIRST ACM SIGPLAN WORKSHOP ON 
LANGUAGES, COMPILERS AND HARDWARE 
SUPPORT FOR TRANSACTIONAL COMPUTING 
Participant Proceedings 
Jan Vitek - General Chair 
Suresh Jagannathan - Program Chair 








First ACM SIGPLAN Workshop on 
Languages, Compilers, and Hardware Support 
for Transactional Computing 
June 11, 2006 
Ottawa, Canada 
PARTICIPANT PROCEEDINGS




Transact'06 was held in conjunction with PLDI on June 11, 2006 in Ottawa, 
Canada. The goal of the workshop as stated in the call for papers was to 
provide a forum for the presentation of novel research covering all aspects 
of transactional computing, including new software or hardware techniques, 
algorithms, implementations, and analyses. 
In response to the call, 19 high-quality submissions were received, including
two submissions from PC members. Each submission was rigorously
reviewed by at least three members of the program committee; in several
instances, outside reviews from experts were also solicited. After extensive 
deliberation, 10 papers were chosen for presentation at the workshop. 
I would like to thank all the members of the PC for their thoughtful and detailed 
reviews, Jan Vitek for kindly agreeing to serve as General Chair, the steering 
committee for their useful advice on the workshop organization and theme, 















Cliff Click, Azul 
Laurent Daynes, Sun 
Rick Hudson, Intel 
Stephen Freund, Williams 
Dan Grossman, Washington 
Suresh Jagannathan, Purdue 
Christos Kozyrakis, Stanford 
Peter O'Hearn, Queen Mary, U. of London 
Bill Pugh, Maryland 
Ravi Rajwar, Intel 
Nir Shavit, Sun 
David Tarditi, Microsoft 
Mandana Vaziri, IBM 
General Chair: 
Jan Vitek, Purdue 
Program Chair: 
Suresh Jagannathan, Purdue 
Steering Committee 
Tim Harris, Microsoft 
Maurice Herlihy, Brown 
Tony Hosking, Purdue 
Doug Lea, SUNY, Oswego 
Eliot Moss, UMass, Amherst 
Jan Vitek, Purdue 










Nesting Transactions: Why and What do we need? 
Eliot Moss, University of Massachusetts Amherst 
We are seeing many proposals supporting atomic transactions in program- 
ming languages, software libraries, and hardware, some with and some with- 
out support for nested transactions. I argue that it is important to support 
nesting, and to go beyond closed nesting to open nesting. I will argue as to 
the general form open nesting should take and why, namely that it is a prop- 
erty of classes (data types) not code regions, and must include support for 
programmed concurrency control as well as programmed rollback. I will also 
touch on the implications for software or hardware transactional memory in 







Session 1: Software Transactions
Lowering the Overhead of 
Nonblocking Software Transactional Memory * 
Virendra J. Marathe Michael F. Spear Christopher Heriot 
Athul Acharya David Eisenstat William N. Scherer III Michael L. Scott 
Computer Science Dept., University of Rochester 
{vmarathe,spear,cheriot,aacharya,eisen,scherer,scott}@cs.rochester.edu 
Abstract 
Recent years have seen the development of several different sys- 
tems for software transactional memory (STM). Most either em- 
ploy locks in the underlying implementation or depend on thread- 
safe general-purpose garbage collection to collect stale data and 
metadata. 
We consider the design of low-overhead, obstruction-free software
transactional memory for non-garbage-collected languages.
Our design eliminates dynamic allocation of transactional metadata
and co-locates data that are separate in other systems, thereby
reducing the expected number of cache misses on the common-case
code path, while preserving nonblocking progress and requiring
no atomic instructions other than single-word load, store, and
compare-and-swap (or load-linked/store-conditional). We also employ
a simple, epoch-based storage management system and introduce
a novel conservative mechanism to make reader transactions
visible to writers without inducing additional metadata copying or
dynamic allocation. Experimental results show throughput significantly
higher than that of existing nonblocking STM systems, and
highlight significant application-specific differences among conflict
detection and validation strategies.
General Terms transactional memory, nonblocking synchro- 
nization, obstruction freedom, storage management, visible readers 
1. Introduction 
Recent years have seen the development of several new systems for 
software transactional memory (STM). Interest in these systems is 
high because hardware vendors have largely abandoned the quest 
for faster uniprocessors, and 40 years of evidence suggests that only 
the most talented programmers can write good lock-based code. 
In comparison to locks, transactions avoid the correctness prob- 
lems of priority inversion, deadlock, and vulnerability to thread 
failure, as well as the performance problems of lock convoying 
and vulnerability to preemption and page faults. Perhaps most im- 
portant, they free programmers from the unhappy choice between 
concurrency and conceptual clarity: transactions combine, to first 
approximation, the simplicity of a single coarse-grain lock with the 
high-contention performance of fine-grain locks. 
Originally proposed by Herlihy and Moss as a hardware mechanism [16],
transactional memory (TM) borrows the notions of
atomicity, consistency, and isolation from database transactions. In
a nutshell, the programmer labels a body of code as atomic, and
the underlying system finds a way to execute it together with other
atomic sections in such a way that they all appear to linearize [14]
in an order consistent with their threads' other activities. When two
active transactions are found to be mutually incompatible, one will
abort and restart automatically. The ability to abort transactions
eliminates the complexity (and potential deadlocks) of fine-grain
locking protocols. The ability to execute (nonconflicting) transactions
simultaneously leads to potentially high performance: a
high-quality implementation should maximize physical parallelism
among transactions whenever possible, while freeing the programmer
from the complexity of doing so.
* This work was supported in part by NSF grants CCR-0204344, CNS-0411127,
and CNS-0509270; an IBM Faculty Partnership Award; and by
financial and equipment support from Sun Microsystems; and financial
support from Intel.
Modern TM systems may be implemented in hardware, in software,
or in some combination of the two. We focus here on software.
Some STM systems are implemented using locks [2, 25, 32].
Others are nonblocking [4, 5, 9, 13, 21]. While there is evidence
that lock-based STM may be faster in important cases (notably
because it avoids the overhead of creating new copies of to-be-modified
objects), such systems solve only some of the traditional
problems of locks: they eliminate the crucial concurrency/clarity
tradeoff, but they remain vulnerable to priority inversion, thread
failure, convoying, preemption, and page faults. We have chosen in
our work to focus on nonblocking STM.
More specifically, we focus on obstruction-free STM, which 
simplifies the implementation of linearizable semantics by allow- 
ing forward progress to be delegated to an out-of-band contention 
manager. As described by Herlihy et al. [12], an obstruction-free 
algorithm guarantees that a given thread, starting from any fea- 
sible system state, will make progress in a bounded number of 
steps if other threads refrain from performing conflicting opera- 
tions. Among published STM systems, DSTM [13], WSTM [9], 
ASTM [21], and (optionally) SXM [5] are obstruction-free. DSTM, 
ASTM, and SXM employ explicitly segregated contention manage- 
ment modules. Experimentation with these systems confirms that 
carefully tuned contention management can dramatically improve 
performance in applications with high contention [5,6,26,27,28]. 
Existing STM systems also differ with respect to the granularity
of sharing. A few, notably WSTM [9] and (optionally) McRT [25],
are typically described as word-based, though the more general
term might be "block-based": they detect conflicts and enforce
consistency on fixed-size blocks of memory, independent of high-level
data semantics. Most proposals for hardware transactional
memory similarly operate at the granularity of cache lines [1, 8, 22,
23, 24]. While block-based TM appears to be the logical choice for
hardware implementation, it is less attractive for software: the need
to instrument all, or at least most, load and store instructions
may impose unacceptable overheads. In the spirit of traditional file
system operations, object-based STM systems employ an explicit
open operation that incurs the bookkeeping overhead once, up
front, for all transactional accesses to some language-level object. 
The rest of this paper focuses on object-based STM. 
Object-based STM systems are often but not always paired with 
object-oriented languages. One noteworthy exception is Fraser's 
OSTM [4], which supports a C-based API. DSTM and ASTM are 
Java-based. SXM is for C#. Implementations for languages like 
these benefit greatly from the availability of automatic garbage 
collection (as does STM Haskell [10]). Object-based STM systems 
have tended to allocate large numbers of dynamic data copies and 
metadata structures; figuring out when to manually reclaim these is 
a daunting task. 
While recent innovations have significantly reduced the cost 
of STM, current systems are still nearly an order of magnitude 
slower than lock-based critical sections for simple, uncontended 
operations. A major goal of the work reported here is to understand 
the remaining costs, to reduce them wherever possible, and to 
explain why the rest are unavoidable. Toward this end we have 
developed the Rochester Software Transactional Memory runtime 
(RSTM), which (1) employs only a single level of indirection to 
access data objects (rather than the more common two), thereby 
reducing cache misses, (2) avoids dynamic allocation or collection 
of per-object or per-transaction metadata, (3) avoids any copying of 
objects for read-only transactions, (4) avoids tracing or reference 
counting garbage collection altogether, and (5) supports a variety 
of options for conflict detection and contention management. 
RSTM is written in C++, allowing its API to make use of 
inheritance and templates. It could also be used in C, though such 
use would be less convenient. We do not yet believe the system 
is as fast as possible, but preliminary results suggest that it is a 
significant step in the right direction, and that it is convenient, 
robust, and fast enough to provide a highly attractive alternative 
to locks in many applications. 
In Section 2 we briefly survey existing STM systems, focus- 
ing on the functionality they must provide and the overheads they 
typically suffer. Section 3 introduces our RSTM system, focusing 
on metadata management, storage management, and the C++ API. 
Section 4 presents performance results. We compare RSTM to both 
coarse and fine-grain locking on a variety of common microbenchmarks,
and apportion its overhead among memory management,
data copying, conflict detection, contention management, and other 
object and transaction bookkeeping. Section 5 summarizes our con- 
clusions and enumerates issues for future research. 
2. Existing STM Systems 
Existing STM systems can be categorized in many ways, several 
of which are explored in our previous papers [19, 20, 21, 29]. All 
share certain fundamental characteristics: shared memory is orga- 
nized as a collection of logical or physical blocks, which deter- 
mine the granularity at which accesses may be made. A transaction 
that wishes to update several blocks must first acquire ownership 
of those blocks. Ownership resembles a revocable ("stealable") 
lock [11]. Any transaction that wishes to access an acquired block 
can find the descriptor of the transaction that owns it. The descrip- 
tor indicates whether the owner is active, committed, or aborted. A 
block that belongs to an active transaction can be stolen only if the 
stealer first aborts the owner. 
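The ownership rule above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the interface of any system cited here: the names `TxStatus`, `TxDescriptor`, `Block`, and `try_steal` are hypothetical, and contention management and memory reclamation are left out.

```cpp
#include <atomic>

// Hypothetical names for illustration only.
enum class TxStatus { Active, Committed, Aborted };

struct TxDescriptor {
    std::atomic<TxStatus> status{TxStatus::Active};

    // Move this transaction from Active to Aborted with a single CAS.
    // Returns true only if this call performed the abort; a transaction
    // that already committed or aborted is left unchanged.
    bool abort_if_active() {
        TxStatus expected = TxStatus::Active;
        return status.compare_exchange_strong(expected, TxStatus::Aborted);
    }
};

struct Block {
    std::atomic<TxDescriptor*> owner{nullptr};  // nullptr: not acquired
};

// A stealer may take a block from an active owner only after aborting it.
bool try_steal(Block& b, TxDescriptor& stealer) {
    TxDescriptor* cur = b.owner.load();
    if (cur == &stealer) return true;            // already ours
    if (cur != nullptr) cur->abort_if_active();  // owner is finished after this
    // Install ourselves only if ownership did not change in the meantime.
    return b.owner.compare_exchange_strong(cur, &stealer);
}
```

In this sketch a thread that finds the block owned reads the owner's descriptor to learn its status, and the only way to displace an active owner is the CAS on that status word.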
Acquisition is the hook that permits conflict detection: it makes 
writers visible to one another and to readers. Acquisition can occur 
any time between the initial access to a block and final transaction 
commit. Later acquisition allows greater speculation in TM imple- 
mentations, and provides more opportunity for potentially conflict- 
ing transactions to execute in parallel. Parallelism between a writer 
and a group of readers can be 100% productive if the writer finishes 
last. Parallelism among multiple writers is more purely speculative: 
only one can commit, but there is in general no way to tell up front 
which one it "ought" to be. Once readers or writers are visible, 
choosing the circumstances under which to steal a block (and thus 
to abort the owner) is the problem of contention management. 
2.1 Major Design Decisions 
As noted in Section 1, most STM proposals work at the granular- 
ity of language-level objects, accessed via pointers. A transaction 
that wishes to access an object opens it for read-only or read-write 
access. (An object already open for read-only access may also be 
upgraded to read-write access.) If the object may be written, the 
transaction creates a new copy on which to make modifications. 
Some time prior to committing, the transaction must acquire each 
object it wishes to modify, and ensure that no other transaction has 
acquired any of the objects it has read. Both old and new versions of 
acquired objects remain linked to system metadata while the trans- 
action is active. A one-word change to the transaction descriptor 
implicitly makes all new versions valid if the transaction commits,
or restores all old versions if it aborts.
Metadata organization. Information about acquired objects 
must be maintained in some sort of transactional metadata. This 
metadata may be organized in many ways. Two concrete possibili- 
ties appear in Figures 1 and 2. In the DSTM of Herlihy et al. [13], 
(Figure 1), an Object Header (pointer) points at a Locator structure, 
which in turn points at old and new copies of the data, and at the 
descriptor of the most recent transaction to acquire the object. If 
the transaction has committed, the new copy of the data is current. 
If the transaction has aborted, the old copy of the data is current. If 
the transaction is active, the data cannot safely be read or written 
by any other transaction. A writer acquires an object by creating 
and initializing a new copy of the data and a new Locator structure, 
and installing this Locator in the Object Header using an atomic 
compare-and-swap (CAS) instruction.¹
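A rough C++ rendering of the DSTM-style layout just described may help fix the idea. It is a sketch under stated assumptions rather than DSTM's actual code (DSTM itself is Java-based): the type and function names are hypothetical, the payload is a single `int`, and superseded Locators and data copies are simply leaked.

```cpp
#include <atomic>

// Hypothetical types sketching the per-object metadata described above.
enum class Status { Active, Committed, Aborted };

struct Transaction { std::atomic<Status> status{Status::Active}; };

struct Data { int value; };

struct Locator {
    Transaction* owner;  // most recent transaction to acquire the object
    Data* old_data;      // current if the owner aborted
    Data* new_data;      // current if the owner committed
};

struct ObjectHeader { std::atomic<Locator*> loc{nullptr}; };

// Returns the current version, or nullptr while the owner is still active.
Data* current_version(const Locator* l) {
    switch (l->owner->status.load()) {
        case Status::Committed: return l->new_data;
        case Status::Aborted:   return l->old_data;
        default:                return nullptr;
    }
}

// Writer acquire: copy the current data, build a new Locator, install it
// with one CAS. Returns the writer's private copy, or nullptr on failure.
// Reclamation of the replaced Locator is deliberately omitted.
Data* acquire(ObjectHeader& h, Transaction& tx) {
    Locator* seen = h.loc.load();
    Data* cur = current_version(seen);
    if (cur == nullptr) return nullptr;  // owner active: a contention manager decides
    Locator* fresh = new Locator{&tx, cur, new Data(*cur)};
    if (h.loc.compare_exchange_strong(seen, fresh)) return fresh->new_data;
    delete fresh->new_data;              // lost the race; discard our copy
    delete fresh;
    return nullptr;
}
```

Note the two levels of indirection on every access: header to Locator, then Locator to data.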
In the OSTM of Fraser and Harris [3], (Figure 2), an Object 
Header usually points directly at the current copy of the data. To 
acquire an object for read-write access, a transaction changes the 
Object Header to point at the transaction descriptor. The descriptor, 
in turn, contains lists of objects opened for read-only or read-write 
access. List entries for writable objects include pointers to old and 
new versions of the data. The advantage of this organization is 
that a conflicting transaction (one that wishes to access a currently 
acquired object) can easily find all of its competitor's metadata. 
The disadvantage is that it must peruse the competitor's read-write 
list to find the current copy of any given object. We refer to the 
DSTM approach as per-object metadata; we refer to the OSTM 
approach as per-transaction metadata. Our RSTM system uses per- 
object metadata, but it avoids the need for Locators by merging 
their contents into the newer data object, which in turn points to the 
older. Details can be found in Section 3. 
Conflict detection. Existing STM systems also differ in the 
time at which writers acquire objects and perform conflict de- 
tection. Some systems, including DSTM, SXM [5], WSTM [9], 
and McRT [25], are eager: writers acquire objects at open time. 
Others, including OSTM, STM Haskell [10], and Transactional 
Monitors [32], are lazy: they delay acquires until just before com- 
mit time. Eager acquire allows conflicts between transactions to 
be detected early, possibly avoiding useless work in transactions 
that are doomed to abort. At the same time, eager acquire admits 
the possibility that a transaction will abort a competitor and then 
fail to commit itself, thereby wasting any work that the aborted 
competitor had already performed. Lazy acquire has symmetric
properties: it may allow doomed transactions to continue, but it
may also overlook potential conflicts that never actually materialize.
In particular, lazy acquire potentially allows short-running
readers to commit in parallel with the execution of a long-running
writer that also commits.
¹ Throughout this paper we use CAS for atomic updates. In all cases load-
Figure 1. The Dynamic Software Transactional Memory (DSTM) of Herlihy
et al. A writer acquires an object at open time. It creates and
initializes a new Locator (with a pointer to a new copy of the
previously valid Data Object), and installs this Locator in the Object
Header using an atomic compare-and-swap instruction.
Figure 2. The Object-Based Software Transactional Memory (OSTM) of Fraser
and Harris. Objects are added to a transaction's read-only and
read-write lists at open time. To acquire an object, a writer uses a
compare-and-swap instruction to swing the Object Header's pointer from
the old version of the Data Object to the Transaction Descriptor. After
the transaction commits (or aborts), a separate cleanup phase swings the
pointer from the Transaction Descriptor to the new (or old) version of
the Data Object.
In either case (eager or lazy conflict detection), writers are
visible to readers and to writers. Readers may or may not, however, 
be visible to writers. In the original version of DSTM, readers are 
invisible: a reader that opens an object after a writer can make an 
explicit decision as to which of the two transactions should take 
precedence, but a writer that opens an object after a reader has no 
such opportunity. Newer versions of DSTM add an explicit list of 
visible readers to every transactional object, so writers, too, can 
detect concurrent readers. The visibility of readers also has a major 
impact on the cost of validation, which we discuss later in this 
section. Like our Java-based ASTM system [21], RSTM currently 
supports both eager and lazy acquire. It also supports both visible 
and invisible readers. The results in Section 4 demonstrate that all 
four combinations can be beneficial, depending on the application. 
Adapting intelligently among these is a focus of ongoing work. 
Contention management. An STM system that uses lazy ac- 
quire knows the complete set of objects it will access before it ac- 
quires any of them. It can sort its read-write list by memory address 
and acquire them in order, thereby avoiding circular dependences 
among transactions and, thus, deadlock. OSTM implements a sim- 
ple strategy for conflict resolution: if two transactions attempt to 
write the same object, the one that acquires the object first is consid- 
ered to be the "winner". To ensure nonblocking progress, the later- 
arriving thread (the "loser") peruses the winner's metadata and re- 
cursively helps it complete its commit, in case it has been delayed 
due to preemption or a page fault. As a consequence, OSTM is able 
to guarantee lock freedom [15]: from the point of view of any given 
thread, the system as a whole makes forward progress in a bounded 
number of time steps. 
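The address-ordered acquisition mentioned at the start of this paragraph can be sketched in a few lines of C++. The names `Object` and `order_for_acquire` are hypothetical, and the sketch shows only the ordering step, not the acquire itself.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Object { int payload; };  // hypothetical stand-in for a shared object

// Put a lazily collected write set into a canonical order before acquiring.
// std::less yields a total order over pointers even for unrelated objects,
// so every transaction acquires overlapping objects in the same sequence.
void order_for_acquire(std::vector<Object*>& write_set) {
    std::sort(write_set.begin(), write_set.end(), std::less<Object*>());
    // Drop duplicates so no object is acquired twice.
    write_set.erase(std::unique(write_set.begin(), write_set.end()),
                    write_set.end());
}
```

Because two transactions that share objects always request them in the same order, neither can hold one object while waiting for another that the other transaction acquired earlier in the order.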
Unfortunately, helping may result in heavy interconnect con- 
tention and high cache miss rates. Lock freedom also leaves open 
the possibility that a thread will starve, e.g. if it tries repeatedly 
to execute a long, complex transaction in the face of a continual 
stream of short conflicting transactions in other threads. 
Many nonblocking STM systems, including DSTM, SXM, 
WSTM, and ASTM, provide a weaker guarantee of obstruction 
freedom [12] and then employ some external mechanism to main- 
tain forward progress. In the case of DSTM, SXM, and ASTM, 
this mechanism takes the form of an explicit contention manager, 
which prevents, in practice, both livelock and starvation. When a 
transaction A finds that the object it wishes to open has already 
been acquired by some other transaction B, A calls the contention 
manager to determine whether to abort B, abort itself, or wait for 
a while in the hope that B may complete. The design of contention 
management policies is an active area of research [6,7,26,27,28]. 
Our RSTM is also obstruction-free. The experiments reported in 
Section 4 use the "Polka" policy we devised for DSTM [27]. 
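As an illustration of the three-way choice a contention manager makes, the following C++ sketch implements a simple bounded-wait rule. It is explicitly not the Polka policy, whose details are not given in this section; the names `Decision`, `ConflictInfo`, and `BoundedWaitManager`, and the use of "objects opened" as a measure of invested work, are assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical interface for illustration only.
enum class Decision { AbortOther, AbortSelf, Wait };

struct ConflictInfo {
    std::uint32_t attempts;    // times the caller has already waited on this conflict
    std::uint32_t my_work;     // objects opened so far by the caller (A)
    std::uint32_t other_work;  // objects opened so far by the competitor (B)
};

class BoundedWaitManager {
public:
    explicit BoundedWaitManager(std::uint32_t max_waits) : max_waits_(max_waits) {}

    // Wait a bounded number of times in the hope that B completes; after
    // that, abort whichever transaction has invested less work.
    Decision resolve(const ConflictInfo& c) const {
        if (c.attempts < max_waits_) return Decision::Wait;
        return c.my_work >= c.other_work ? Decision::AbortOther
                                         : Decision::AbortSelf;
    }

private:
    std::uint32_t max_waits_;
};
```

Bounding the number of waits keeps the caller from blocking indefinitely behind a stalled competitor.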
Validating Readers. Transactions in a nonblocking object-based 
STM system create their own private copy of each to-be-written 
Data Object. These copies become visible to other transactions at 
acquire time, but are never used by other transactions unless and 
until the writer commits, at which point the data object is im- 
mutable. A transaction therefore knows that its Data Objects, both 
read and written, will never be changed by any other transaction. 
Moreover, with eager acquire a transaction A can verify that it still 
owns all of the objects in its write set simply by checking that the 
status word in its own transaction descriptor is active: to steal one 
of A's objects, a competing transaction must first abort A. 
But what about the objects in A's read set or those in A's write 
set for a system that does lazy acquire? If A's interest in these 
objects is not visible to other transactions, then a competitor that 
acquires one of these objects will not only be unable to perform 
contention management with respect to A (as noted in the para- 
graph on conflict detection above), it will also be unable to inform 
A of its acquire. While A will, in such a case, be doomed to abort 
when it discovers (at commit time) that it has been working with 
an out-of-date version of the object, there is a serious problem in- 
between: absent machinery not yet discussed, a doomed transac- 
tion may open and work with mutually inconsistent copies of dif- 
ferent objects. If the transaction is unaware of such inconsisten- 
cies it may inadvertently perform erroneous operations that cannot 
be undone on abort. Certain examples, including address/alignment 
errors and illegal instructions, can be caught by establishing an ap- 
propriate signal handler. One can even protect against spurious in- 
finite loops by double-checking transaction status in response to a 
periodic timer signal. Absent complete sandboxing, however [31] 
(implemented via compiler support or binary rewriting), we do not 
consider it feasible to tolerate inconsistency: use of an invalid data
or code pointer can lead to modification of arbitrary (nontransactional)
data or execution of arbitrary code.²
In the original version of DSTM, with invisible readers, a trans- 
action avoids potential inconsistency by maintaining a private read 
list that remembers all values (references) previously returned by 
read. On every subsequent read the transaction checks to make 
sure these values are still valid, and aborts if any is not. Unfortunately,
for n read objects, this incremental validation incurs O(n^2)
aggregate cost. Visible readers solve the problem: a writer that wins
at contention management explicitly aborts all visible readers of an
object at acquire time. Readers, for their part, can simply double-check
their own transaction status when opening a new object, an
O(1) operation. Unfortunately, visible readers obtain this asymptotic
improvement at the expense of a significant increase in contention:
by writing to metadata that would otherwise only be read,
visible readers tend to invalidate useful lines in the caches of other 
readers. 
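To see where the quadratic cost of incremental validation comes from, consider the following C++ sketch of an invisible reader. The names `Header`, `ReadEntry`, and `InvisibleReader` are hypothetical, and a version is identified simply by the pointer the header currently holds.

```cpp
#include <cstddef>
#include <vector>

struct Header { const void* current; };  // hypothetical: points at the current version

struct ReadEntry { const Header* header; const void* seen; };

class InvisibleReader {
public:
    // Open an object for reading: revalidate every earlier read, then
    // record this one. Returns false if an earlier read has gone stale,
    // in which case the transaction must abort.
    bool open_read(const Header& h) {
        for (const ReadEntry& e : reads_) {
            ++checks_;
            if (e.header->current != e.seen) return false;
        }
        reads_.push_back({&h, h.current});
        return true;
    }

    std::size_t checks() const { return checks_; }  // validation steps so far

private:
    std::vector<ReadEntry> reads_;
    std::size_t checks_ = 0;
};
```

Opening the k-th object costs k - 1 checks, so opening n objects costs 0 + 1 + ... + (n - 1) = n(n - 1)/2 checks in total, which is the O(n^2) aggregate cost noted above.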
Memory Management. Since most STM systems do not use signals
to immediately abort doomed transactions, some degree of automatic
storage reclamation is necessary. For example, if transaction
TA reads an object O invisibly and is then aborted (implicitly)
by transaction TB acquiring O, it is possible for TA to run for an arbitrary
amount of time, reading stale values from O. Consequently,
even if TB commits, it cannot reclaim space for the older version
of O until it knows that TA has detected that it has been aborted.
This problem is easily handled by a general purpose garbage 
collector. However, in languages such as C++ that permit ex- 
plicit memory management, we believe that the reclamation policy 
should be decided by the programmer; existing code that carefully 
manages its memory should not have to accept the overhead of a 
tracing collector simply to use transactions. Instead we provide in 
RSTM an epoch-based collection policy for transactional objects 
only. 
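The general idea behind epoch-based reclamation can be sketched as follows. This is a generic textbook-style scheme written for illustration, not RSTM's actual collector: the class `EpochReclaimer`, the fixed `kMaxThreads`, and the assumption that `retire` and `collect` are called by one thread at a time are all choices made for the sketch. An object must already be unreachable from shared metadata before it is retired.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

constexpr int kMaxThreads = 8;  // assumption for the sketch
constexpr std::uint64_t kIdle = std::numeric_limits<std::uint64_t>::max();

class EpochReclaimer {
public:
    EpochReclaimer() { for (auto& s : slot_) s.store(kIdle); }

    // A thread announces the epoch it observed when its transaction began,
    // and withdraws the announcement when the transaction ends.
    void enter(int tid) { slot_[tid].store(global_.load()); }
    void exit(int tid)  { slot_[tid].store(kIdle); }

    // Defer a deletion: stamp it with the current epoch, then open a new one.
    void retire(std::function<void()> deleter) {
        std::uint64_t e = global_.fetch_add(1);
        pending_.push_back({e, std::move(deleter)});
    }

    // Run every deleter stamped earlier than all announced epochs.
    // Returns the number of objects reclaimed.
    std::size_t collect() {
        std::uint64_t oldest = kIdle;
        for (auto& s : slot_) oldest = std::min(oldest, s.load());
        std::size_t freed = 0;
        std::vector<Pending> keep;
        for (auto& p : pending_) {
            if (p.epoch < oldest) { p.deleter(); ++freed; }
            else keep.push_back(std::move(p));
        }
        pending_.swap(keep);
        return freed;
    }

private:
    struct Pending { std::uint64_t epoch; std::function<void()> deleter; };
    std::atomic<std::uint64_t> global_{0};
    std::atomic<std::uint64_t> slot_[kMaxThreads];
    std::vector<Pending> pending_;
};
```

In terms of the example above, TB would retire the old version of O after committing; the space is released only once TA, which entered before the retirement, has noticed its abort and called `exit`.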
2.2 Potential Sources of Overhead 
In trying to maximize the performance of STM, we must consider 
several possible sources of overhead: 
Bookkeeping. Object-based STM typically requires at least n + 1 
CAS operations to acquire n objects and commit. It may require an 
additional n CASes for post-commit cleanup of headers. Additional 
overhead is typically incurred for private read lists and write lists. 
These bookkeeping operations impose significant overhead in the 
single-thread or low-contention case. In the high-contention case 
they are overshadowed by the cost of cache misses. RSTM employs 
preallocated read and write lists in the common case to minimize 
bookkeeping overhead, though it requires 2n + 1 CASes. Cache 
misses are reduced in the presence of contention by employing 
a novel metadata structure: as in OSTM, object headers typically 
point directly at the current copy of the data; but as in DSTM, the 
current copy of the data can always be found with at most three 
memory accesses. Details appear in Section 3. 
² Suppose that m() is a virtual method of parent class P, from which child
classes C1 and C2 are derived. Suppose further that C2.m() cannot be called
safely from transactional code (perhaps it modifies global data under the
assumption that some lock is held). Now suppose that transaction T reads
objects x and y, where y contains a reference to a P and x identifies the type
of the reference in y as a (transaction-safe) C1 object. Unfortunately, after
T reads x but before it reads y, another transaction modifies both objects,
putting a C2 reference into y and recording this fact in x. Because x has
been modified, T is doomed to abort, but if it does not abort right away, it
may read the C2 reference in y and call its unsafe method m(). While this
example is admittedly contrived, it illustrates a fundamental problem: type
safety is insufficient to eliminate the need for validation.
Memory management. Both data objects and dynamically al- 
located metadata (transaction descriptors, DSTM Locators, OSTM 
Object Handles) require memory management. In garbage-collected 
languages this includes the cost of tracing and reclamation. In the 
common case, RSTM avoids dynamic allocation altogether for 
transaction metadata; for object data it marks old copies for dele- 
tion at commit time, and lazily reclaims them using a lightweight, 
epoch-based scheme. 
Conflict Resolution. Both the sorting required for deadlock 
avoidance and the helping required for conflict resolution can incur 
significant overhead in OSTM. The analogous costs in obstruction-free
systems (for calls to a contention manager) appear likely to
be lower in almost all cases, though it is difficult to separate these 
costs cleanly from other factors. 
In any TM system one might also include as conflict resolution 
overhead the work lost to aborted transactions or to spin-based 
waiting. Like our colleagues at Sun, we believe that obstruction- 
free systems have a better chance of minimizing this useless work, 
because they permit the system or application designer to choose 
a contention management policy that matches (or adapts to) the 
access patterns of the offered workload [27]. 
Validation. RSTM is able to employ both invisible and visible 
readers. As noted above, visible readers avoid O(n^2) incremental 
validation cost at the expense of potentially significant contention. 
A detailed evaluation of this tradeoff is the subject of future work. 
In separate work we have developed a hardware mechanism for 
fast, contention-free announcement of read-write conflicts [30]. 
Visible readers in DSTM are quite expensive: to ensure lineariz- 
ability, each new reader creates and installs a new Locator con- 
taining a copy of the entire existing reader list, with its own id 
prepended. RSTM employs an alternative implementation that re- 
duces this overhead dramatically. 
Copying. Every writer creates a copy of every to-be-written ob- 
ject. For small objects the overhead of copying is dwarfed by other 
bookkeeping overheads, but for a large object in which only a small 
change is required, the unneeded copying can be significant. We 
are pursuing hardware assists for in-place data update [29], but this 
does nothing for legacy machines, and is beyond the scope of the 
current paper. For nonblocking systems built entirely in software 
we see no viable alternative to copies, at least without compiler 
support (see footnote 3).
3. RSTM Details 
In Section 2 we noted that RSTM (1) adopts a novel organization 
for metadata, with only one level of indirection in the common 
case; (2) avoids dynamic allocation of anything other than (copies 
of) data objects, and provides a lightweight, epoch-based collector 
for data objects; and (3) employs a lightweight heuristic for visible 
reader management. The first three subsections below elaborate on 
these points. Section 3.4 describes the C++ API. 
3.1 Metadata Management 
RSTM metadata is illustrated in Figure 3. Every shared object 
is accessed through an Object Header, which is unique over the 
lifetime of the object. The header contains a pointer to the Data 
Object (call it D) allocated by the writer (call it W) that most
recently acquired the object. (The header also contains a list of 
visible readers; we defer discussion of these to Section 3.2.) If the 
low bit of the New Data pointer is zero, then D is guaranteed to be 
the current copy of the data, and its Owner and Old Data pointers 
are no longer needed. If the low bit of the New Data pointer is one, 
[Footnote 3: With compiler support, rollback is potentially viable.]
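[Figure 3 diagram: an Object Header (New Data pointer with Clean Bit,
Visible Reader 1 through n) refers to the new-version Data Object
(Owner, Old Data), which refers to the old-version Data Object and to
its owner's Transaction Descriptor (Status).]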
Figure 3. RSTM metadata. Transaction Descriptors are preallo- 
cated, one per thread (as are private read and write lists [not 
shown]). A writer acquires an object by writing the New Data 
pointer in the Object Header atomically. The Owner and Old 
Data in the Data Object are never changed after initialization. The 
"clean" bit in the Header indicates that the "new" Data Object is 
current, and that the Transaction Descriptor of its Owner may have 
been reused. Visible Readers are updated non-atomically but con- 
servatively. 
then D's Owner pointer is valid, as is W's Transaction Descriptor,
to which that pointer refers. If the Status field of the Descriptor is
Committed, then D is the current version of the object. If the Status
is Aborted, then D's Old Data pointer is valid, and the Data Object
to which it refers (call it E) is current. If the Status is Active, then
no thread can read or write the object without first aborting W (see
footnote 4). E's Owner and Old Data fields are definitely garbage; while they
may still be in use by some transaction that does not yet know it is 
doomed, they will never be accessed by a transaction that finds E 
by going through D. 
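
The rules above for locating the current version can be sketched in C++
as follows. This is a minimal illustration under stated assumptions, not
RSTM's actual code: the names TxDescriptor, DataObject, ObjectHeader,
pack, and current_version are invented for this example, and the Active
case is simply reported to the caller instead of being handed to a
contention manager.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration; not the RSTM source.
enum class Status { Active, Committed, Aborted };

struct TxDescriptor {
    std::atomic<Status> status{Status::Active};
};

struct DataObject {
    TxDescriptor* owner;     // writer W that installed this copy
    DataObject*   old_data;  // version E that this copy superseded
    int           payload;
};

struct ObjectHeader {
    // Pointer to the newest Data Object; low bit set means "dirty".
    std::atomic<std::uintptr_t> new_data{0};
};

inline std::uintptr_t pack(DataObject* d, bool dirty) {
    return reinterpret_cast<std::uintptr_t>(d) | (dirty ? 1u : 0u);
}
inline DataObject* unpack(std::uintptr_t w) {
    return reinterpret_cast<DataObject*>(w & ~std::uintptr_t{1});
}
inline bool is_dirty(std::uintptr_t w) { return (w & 1u) != 0; }

// Returns the current version, or nullptr if the owner is still Active
// (the caller would then have to resolve the conflict first).
inline DataObject* current_version(const ObjectHeader& h) {
    std::uintptr_t w = h.new_data.load(std::memory_order_acquire);
    DataObject* d = unpack(w);
    if (!is_dirty(w)) return d;                      // clean: D is current
    switch (d->owner->status.load(std::memory_order_acquire)) {
        case Status::Committed: return d;            // D is current
        case Status::Aborted:   return d->old_data;  // E is current
        case Status::Active:    return nullptr;      // W must be aborted first
    }
    return nullptr;
}
```

The sketch relies on Data Objects being at least two-byte aligned, so
that the low bit of the pointer is free to serve as the dirty flag.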
To avoid dynamic allocation, each thread reuses a single stati- 
cally allocated Transaction Descriptor across all of its transactions. 
When it finishes a transaction, the thread traverses its local write 
list and attempts to clean the objects on the list. If the transac- 
tion commits successfully, the thread simply tries to CAS the low 
bit of the New Data pointer from one to zero. If the transaction 
aborted, the thread attempts to change the pointer from a dirty ref- 
erence to D (low bit one) to a clean reference to E (low bit zero). 
If the CAS fails, then some other thread has already performed 
the cleanup operation or subsequently acquired the object. In ei- 
ther event, the current thread marks the no-longer-valid Data Ob- 
ject for eventual reclamation (to be described in Section 3.3). Once 
the thread reaches the end of its write list, it knows that there are no 
extant references to its Transaction Descriptor, so it can reuse that 
Descriptor in the next transaction. 
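
The cleanup pass just described might look roughly like the sketch
below. All names here (WriteEntry, cleanup, the retired flag) are
assumptions made for the example; in particular, the retired flag is
only a stand-in for handing the object to the reclamation scheme of
Section 3.3.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types for illustration; not the RSTM source.
struct DataObject {
    DataObject* old_data;  // version this copy superseded
    bool        retired;   // stand-in for "marked for reclamation"
    explicit DataObject(DataObject* old) : old_data(old), retired(false) {}
};

struct ObjectHeader {
    std::atomic<std::uintptr_t> new_data{0};
};

inline std::uintptr_t dirty(DataObject* d) {
    return reinterpret_cast<std::uintptr_t>(d) | 1u;
}
inline std::uintptr_t clean(DataObject* d) {
    return reinterpret_cast<std::uintptr_t>(d);
}

struct WriteEntry { ObjectHeader* header; DataObject* mine; };

// Walk the write list once the transaction has committed or aborted.
inline void cleanup(std::vector<WriteEntry>& writes, bool committed) {
    for (WriteEntry& e : writes) {
        std::uintptr_t expected = dirty(e.mine);
        // Commit: clear the low bit. Abort: swing back to the old version.
        std::uintptr_t desired =
            committed ? clean(e.mine) : clean(e.mine->old_data);
        // A failed CAS means another thread already cleaned or re-acquired
        // the object; either way the obsolete copy can be retired.
        e.header->new_data.compare_exchange_strong(expected, desired);
        DataObject* obsolete = committed ? e.mine->old_data : e.mine;
        if (obsolete) obsolete->retired = true;
    }
    writes.clear();  // no header refers to the descriptor any longer
}
```

Note that the result of the CAS is deliberately ignored: as the text
explains, success and failure lead to the same next step.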
Because the Owner and Old Data fields of Data Objects are 
never changed after initialization, and because a Transaction De- 
scriptor is never reused without cleaning the New Data pointers in 
the Object Headers of all written objects, the status of an object 
is uniquely determined by the value of the New Data pointer (this 
assumes that Data Objects are never reused while any transaction 
might retain a pointer to them; see Section 3.3). After following 
a dirty New Data pointer and reading the Transaction Descriptor's 
Status, transaction T will attempt to clean the New Data pointer in 
the header or, if T is an eager writer, install a new Data Object. In 
either case the CAS will fail if any other transaction has modified 
the pointer in-between, in which case T will start over. 
[Footnote 4: We have designed, but not yet implemented, an extension
that would allow readers to use the old version of the data while the
current owner is Active, in hopes of finishing before that owner commits.]
At the beginning of a transaction, a thread sets the status of 
its Descriptor to Active. On every subsequent open of object A 
(assuming invisible readers), the thread (1) acquires A if opening it 
eagerly for write; (2) adds A to the private read list (in support of 
future validations) or write list (in support of cleanup); (3) checks 
the status word in its Transaction Descriptor to make sure it hasn't 
been aborted by some other transaction (this serves to validate 
all objects previously opened for write); and (4) incrementally 
validates all objects previously opened for read. Validation entails
checking to make sure that the Data Object returned by an earlier
open operation is still valid, i.e., that no transaction has acquired
the object in-between.
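
To make the cost of step (4) concrete, the following toy model counts
how many per-object checks incremental validation performs. It is a
sketch under simplifying assumptions: each object is reduced to a
version counter (a hypothetical stand-in for "has the Data Object
changed?"), and the names Reader, open_ro, and checks are invented for
the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a writer bumps the version when it acquires an object.
struct Object { unsigned version = 0; };

struct ReadEntry { const Object* obj; unsigned seen; };

struct Reader {
    std::vector<ReadEntry> read_list;
    std::size_t checks = 0;  // number of per-object validations performed

    // Re-check every earlier read; false means the transaction must abort.
    bool validate() {
        for (const ReadEntry& r : read_list) {
            ++checks;
            if (r.obj->version != r.seen) return false;
        }
        return true;
    }

    // Open for read: validate earlier reads, then record this one.
    bool open_ro(const Object& o) {
        if (!validate()) return false;
        read_list.push_back(ReadEntry{&o, o.version});
        return true;
    }
};
```

Opening n objects this way performs 0 + 1 + ... + (n-1) = n(n-1)/2
checks in total, which is the aggregate quadratic cost that visible
readers are meant to avoid.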
To effect an eager acquire, the transaction: 
1. reads the Object Header's New Data pointer. 
2. identifies the current Data Object, as described above. 
3. allocates a new Data Object, copies data from the old to the 
new, and initializes the Owner and Old Data fields. 
4. uses a CAS to update the header's New Data pointer to refer to 
the new Data Object. 
5. adds the object to the transaction's private write list, so the 
header can be cleaned up on abort. 
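
The five steps above can be sketched as follows. This is an
illustration only: the types and the function eager_acquire are
assumptions, the copy is made with plain new instead of RSTM's own
allocator, and the sketch handles only the case in which the header is
clean (conflict handling is omitted).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types for illustration; not the RSTM source.
struct TxDescriptor { int id; };

struct DataObject {
    TxDescriptor* owner;
    DataObject*   old_data;
    int           value;
};

struct ObjectHeader {
    std::atomic<std::uintptr_t> new_data{0};
};

struct Transaction {
    TxDescriptor desc{0};
    std::vector<ObjectHeader*> write_list;
};

// Returns the new private copy, or nullptr if the object is owned by
// someone else or the CAS lost a race (the caller would start over).
inline DataObject* eager_acquire(Transaction& tx, ObjectHeader& h) {
    std::uintptr_t seen = h.new_data.load();                       // step 1
    if (seen & 1u) return nullptr;     // dirty header: not handled here
    DataObject* cur = reinterpret_cast<DataObject*>(seen);         // step 2
    DataObject* copy = new DataObject{&tx.desc, cur, cur->value};  // step 3
    std::uintptr_t desired = reinterpret_cast<std::uintptr_t>(copy) | 1u;
    if (!h.new_data.compare_exchange_strong(seen, desired)) {      // step 4
        delete copy;                   // never became visible to anyone
        return nullptr;
    }
    tx.write_list.push_back(&h);                                   // step 5
    return copy;
}
```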
As in DSTM, a transaction invokes a contention manager if it 
finds that an object it wishes to acquire is already owned by some 
other in-progress transaction. The manager returns an indication 
of whether the transaction's thread should abort the competitor, 
abort itself, or wait for a while in the hope that the competitor will 
complete. 
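
The three possible answers of a contention manager can be captured in
a small interface. The sketch below is a toy policy written for this
example; it is not Polka or any manager from the cited work, and the
names Decision, PatientManager, and on_conflict are assumptions.

```cpp
#include <cassert>

// The three outcomes named in the text.
enum class Decision { AbortOther, AbortSelf, Wait };

// Toy policy: back off a bounded number of times, then favor the
// older of the two transactions.
struct PatientManager {
    int max_waits;     // how many times to wait before acting
    int waits_so_far;

    explicit PatientManager(int limit) : max_waits(limit), waits_so_far(0) {}

    // Called when an acquire finds the object owned by an Active competitor.
    Decision on_conflict(bool competitor_is_older) {
        if (waits_so_far < max_waits) {
            ++waits_so_far;
            return Decision::Wait;
        }
        waits_so_far = 0;  // patience exhausted: decide, then reset
        return competitor_is_older ? Decision::AbortSelf
                                   : Decision::AbortOther;
    }
};
```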
3.2 Visible Readers 
Visible readers serve to avoid the aggregate quadratic cost of incre- 
mentally validating invisible reads. A writer will abort all visible 
readers before acquiring an object, so if a transaction's status is 
still Active, it can be sure that its visible reads are still valid. At 
first blush one might think that the list of readers associated with 
an object would need to be read or written together with other ob- 
ject metadata, atomically. Indeed, recent versions of DSTM ensure 
such atomicity. We can obtain a cheaper implementation, however, 
if we merely ensure that the reader list covers the true set of visible 
readers, i.e., that it includes any transaction that has a pointer to one of
the object's Data Objects and does not believe it needs to validate 
that pointer when opening other objects. Any other transaction that 
appears in the reader list is vulnerable to being aborted spuriously, 
but if we can ensure that such inappropriate listing is temporary, 
then obstruction freedom will not be compromised. 
To effect this heuristic covering, we reserve room in the Object
Header for a modest number of pointers to visible reader Transaction
Descriptors. We also arrange for each transaction to maintain
a pair of private read lists: one for objects read invisibly and one 
for objects read visibly. When a transaction T opens an object and 
wishes to be a visible reader, it reads the New Data pointer and 
identifies the current Data Object as usual. T then searches through 
the list of visible readers for an empty slot, into which it attempts to 
CAS a pointer to its own Transaction Descriptor. If it can't find an 
empty slot, it adds the object to its invisible read list (for incremen- 
tal validation). Otherwise T double-checks the New Data pointer 
to detect races with recently arriving writers, and adds the object to 
its visible read list (for post-transaction cleanup). If the New Data 
pointer has changed, T aborts itself. 
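
A reader's side of this protocol might be sketched as follows. The slot
count, the type names, and the function open_visible are assumptions
made for the example; the real header layout is the one shown in
Figure 3.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified types for illustration; not the RSTM source.
struct TxDescriptor { int id; };

constexpr int kReaderSlots = 4;  // "modest number" of slots; value assumed

struct ObjectHeader {
    std::atomic<std::uintptr_t> new_data{0};
    std::array<std::atomic<TxDescriptor*>, kReaderSlots> readers;
    ObjectHeader() { for (auto& r : readers) r.store(nullptr); }
};

enum class OpenResult { Visible, Invisible, Abort };

// Try to become a visible reader of the object behind h.
inline OpenResult open_visible(ObjectHeader& h, TxDescriptor* self) {
    std::uintptr_t before = h.new_data.load();  // identify current version
    for (auto& slot : h.readers) {
        TxDescriptor* empty = nullptr;
        if (slot.compare_exchange_strong(empty, self)) {
            // Double-check for a writer that arrived in the meantime.
            // On Abort the caller's abort path must still free the slot.
            if (h.new_data.load() != before) return OpenResult::Abort;
            return OpenResult::Visible;   // goes on the visible read list
        }
    }
    return OpenResult::Invisible;  // no free slot: validate incrementally
}
```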
For its part, a writer peruses the visible reader list immediately 
before acquiring the object, asking the contention manager for per- 
mission to abort each reader. If successful, it peruses the list again 
immediately after acquiring the object, aborting each transaction
it finds. Because readers double-check the New Data pointer after 
adding themselves to the reader list, and writers peruse the reader 
list after changing the New Data pointer, there is no chance that a 
visible reader will escape a writer's notice. 
After finishing a transaction, a thread t uninstalls itself from 
each object in its visible read list. If a writer w peruses the reader 
list before t completes this cleanup, w may abort a transaction be- 
ing executed by t at some arbitrary subsequent time. However, be- 
cause t removes itself from the list before starting another transac- 
tion, the maximum possible number of spurious aborts is bounded 
by the number of transactions in the system. In practice we can 
expect such aborts to be extremely rare. 
3.3 Dynamic Storage Management 
While RSTM requires no dynamic memory allocation for Object 
Headers, Transaction Descriptors, or (in the common case) private 
read and write lists, it does require it for Data Objects. As noted 
in Section 3.1, a writer that has completed its transaction and 
cleaned up the headers of acquired objects knows that the old (if 
committed) or new (if aborted) versions of the data will never be 
needed again. Transactions still in progress, however, may still 
access those versions for an indefinite time, if they have not yet 
noticed the writer's status. 
In STM systems for Java, C#, and Haskell, one simply counts 
on the garbage collector to eventually reclaim Data Objects that are 
no longer accessible. We need something comparable in C++. In 
principle one could create a tracing collector for Data Objects, but 
there is a simpler solution: we mark superseded objects as "retired" 
but we delay reclamation of the space until we can be sure that it is 
no longer in use by any extant transaction. 
Each thread in RSTM maintains a set of free lists of blocks for 
several common sizes, from which it allocates objects as needed. 
Threads also maintain a "limbo" list consisting of retired objects. 
During post-transaction cleanup, a writer adds each deallocated 
object to the limbo list of the thread that initially created it (the 
Owner field of the Data Object suffices to identify the creator). To 
know when retired objects can safely be reclaimed, we maintain a 
global timestamp array that indicates, for every thread, the serial 
number of the current transaction (or zero if the thread is not 
in a transaction). Periodically each thread captures a snapshot of 
the timestamp array, associates it with its limbo list, and starts 
a new list. It then inspects any lists it captured in the past, and 
reclaims the objects in any lists that date from a previous "epoch",
i.e., those whose associated snapshot is dominated by the current
timestamp. Similar storage managers have been designed by Fraser
for OSTM [4, Section 5.2.3] and by Hudson et al. for McRT [17].
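
The snapshot test can be illustrated with a small sketch. The exact
meaning of "dominated" is not spelled out in the text, so the reading
used here is an assumption: a snapshot is dominated once every thread
that was inside a transaction when the snapshot was taken has since
left that transaction. The names Snapshot, LimboList, dominated, and
reclaim are likewise invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Entry i holds the serial number of thread i's current transaction,
// or 0 if thread i is not in a transaction.
using Snapshot = std::vector<unsigned long>;

// Assumed interpretation: no thread is still in the transaction it was
// running when the snapshot was captured.
inline bool dominated(const Snapshot& snap, const Snapshot& now) {
    for (std::size_t i = 0; i < snap.size(); ++i) {
        if (snap[i] != 0 && now[i] == snap[i]) return false;
    }
    return true;
}

struct LimboList {
    Snapshot          taken_at;  // timestamp array at capture time
    std::vector<int*> retired;   // blocks awaiting reclamation
};

// Free every captured list whose snapshot is dominated; returns the
// number of blocks reclaimed.
inline std::size_t reclaim(std::vector<LimboList>& captured,
                           const Snapshot& now) {
    std::size_t freed = 0;
    for (std::size_t i = 0; i < captured.size();) {
        if (dominated(captured[i].taken_at, now)) {
            for (int* p : captured[i].retired) { delete p; ++freed; }
            captured.erase(captured.begin() + static_cast<std::ptrdiff_t>(i));
        } else {
            ++i;
        }
    }
    return freed;
}
```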
As described in more detail in Section 3.4 below, the RSTM 
API includes a clone() method that the user can override, if
desired, to create new copies of Data Objects in some application- 
specific way (the default implementation simply copies bits, and 
must be overridden for objects with internal pointers or when deep 
copying is needed). The runtime also keeps transaction-local lists 
of created and deleted objects. On commit we move "deleted" 
objects to the appropriate limbo list, making them available for 
eventual reclamation. On abort, we reclaim (immediately) all newly
created objects (they're guaranteed not to be visible yet to any other 
transaction), and forget the list of objects to be deleted. This defers 
allocation and reclamation to the end of a transaction, and preserves 
isolation. 
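
The commit and abort handling of the transaction-local lists can be
sketched as below. The class TxAllocLog and its member names are
hypothetical, and plain int blocks stand in for Data Objects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; names are illustrative, not the RSTM API.
struct TxAllocLog {
    std::vector<int*>  created;  // allocated inside the transaction
    std::vector<int*>  deleted;  // logically deleted inside the transaction
    std::vector<int*>& limbo;    // limbo list, reclaimed in a later epoch

    explicit TxAllocLog(std::vector<int*>& l) : limbo(l) {}

    int* tx_new(int v) {
        int* p = new int(v);
        created.push_back(p);
        return p;
    }
    void tx_delete(int* p) { deleted.push_back(p); }  // deferred

    void on_commit() {
        // Deletions take effect: hand the blocks to the limbo list.
        limbo.insert(limbo.end(), deleted.begin(), deleted.end());
        created.clear();
        deleted.clear();
    }
    void on_abort() {
        // Nothing created here was visible to others, so free it now;
        // the pending deletions are simply forgotten.
        for (int* p : created) delete p;
        created.clear();
        deleted.clear();
    }
};
```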
3.4 C++ API 
RSTM currently works only for programs based on pthreads.
Any shared object must be of class Shared<T>, where T is a type
descended from Object<T>. Both Object<T> and Shared<T>
live in namespace stm. A pthread must call stm::init() before
executing its first transaction.
Outside a transaction, the only safe reference to a sharable object
is a Shared<T>*. Such a reference is opaque: no T operations
can be performed on a variable of type Shared<T>. Within
a transaction, however, code can use the open_RO() and
open_RW() methods of Shared<T> to obtain pointers of type
const T* and T*, respectively. These can be safely used only
within the transaction; it is incorrect for a program to use a pointer
to T or to one of T's fields from non-transactional code.
Transactions are bracketed by BEGIN_TRANSACTION ... END_TRANSACTION
macros. These initialize and finalize the transaction's
metadata. They also establish a handler for the stm::aborted
exception, which is thrown by RSTM in the event of failure
of an open-time validation or commit-time CAS. We currently
use a subsumption model for transaction nesting.
Changes made by a transaction using a T* obtained from
open_RW() will become visible if and only if the transaction commits.
Moreover, if the transaction commits, values read through a
const T* or T* pointer obtained from open_RO() or open_RW()
are guaranteed to have been valid as of the time of the commit.
Changes made to any other objects will become visible to other 
threads as soon as they are written back to memory, just as they 
would in a nontransactional program; transactional semantics ap- 
ply only to Shared<T> objects. Nontransactional objects avoid 
the cost of bookkeeping for variables initialized within a trans- 
action and ignored outside. They also allow a program to "leak" 
information out of transactions when desired, e.g. for debugging or 
profiling purposes. It is the programmer's responsibility to ensure 
that such leaks do not compromise program correctness.
In a similar vein, an early release operation [13] allows a transaction
to "forget" an object it has read using open_RO(), thereby
avoiding conflict with any concurrent writer and (in the case of 
invisible reads) reducing the overhead of incremental validation 
when opening additional objects. Because it disables automatic 
consistency checking, early release should be used only when the 
programmer is sure that it will not compromise correctness. 
Shared<T> objects define the granularity of concurrency in a 
transactional program. With eager conflict detection, transactions 
accessing sets of objects A and B can proceed in parallel so long 
as A ∩ B is empty or consists only of objects opened in read-only
mode. Conflicts between transactions are resolved by a contention 
manager. The results in Section 4 use our "Polka" contention man- 
ager [27]. 
Storage Management. Class Shared<T> provides two constructors:
Shared<T>() creates a new T object and initializes it
using the default (zero-argument) constructor. Shared<T>(T*)
puts a transaction-safe opaque wrapper around a pre-existing T,
which the programmer may have created using an arbitrary constructor.
Later, Shared<T>::operator delete will reclaim the
wrapped T object; user code should never delete this object directly.
Class Object<T>, from which T must be derived, overloads
operator new and operator delete to use the memory management
system described in Section 3.3. If a T constructor needs
to allocate additional space, it must use the C++ placement new
in conjunction with special malloc and free routines, available
in namespace stm_gc. For convenience in using the Standard
Template Library (STL), these are readily encapsulated in an
allocator object.
As described in Section 3.3, RSTM delays updates until commit 
time by performing them on a "clone" of a to-be-written object. 
By default, the system creates these clones via bit-wise copy. The 
user can alter this behavior by overriding Object<T>::clone().
If any action needs to be performed when a clone is discarded,
the user should also override Object<T>::deactivate(). The
default behavior is a no-op.
void intset::insert(int val) {
  BEGIN_TRANSACTION;
  const node* previous = head->open_RO();
      // points to sentinel node
  const node* current = previous->next->open_RO();
      // points to first real node
  while (current != NULL) {
    if (current->val >= val) break;
    previous = current;
    current = current->next->open_RO();
  }
  if (!current || current->val > val) {
    node* n = new node(val, current->shared());
        // uses Object<T>::operator new
    previous->open_RW()->next = new Shared<node>(n);
  }
  END_TRANSACTION;
}
Figure 4. Insertion in a sorted linked list using RSTM. 
Calls to stm_gc::malloc, stm_gc::free, Object<T>::operator new,
Shared<T>::operator new, and Shared<T>::operator delete
become permanent only on commit. The
first two calls (together with placement new) allow the programmer
to safely allocate and deallocate memory inside transactions.
If abort-time cleanup is required for some other reason, RSTM provides
an ON_RETRY macro that can be used at the outermost level
of a transaction:
BEGIN_TRANSACTION;
  // transaction code goes here
ON_RETRY {
  // cleanup code goes here
}
END_TRANSACTION;
An Example. Figure 4 contains code for a simple operation on 
a concurrent linked list. It assumes a singly-linked node class, 
for which the default clone() and deactivate() methods of
Object<node> suffice.
Because node::next must be of type Shared<node>* rather
than node*, but we typically manipulate objects within a transaction
using pointers obtained from open_RO() and open_RW(),
Object<T> provides a shared() method that returns a pointer to
the Shared<T> with which this is associated.
Our code traverses the list, opening objects in read-only mode, 
until it finds the proper place to insert. It then re-opens the object 
whose next pointer it needs to modify in read-write mode. For
convenience, Object<T> provides an open_RW() method that returns
this->shared()->open_RW(). The list traversal code depends
on the fact that open_RO() and open_RW() return NULL
when invoked on a Shared<T> that is already NULL.
A clever programmer might observe that in this particular ap- 
plication there is no reason to insist that nodes near the beginning 
of the list remain unchanged while we insert a node near the end of 
the list. It is possible to prove in this particular application that our 
code would still be linearizable if we were to release these early 
nodes as we move past them [13]. Though we do not use it in Figure 4,
Object<T> provides a release() method that constitutes a
promise on the part of the programmer that the program will still be
correct if some other transaction modifies this before the current
transaction completes. Calls to release() constitute an unsafe optimization
that must be used with care, but can provide significant
performance benefits in certain cases.
4. Performance Results 
In this section we compare the performance of RSTM to coarse- 
grain locking (in C++) and to our ASTM on a series of microbench- 
marks. Our results show that RSTM outperforms Java ASTM in all 
tested microbenchmarks. Given our previous results [21], this sug- 
gests that it would also outperform both DSTM and OSTM. At 
the same time, coarse-grain locks remain significantly faster than 
RSTM at low levels of contention. Within the RSTM results, we 
evaluate tradeoffs between visible and invisible readers, and be- 
tween eager and lazy acquire. We also show that an RSTM-based 
linked list implementation that uses early release outperforms a 
fine-grain lock based implementation even with low contention. 
Evaluation Framework. Our experiments were conducted on a 
16-processor SunFire 6800, a cache-coherent multiprocessor with 
1.2GHz UltraSPARC III processors. RSTM and C++ ASTM were
compiled using GCC v3.4.4 at the -O3 optimization level. The
Java ASTM was tested using the Java 5 HotSpot VM. Experiments 
with sequential and coarse-grain locking applications show similar 
performance for the ASTM implementations: any penalty Java 
pays for run-time semantic checks, virtual method dispatch, etc., is 
overcome by aggressive just-in-time optimization (e.g., inlining of 
functions from separate modules). We measured throughput over 
a period of 10 seconds for each benchmark, varying the number 
of worker threads from 1 to 28. Results were averaged over a set 
of 3 test runs. In all experiments we used our Polka contention 
manager for ASTM and RSTM [27]. We tested RSTM with each 
combination of eager/lazy acquire and visible/invisible reads.
Benchmarks. Our microbenchmarks include three variants of an 
integer set (a sorted linked list, a hash table with 256 buckets, 
and a red-black tree), an adjacency list-based undirected graph, 
and a web cache simulation using least-frequently-used page re- 
placement (LFUCache). In the integer set benchmarks every active 
thread performs a 1:1:1 mix of insert, delete, and lookup operations.
The graph benchmark performs a 1:1 mix of vertex insert
and remove operations. 
In the LinkedList benchmark, transactions traverse a sorted list 
to locate an insertion/deletion point, opening list nodes in read-only
mode, and early releasing them after proceeding down the
list. Once found, the target node is reopened for read-write access. 
The values in the linked list nodes are limited to the range 0..255. 
The HashTable benchmark consists of 256 buckets with overflow 
chains. The values range from 0 to 255. Our tests perform roughly 
equal numbers of insert and delete operations, so the table is about 
50% full most of the time. In the red-black tree (RBTree) a transaction
first searches down the tree, opening nodes in read-only mode.
After the target node is located the transaction opens it in read-write 
mode and goes back up the tree opening nodes that are relevant to 
the height balancing process (also in read-write mode). Our RBTree 
workload uses node values in the range 0..65535.
In the random graph (RandomGraph) benchmark, each newly 
inserted vertex initially receives up to 4 randomly selected neigh- 
bors. Vertex neighbor sets change over time as existing nodes are 
deleted and new nodes join the graph. The graph is implemented 
as a sorted adjacency list. A transaction looks up the target node to 
modify (opening intermediate nodes in read-only mode) and opens 
it in read-write mode. Subsequently, the transaction looks up each 
affected neighbor of the target node, and then modifies that neighbor's
neighbor list to insert/delete the target node in that list. Transactions
in RandomGraph are quite complex. They tend to overlap
heavily with one another, and different transactions may open the 
same nodes in opposite order. 
LFUCache [26] uses a large (2048-entry) array-based index and 
a smaller (255-entry) priority queue to track the most frequently 
accessed pages in a simulated web cache. When re-heapifying the 
queue, we always swap a value-one node with any value-one child; 
this induces hysteresis and gives a page a chance to accumulate 
cache hits. Pages to be accessed are randomly chosen from a Zipf 
distribution with exponent 2. So, for page i, the cumulative probability
of a transaction accessing that page is $p_c(i) \propto \sum_{0 < j \le i} j^{-2}$.
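The cumulative distribution above can be turned into a page sampler by inverting it. The following is a minimal sketch under stated assumptions: the function names `zipf_cdf` and `page_for` and the page count passed in are hypothetical, since the paper does not say how the benchmark draws its pages.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cumulative distribution over pages 1..n, with p_c(i) proportional to the
// partial sum of 1/j^2 for 0 < j <= i (Zipf with exponent 2). Assumes n >= 1.
std::vector<double> zipf_cdf(std::size_t n) {
    std::vector<double> cdf(n);
    double sum = 0.0;
    for (std::size_t j = 1; j <= n; ++j) {
        sum += 1.0 / (static_cast<double>(j) * static_cast<double>(j));
        cdf[j - 1] = sum;
    }
    for (double& c : cdf) c /= sum;  // normalize so the last entry is 1
    return cdf;
}

// Map a uniform draw u in [0, 1) to a 1-based page index by finding the
// first page whose cumulative probability exceeds u.
std::size_t page_for(const std::vector<double>& cdf, double u) {
    auto it = std::upper_bound(cdf.begin(), cdf.end(), u);
    if (it == cdf.end()) --it;  // guard against rounding at the top end
    return static_cast<std::size_t>(it - cdf.begin()) + 1;
}
```

With exponent 2 the first page alone receives roughly 60% of all accesses, which is consistent with the later observation that most LFUCache transactions write to the same small set of nodes.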
4.1 Speedup 
Speedup graphs appear in Figures 5 through 9. The y axis in each 
Figure plots transactions per second on a log scale. 
Comparison with ASTM. In order to provide a fair evaluation of 
RSTM against ASTM, we present results for two different ASTM 
runtimes. The first, Java ASTM, is our original system; the second 
reimplements it in C++. The C++ ASTM and RSTM implemen- 
tations use the same allocator, bookkeeping data structures, con- 
tention managers, and benchmark code; they differ only in meta- 
data organization. Consequently, any performance difference is a 
direct consequence of metadata design tradeoffs. 
RSTM consistently outperforms Java ASTM. We attribute this 
performance to reduced cache misses due to improved metadata 
layout; lower memory management overhead due to static trans- 
action descriptors, merged Locator and Data Object structures, and 
efficient epoch-based collection of Data Objects; and more efficient 
implementation of private read and write sets. ASTM uses a Java 
HashMap to store these sets, whereas RSTM places the first 64 en- 
tries in preallocated space, and allocates a single dynamic block 
for every additional 64 entries. The HashMap makes lookups fast, 
but RSTM bundles lookup into the validation traversal, hiding its 
cost in the invisible reader case. Lookups become expensive only 
when the same set of objects is repeatedly accessed by a transaction 
in read-only mode. Overall, RSTM has significantly less memory 
management overhead than ASTM. 
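The chunked layout of the private read and write sets can be sketched as follows. This is an illustration only, not RSTM's code: the class name `ChunkedSet`, its methods, and the linear `contains` scan are assumptions; the only details taken from the text are the 64 preallocated entries and the one dynamic block per additional 64 entries.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: a per-transaction set of object pointers that keeps
// the first 64 entries inline and adds one heap block per further 64.
class ChunkedSet {
public:
    static constexpr std::size_t kChunk = 64;

    void insert(const void* obj) {
        std::size_t slot = size_ % kChunk;
        if (size_ < kChunk) {
            inline_[slot] = obj;  // no allocation for small transactions
        } else {
            std::size_t block = size_ / kChunk - 1;
            if (block == overflow_.size())
                overflow_.push_back(std::make_unique<Block>());
            (*overflow_[block])[slot] = obj;
        }
        ++size_;
    }

    // Linear scan over all entries; the text notes that RSTM instead
    // bundles lookup into the validation traversal.
    bool contains(const void* obj) const {
        for (std::size_t i = 0; i < size_; ++i)
            if (at(i) == obj) return true;
        return false;
    }

    void clear() { size_ = 0; }  // heap blocks are kept for reuse
    std::size_t size() const { return size_; }
    std::size_t heap_blocks() const { return overflow_.size(); }

private:
    using Block = std::array<const void*, kChunk>;

    const void* at(std::size_t i) const {
        return i < kChunk ? inline_[i]
                          : (*overflow_[i / kChunk - 1])[i % kChunk];
    }

    Block inline_{};
    std::vector<std::unique_ptr<Block>> overflow_;
    std::size_t size_ = 0;
};
```

A transaction that touches at most 64 objects never allocates in this sketch, and a transaction that touches 130 objects allocates exactly two blocks, which it can reuse after `clear()`.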
When we consider the C++ ASTM, we see that both language 
choice and metadata layout are important. In RandomGraph, C++ 
ASTM gives an order of magnitude improvement over Java, though 
it still fares much worse than RSTM. HashTable, RBTree, and
LFUCache are less dramatic, with C++ ASTM offering only a 
small constant improvement over Java. We attribute the unexpect- 
edly close performance of Java and C++ ASTM primarily to the 
benefits that HotSpot compilation and dynamic inlining offer, and
suspect that RandomGraph's poor performance in Java ASTM is 
due to the cost of general-purpose garbage collection for large, 
highly connected data structures, as opposed to our lightweight 
reclamation scheme in C++ ASTM. 
Surprisingly, C++ ASTM slightly outperforms RSTM in the 
LinkedList benchmark. This difference is due to a minor difference 
in how the two systems reuse their descriptor objects. In C++ 
ASTM, a transaction does not clean up the objects it acquires 
on commit, while in RSTM it does. Since it is highly likely that 
transactions will overlap, the RSTM cleaning step will likely be 
redundant, but will cause cache misses in all transactions when 
they next validate. This manifests as a small constant overhead in 
RSTM. 
Coarse-Grain Locks and Scalability. In all five benchmarks, 
coarse-grain locking (CGL) is significantly faster than RSTM at 
low levels of contention. The performance gap ranges from 2X (in 
the case of HashTable, Figure 6), to 20X (in case of RandomGraph, 
Figure 8). Generally, the size of the gap is proportional to the length 
of the transaction: validation overhead (for invisible reads and for 
lazy acquire) and contention due to bookkeeping (for visible reads) 
increase with the length of the transaction. We are currently explor- 
ing several heuristic optimizations (such as the conflicts counter 
idea of Lev and Moir [18]) to reduce these overheads. We are also 
exploring both hardware and compiler assists. 
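A back-of-the-envelope calculation suggests why the gap grows with transaction length. Assume, as a simplification that is not a measurement from this paper, that a transaction with invisible reads revalidates every previously opened object each time it opens a new one. A transaction that opens n objects then performs the following number of validation checks:

```latex
\sum_{k=1}^{n} (k-1) \;=\; \frac{n(n-1)}{2}
```

Under that assumption, doubling the number of objects read roughly quadruples the validation work, whereas a coarse-grain lock pays a fixed acquisition cost regardless of n.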
Figure 5. RBTree. Note the log scale on the y axis in all performance graphs. (Plot of transactions per second against thread count; curves include Java ASTM, C++ ASTM, Vis/Eager, Invis/Eager, and Invis/Lazy.)
Figure 6. HashTable. (Plot of transactions per second against thread count, 0 to 30.)
With increasing numbers of threads, RSTM quickly overtakes
CGL in benchmarks that permit concurrency. The crossover occurs
with as few as 3 concurrent threads in HashTable. For RBTree,
where transactions are larger, RSTM incurs significant bookkeeping
and validation costs, and the crossover moves out to 7-14
threads, depending on protocol variant. In LinkedList the faster
RSTM variants match CGL at 14 threads; the slower ones cannot.
In the LFUCache and RandomGraph benchmarks, neither of
which admits any real concurrency among transactions, CGL is
always faster than transactional memory.
RSTM shows continued speedup out to the full size of the
machine (16 processors) in RBTree, HashTable, and LinkedList.
LFUCache and RandomGraph, by contrast, have transactions that 
permit essentially no concurrency. They constitute something of 
a "stress test": for applications such as these, CGL offers all the 
concurrency there is. 
Comparison with Fine-Grain Locks. To assess the benefit of 
early release, we compare our LinkedList benchmark to a "hand- 
over-hand" fine-grain locking (FGL) implementation in which each 
list node has a private lock that a thread must acquire in order to 
access the node, and in which threads release previously-acquired 
locks as they advance through the list. Figure 7 includes this additional
curve.
Figure 7. LinkedList with early release. (Plot of transactions per second against thread count, 0 to 30; curves include Java ASTM, C++ ASTM, Coarse-Grained Locks, and Fine-Grained Locks.)
At low thread counts, the performance of FGL is significantly better than that of RSTM. With increasing concurrency,
however, the versions of RSTM with invisible reads catch up to and 
surpass FGL. 
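The hand-over-hand discipline described in this comparison can be sketched as follows. This is a hypothetical illustration, not the benchmark's code: the names `FglList`, `Node`, and `locate` are assumptions, and the sketch uses sentinel head and tail nodes so that stored values must lie strictly between `INT_MIN` and `INT_MAX`.

```cpp
#include <climits>
#include <mutex>

// Hypothetical sketch of a sorted integer list with one private lock per node.
struct Node {
    int value;
    Node* next;
    std::mutex lock;
    Node(int v, Node* n) : value(v), next(n) {}
};

class FglList {
public:
    FglList() : head_(new Node(INT_MIN, new Node(INT_MAX, nullptr))) {}
    ~FglList() {  // assumes no other thread is still using the list
        for (Node* n = head_; n != nullptr;) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }
    FglList(const FglList&) = delete;
    FglList& operator=(const FglList&) = delete;

    // Returns false if the value was already present.
    bool insert(int v) {
        Node *prev, *curr;
        locate(v, prev, curr);
        bool added = false;
        if (curr->value != v) {
            prev->next = new Node(v, curr);
            added = true;
        }
        curr->lock.unlock();
        prev->lock.unlock();
        return added;
    }

    bool contains(int v) {
        Node *prev, *curr;
        locate(v, prev, curr);
        bool found = (curr->value == v);
        curr->lock.unlock();
        prev->lock.unlock();
        return found;
    }

private:
    // On return, prev and curr are both locked, prev->next == curr, and
    // curr is the first node whose value is >= v.
    void locate(int v, Node*& prev, Node*& curr) {
        prev = head_;
        prev->lock.lock();
        curr = prev->next;
        curr->lock.lock();
        while (curr->value < v) {
            prev->lock.unlock();  // release the lock that is no longer needed
            prev = curr;          // still locked, so its next pointer is stable
            curr = curr->next;
            curr->lock.lock();
        }
    }

    Node* head_;
};
```

Every traversal writes to the lock word of each node it passes, even for a lookup. That is the source of the cache contention the text attributes to both FGL and visible readers.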
Throughput for FGL drops dramatically when the thread count 
exceeds the number of processors in the machine. At any given 
time, several threads hold a lock and the likelihood of lock holder 
preemption is high; this leads directly to convoying. A thread that 
waits behind a preempted peer has a high probability of waiting 
behind another preempted peer before it reaches the end of the list. 
The visible read RSTMs start out performing better than the 
invisible read versions on a single thread, but their relative performance
degrades as concurrency increases. Note that both visible
read transactions and the FGL implementation must write to
each list object. This introduces cache-contention-induced overhead
among concurrent transactions. Invisible read-based transactions
scale better because they avoid this overhead.
Conflict Detection Variants. Our work on ASTM [21] contained 
a preliminary analysis of eager and lazy acquire strategies. We con- 
tinue that analysis here. In particular, we identify a new kind of 
workload, exemplified by RandomGraph (Figure 8), in which lazy 
acquire outperforms eager acquire. The CGL version of Random- 
Graph outperforms RSTM by a large margin; we attribute the rel- 
atively poor performance of RSTM to high validation and book- 
keeping costs. ASTM performs worst due to its additional memory 
management overheads. 
In RBTree, LFUCache, and the two LinkedList variants, visi- 
ble readers incur a noticeable penalty in moving from one to two 
threads. The same phenomenon occurs with fine-grain locks in 
LinkedList with early release. We attribute this to cache invalida- 
tions caused by updates to visible reader lists (or locks). The ef- 
fect does not appear (at least not as clearly) in RandomGraph and 
HashTable, because they lack a single location (tree root, list head) 
accessed by all transactions. Visible readers remain slower than in- 
visible readers at all thread counts in RBTree and LinkedList with 
early release. In HashTable they remain slightly slower out to the 
size of the machine, at which point the curves merge with those 
of invisible readers. Eager acquire enjoys a modest advantage over 
lazy acquire in these benchmarks (remember the log scale axis); it 
avoids performing useless work in doomed transactions. 
Figure 8. RandomGraph. (Plot of transactions per second against thread count, 0 to 30.)
Figure 9. LFUCache. (Plot of transactions per second against thread count, 0 to 30.)
For a single-thread run of RandomGraph, the visible read versions
of RSTM slightly outperform the invisible read versions, primarily
due to the cost of validating a large number of invisibly
read objects. With increasing numbers of threads, lazy acquire versions
of RSTM (for both visible and invisible reads) outperform
their eager counterparts. The eager versions virtually livelock: the
window of contention in eager acquire versions is significantly 
larger than in lazy acquire versions. Consequently, transactions are 
exposed to transient interference, expend considerable energy in 
contention management, and only a few can make progress. With 
lazy acquire, the smaller window of contention (from deferred ob- 
ject acquisition) allows a larger proportion of transactions to make 
progress. The visible read version starts with a higher throughput 
at one thread, but the throughput reduces considerably due to cache 
contention with increasing concurrency. The invisible read version 
starts with lower throughput, which increases slightly since there is 
no cache contention overhead. Note that we cannot achieve scala- 
bility in RandomGraph since all transactions modify several nodes 
scattered around in the graph; they simultaneously access a large 
number of nodes in read-only mode (due to which there is signifi- 
cant overlap between read and write sets of these transactions). 
The poor performance of eager acquire in RandomGraph is a 
partial exception to the conclusions of our previous work [27], in 
which the Polka contention manager was found to be robust across 
a wide range of benchmarks. This is because Polka assumes that 
writes are more important than reads, and writers can freely clob- 
ber readers without waiting for the readers to complete. The as- 
sumption works effectively for transactions that work in read-only 
followed by write-only phases, because the transaction in its write- 







































only phase is about to complete when it aborts a competing reader. 
However, transactions in RandomGraph intersperse multiple writes 
within a large series of reads. Thus, a transaction performing a write 
is likely to do many reads thereafter and is vulnerable to abortion 
by another transaction's write. 
Transactions in LFUCache (Figure 9) are non-trivial but short. 
Due to the Zipf distribution, most transactions tend to write to the 
same small set of nodes. This basically serializes all transactions as 
can be seen in Figure 9. Lazy variants of RSTM outperform ASTM 
(as do eager variants with fewer than 15 threads), but coarse-grain 
locking continues to outperform RSTM. In related experiments (not 
reported in this paper) we observed that the eager RSTMs were 
more sensitive to the exponential backoff parameters in Polka 
than the lazy RSTMs, especially in write-dominated workloads 
such as LFUCache. With careful tuning, we were able to make 
the eager RSTMs perform almost as well as the lazy RSTMs up 
to a certain number of threads; after this point, the eager RSTMs' 
throughput dropped off. This reinforces the notion that transaction 
implementations that use eager acquire semantics are generally 
more sensitive to contention management than those that use lazy 
acquire. 
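To make the notion of backoff parameters concrete, here is a generic randomized exponential backoff helper. It is a sketch only: the structure `BackoffParams`, its default values, and the function names are hypothetical, and the paper does not give Polka's actual parameters or code.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Hypothetical tuning knobs; values are illustrative, not Polka's.
// max_exponent is assumed small enough that the shift below cannot overflow.
struct BackoffParams {
    std::uint64_t base_ns = 128;  // upper bound of the first wait window
    unsigned max_exponent = 16;   // cap on the number of doublings
};

// Upper bound of the wait window after `attempt` failed tries (0-based).
std::uint64_t backoff_window_ns(const BackoffParams& p, unsigned attempt) {
    unsigned e = std::min(attempt, p.max_exponent);
    return p.base_ns << e;
}

// Draw a randomized wait in [0, window] so that competing threads do not
// retry in lockstep.
template <class Rng>
std::uint64_t backoff_wait_ns(const BackoffParams& p, unsigned attempt,
                              Rng& rng) {
    std::uniform_int_distribution<std::uint64_t> dist(
        0, backoff_window_ns(p, attempt));
    return dist(rng);
}
```

Both knobs change how long a transaction defers to an enemy before retrying or aborting it, which is one plausible reason why eager acquire, with its longer window of contention, would react more strongly to their settings.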
Summarizing, we find that for the microbenchmarks tested, and 
with our current contention managers (exemplified by Polka), invis- 
ible readers outperform visible readers in most cases. Noteworthy 
exceptions occur in the single-threaded case, where visible readers 
avoid the cost of validation without incurring cache misses due to 
contention with peer threads; and in RandomGraph, where a write 
often forces several other transactions to abort, each of which has 
many objects open in read-only mode. Eager acquire enjoys a mod- 
est advantage over lazy acquire in scalable benchmarks, but lazy ac- 
quire has a major advantage in RandomGraph and (at high thread 
counts) in LFUCache. By delaying the detection of conflicts it dra- 
matically increases the odds that some transaction will succeed. 
None of our RSTM contention managers currently take advan- 
tage of the opportunity to arbitrate conflicts between a writer and 
pre-existing visible readers. Exploiting this opportunity is a topic 
of future work. It is possible that better policies may shift the per- 
formance balance between visible and invisible readers. 
5. Conclusions 
In this paper we presented RSTM, a new, low-overhead software 
transactional memory for C++. In comparison to previous non- 
blocking STM systems, RSTM: 
1. uses static metadata whenever possible, significantly reducing 
the pressure on memory management. The only exception is 
private read and write lists for very large transactions. 
2. employs a novel metadata structure in which headers point di- 
rectly to objects that are stable (thereby reducing cache misses) 
while still providing constant-time access to objects that are be- 
ing modified. 
3. takes a novel conservative approach to visible reader lists, min- 
imizing the cost of insertions and removals. 
4. provides a variety of policies for conflict detection, allowing the 
system to be customized to a given workload. 
Like OSTM, RSTM employs a lightweight, epoch-based garbage
collection mechanism for dynamically allocated structures. Like 
DSTM, it employs modular, out-of-band contention management. 
Experimental results show that RSTM is significantly faster than 
our Java-based ASTM system, which was shown in previous work 
to match the faster of OSTM and DSTM across a variety of bench- 
marks. 
Our experimental results highlight the tradeoffs among conflict 
detection mechanisms, notably visible vs. invisible reads, and eager 
vs. lazy acquire. Despite the overhead of incremental validation, 
invisible reads appear to be faster in most cases. The exceptions 
are large uncontended transactions (in which visible reads induce 
no extra cache contention), and large contended transactions that 
spend significant time reading before performing writes that conflict
with each other's reads. For these latter transactions, lazy acquire
is even more important: by delaying the resolution of conflicts
among a set of complex transactions, it dramatically increases the 
odds of one of them actually succeeding. In smaller transactions 
the impact is significantly less pronounced: eager acquire some- 
times enjoys a modest performance advantage; much of the time 
they are tied. 
The lack of a clear-cut policy choice suggests that future work is 
warranted in conflict detection policy. We plan to develop adaptive 
strategies that base the choice of policy on the characteristics of the 
workload. We also plan to develop contention managers for RSTM 
that exploit knowledge of visible readers. The high cost of both 
incremental validation and visible-reader-induced cache contention 
suggests the need for additional work aimed at reducing these 
overheads. We are exploring both alternative software mechanisms 
and lightweight hardware support. 
Though STM systems still suffer by comparison to coarse-grain 
locks in the low-contention case, we believe that RSTM is one step 
closer to bridging the performance gap. With additional improve- 
ments, likely involving both compiler support and hardware accel- 
eration, it seems reasonable to hope that the gap may close com- 
pletely. Given the semantic advantages of transactions over locks, 
this strongly suggests a future in which transactions become the 
dominant synchronization mechanism for multithreaded systems. 
Acknowledgments 
The ideas in this paper benefited from discussions with Sand- 
hya Dwarkadas, Arrvindh Shriraman, and Vinod Sivasankaran. We 
would also like to thank the anonymous reviewers for many helpful 
suggestions. 
References 
[1] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and
S. Lie. Unbounded Transactional Memory. In Proc. of the 11th Intl.
Symp. on High Performance Computer Architecture, pages 316-327,
San Francisco, CA, Feb. 2005.
[2] R. Ennals. Software Transactional Memory Should Not Be
Obstruction-Free. Unpublished manuscript, Intel Research Cambridge,
2005. Available as http://www.cambridge.intel-research.net/~ennals/notlockfree.pdf.
[3] K. Fraser and T. Harris. Concurrent Programming Without Locks.
Submitted for publication, 2004. Available as
research.microsoft.com/~tharris/drafts/cpwl-submission.pdf.
[4] K. Fraser. Practical Lock-Freedom. Ph.D. dissertation, UCAM-
CL-TR-579, Computer Laboratory, University of Cambridge, Feb.
2004.
[5] R. Guerraoui, M. Herlihy, and B. Pochon. Polymorphic Contention
Management in SXM. In Proc. of the 19th Intl. Symp. on Distributed
Computing, Cracow, Poland, Sept. 2005.
[6] R. Guerraoui, M. Herlihy, M. Kapalka, and B. Pochon. Robust
Contention Management in Software Transactional Memory. In
Proc., Workshop on Synchronization and Concurrency in Object-
Oriented Languages, San Diego, CA, Oct. 2005. In conjunction with
OOPSLA'05.
[7] R. Guerraoui, M. Herlihy, and B. Pochon. Toward a Theory of
Transactional Contention Managers. In Proc. of the 24th ACM Symp.
on Principles of Distributed Computing, Las Vegas, NV, July 2005.


























[8] L. Hammond, V. Wong, M. Chen, B. Hertzberg, B. Carlstrom, M.
Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional
Memory Coherence and Consistency. In Proc. of the 31st Intl. Symp.
on Computer Architecture, München, Germany, June 2004.
[9] T. Harris and K. Fraser. Language Support for Lightweight
Transactions. In OOPSLA 2003 Conf. Proc., Anaheim, CA, Oct.
2003.
[10] T. Harris, S. Marlow, S. P. Jones, and M. Herlihy. Composable
Memory Transactions. In Proc. of the 10th ACM Symp. on Principles
and Practice of Parallel Programming, Chicago, IL, June 2005.
[11] T. Harris and K. Fraser. Revocable Locks for Non-Blocking
Programming. In Proc. of the 10th ACM Symp. on Principles and
Practice of Parallel Programming, Chicago, IL, June 2005.
[12] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-Free Synchronization:
Double-Ended Queues as an Example. In Proc. of the 23rd
Intl. Conf. on Distributed Computing Systems, Providence, RI, May
2003.
[13] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software
Transactional Memory for Dynamic-sized Data Structures. In Proc. of
the 22nd ACM Symp. on Principles of Distributed Computing, pages
92-101, Boston, MA, July 2003.
[14] M. P. Herlihy and J. M. Wing. Linearizability: A Correctness
Condition for Concurrent Objects. ACM Trans. on Programming
Languages and Systems, 12(3):463-492, July 1990.
[15] M. Herlihy. Wait-Free Synchronization. ACM Trans. on Programming
Languages and Systems, 13(1):124-149, Jan. 1991.
[16] M. Herlihy and J. E. Moss. Transactional Memory: Architectural
Support for Lock-Free Data Structures. In Proc. of the 20th Intl. Symp.
on Computer Architecture, pages 289-300, San Diego, CA, May 1993.
Expanded version available as CRL 92/07, DEC Cambridge Research
Laboratory, Dec. 1992.
[17] R. L. Hudson, B. Saha, A.-R. Adl-Tabatabai, and B. Hertzberg. A
Scalable Transactional Memory Allocator. In Proc. of the 2006 Intl.
Symp. on Memory Management, Ottawa, ON, Canada, June 2006.
[18] Y. Lev and M. Moir. Fast Read Sharing Mechanism for Software
Transactional Memory (poster paper). In Proc. of the 23rd ACM Symp.
on Principles of Distributed Computing, St. John's, NL, Canada, July
2004.
[19] V. J. Marathe and M. L. Scott. A Qualitative Survey of Modern
Software Transactional Memory Systems. TR 839, Dept. of Computer
Science, Univ. of Rochester, June 2004.
[20] V. J. Marathe, W. N. Scherer III, and M. L. Scott. Design Tradeoffs
in Modern Software Transactional Memory Systems. In Proc. of the
7th Workshop on Languages, Compilers, and Run-time Systems for
Scalable Computers, Houston, TX, Oct. 2004.
[21] V. J. Marathe, W. N. Scherer III, and M. L. Scott. Adaptive Software
Transactional Memory. In Proc. of the 19th Intl. Symp. on Distributed
Computing, Cracow, Poland, Sept. 2005.
[22] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood.
LogTM: Log-based Transactional Memory. In Proc. of the 12th Intl.
Symp. on High Performance Computer Architecture, Austin, TX, Feb.
2006.
[23] R. Rajwar and J. R. Goodman. Transactional Lock-Free Execution of
Lock-Based Programs. In Proc. of the 10th Intl. Conf. on Architectural
Support for Programming Languages and Operating Systems, pages
5-17, San Jose, CA, Oct. 2002.
[24] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing Transactional
Memory. In Proc. of the 32nd Intl. Symp. on Computer Architecture,
Madison, WI, June 2005.
[25] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C. Minh, and B.
Hertzberg. McRT-STM: A High Performance Software Transactional
Memory System for a Multi-Core Runtime. In Proc. of the 11th ACM
Symp. on Principles and Practice of Parallel Programming, New
York, NY, Mar. 2006.
[26] W. N. Scherer III and M. L. Scott. Contention Management in
Dynamic Software Transactional Memory. In Proc. of the ACM PODC
Workshop on Concurrency and Synchronization in Java Programs, St.
John's, NL, Canada, July 2004.
[27] W. N. Scherer III and M. L. Scott. Advanced Contention Management
for Dynamic Software Transactional Memory. In Proc. of the 24th
ACM Symp. on Principles of Distributed Computing, Las Vegas, NV,
July 2005.
[28] W. N. Scherer III and M. L. Scott. Randomization in STM Contention
Management (poster paper). In Proc. of the 24th ACM Symp. on
Principles of Distributed Computing, Las Vegas, NV, July 2005.
[29] A. Shriraman, V. J. Marathe, S. Dwarkadas, M. L. Scott, D. Eisenstat,
C. Heriot, W. N. Scherer III, and M. F. Spear. Hardware Acceleration
of Software Transactional Memory. TR 887, Dept. of Computer
Science, Univ. of Rochester, Dec. 2005, revised Mar. 2006.
[30] A. Shriraman, V. J. Marathe, S. Dwarkadas, M. L. Scott, D. Eisenstat,
C. Heriot, W. N. Scherer III, and M. F. Spear. Hardware Acceleration
of Software Transactional Memory. In ACM SIGPLAN Workshop
on Languages, Compilers, and Hardware Support for Transactional
Computing, Ottawa, ON, Canada, July 2006. Held in conjunction
with PLDI 2006. Expanded version available as TR 887, Dept. of
Computer Science, Univ. of Rochester, Dec. 2005, revised Mar. 2006.
[31] R. Wahbe, S. Lucco, T. Anderson, and S. Graham. Efficient Software-
Based Fault Isolation. In Proc. of the 14th ACM Symp. on Operating
Systems Principles, Asheville, NC, Dec. 1993.
[32] A. Welc, S. Jagannathan, and A. L. Hosking. Transactional Monitors
for Concurrent Objects. In Proc. of the 18th European Conf. on
Object-Oriented Programming, pages 519-542, June 2004.




Snapshot Isolation for Software Transactional Memory 
Torvald Riegel, Dresden University of Technology, Germany, torvald.riegel@tu-dresden.de
Christof Fetzer, Dresden University of Technology, Germany, christof.fetzer@tu-dresden.de
Pascal Felber, University of Neuchatel, Switzerland, pascal.felber@unine.ch
ABSTRACT 
Software transactional memory (STM) has been proposed 
to simplify the development and to increase the scalability 
of concurrent programs. One problem of existing STMs is 
that of having long-running read transactions co-exist with 
shorter update transactions. This problem is of practical im- 
portance and has so far not been addressed by other papers 
in this domain. We approach this problem by investigat- 
ing the performance of a STM using snapshot isolation and 
a novel lazy multi-version snapshot algorithm to decrease 
the validation costs - which can increase quadratically with 
the number of objects read in STMs with invisible reads. 
Our measurements demonstrate that snapshot isolation can 
increase throughput for workloads with long transactions. 
In comparison to other STMs with invisible reads, we can 
reduce the validation costs by using our lazy consistent snap- 
shot algorithm. 
1. INTRODUCTION 
Software transactional memory (STM) [20] has been introduced
as a means to support lightweight transactions
in concurrent applications. It provides programmers with
constructs to delimit transactional operations and implicitly
takes care of the correctness of concurrent accesses to
shared data. STM has been an active field of research over
the last few years, e.g., [11, 13, 7, 12, 18, 17, 4, 10, 8].
In typical application workloads one cannot always expect 
that all transactions are short. One would expect that ap- 
plications have a mix of long-running read transactions and 
short read or update transactions. One problem of exist- 
ing STMs is that of having long-running read transactions 
efficiently co-exist with shorter update transactions. STMs 
typically perform best when contention is low. For trans- 
actions one should expect that the probability of conflicts 
increases with the length of a transaction. This problem 
is of practical importance but has so far not yet been ad- 
dressed by the other papers in this domain. We address this 
problem by investigating the performance of a STM using 
snapshot isolation [1].
The key idea of snapshot isolation (a more precise description
is given below) is to provide each transaction T with
a consistent snapshot of all objects; all writes of T occur
atomically, but possibly at a later time than the time at
which the snapshot is valid. This decoupling of the reads
and the writes has the potential of increasing the transaction
throughput but gives application developers possibly
less ideal semantics than, say, STMs that guarantee serializability
[2] or linearizability [14].
Snapshot isolation (SI) has been used in the database domain
to address the analogous problem of dealing with long
read transactions in databases. STMs and databases are
sufficiently different that it is a priori not certain that
(P1) SI will improve the throughput of a STM sufficiently
and (P2) SI provides the right semantics for application
programmers. In this paper we focus on problem P1 and
will only briefly discuss P2. Note that engineering is about
tradeoffs and typically application developers are willing to
accept weaker (or, less ideal) semantics if the performance
gain is sufficiently high over stronger (or, more ideal) alternatives.
Hence, the answer to P2 will inherently depend on
the answer to P1.
Figure 1: Integer set example. (Diagram of a linked list of five nodes reached from head.)
Figure 2: Two sample transactions. (Diagram marking a read/write conflict.)
EXAMPLE 1. We shall illustrate our work with the same
example as in [13], i.e., an integer set implemented as a
linked list. Specific values can be added to, removed from,
or looked up in the set. Figure 1 shows an instance of an
integer set with five nodes representing 3 integers (14, 18,
and 25) and two special values (⊥ and ⊤) used to indicate
the first and last elements of the linked list. We shall denote
these nodes by n14, n18, n25, n⊥, and n⊤, respectively.
Consider transactions T1 inserting integer 15 in the set
and T2 looking up integer 18 (Figure 2). T1 must traverse
the first three nodes of the list to find the proper location



















to the list. Three nodes (n⊥, n14, and n18) are accessed but
only one (n14) is actually updated. T2 also traverses the first
three nodes, but none of them is updated.
STM systems typically distinguish read from write accesses
to shared objects. Multiple threads can access the same object
in read mode (e.g., node n⊥ can be read simultaneously by
T1 and T2) but only one thread can access an object in write
mode (e.g., n14 by T1). Furthermore, write accesses must be
performed in isolation from any read access by another transaction.
For instance, assuming that T1 tries to write n14 after
T2 has read n14 but before T2 completes (see Figure 2), a
STM system that guarantees linearizability or serializability
will detect a conflict and abort (or, in the most benign cases,
delay) one of the transactions. Typically, transactions that
fail to commit are restarted until they eventually succeed.
For a SI-based STM, the two transactions T1 and T2 will
not conflict because T2 is a read transaction that accesses a
consistent snapshot that is not affected by potentially concurrent
writes by T1. Update transactions like T1 will also read
from a consistent snapshot that can become stale before the
time at which T1 writes to n14. The price an application programmer
has to pay, in comparison to a serializable STM,
is that some read/write conflicts might have to be converted
into write/write conflicts (see [16] for more details). For
example, if an update transaction T0 removes node n14, we
need to make sure that T0 writes not only n⊥ but also n14 to
make sure that any concurrent transaction like T1 that inserts
a new node directly after n14 has a write/write conflict
with T0.
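The snapshot behavior in this example can be illustrated with a small multi-version object. The sketch below is not the algorithm proposed in this paper: the class `Versioned`, the function `write_conflicts`, and the integer timestamps are hypothetical, the code is single-threaded, and it only detects a write/write conflict without deciding which transaction wins.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: an object that keeps every committed version together
// with the commit timestamp at which that version became valid.
template <class T>
class Versioned {
public:
    explicit Versioned(T initial) {
        versions_.emplace_back(0, std::move(initial));
    }

    // Value as of snapshot time ts: the newest version committed at or
    // before ts. Later commits are invisible to this reader.
    T read_at(std::uint64_t ts) const {
        const T* result = &versions_.front().second;
        for (const auto& v : versions_) {
            if (v.first > ts) break;
            result = &v.second;
        }
        return *result;
    }

    // Install a new version; commit timestamps are assumed to be
    // strictly increasing.
    void commit(std::uint64_t commit_ts, T value) {
        versions_.emplace_back(commit_ts, std::move(value));
    }

    std::uint64_t latest_commit() const { return versions_.back().first; }

private:
    std::vector<std::pair<std::uint64_t, T>> versions_;
};

// An update transaction whose snapshot was taken at snapshot_ts has a
// write/write conflict on obj if obj was committed after that snapshot.
template <class T>
bool write_conflicts(const Versioned<T>& obj, std::uint64_t snapshot_ts) {
    return obj.latest_commit() > snapshot_ts;
}
```

Modelling the successor of n14 as a `Versioned<int>` that initially holds 18, a reader with snapshot time 5 still sees 18 after T1 commits the value 15 at time 7. A writer with the same snapshot time that wants to update n14 would be flagged as conflicting, which is why T0 in the example must also write n14.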
Regarding problem P2, snapshot isolation avoids common
isolation anomalies like dirty reads, dirty writes, lost
updates, and fuzzy reads [1]. Because snapshot isolation
circumvents read/write conflicts, application programmers
might need to convert read/write conflicts into write/write
conflicts if the detection of the former is needed to enforce
consistency [16]. On a very high level of abstraction, this
is similar to the inverse problem of deciding which objects
can be released early [13]: in early release a programmer
can remove the visibility of read objects while in SI a programmer
might need to make certain objects in the read set
"visible" by dummy writes. However, SI guarantees that the
read snapshot always stays consistent, which might simplify
matters in comparison to using an early release mechanism.
In this paper, we propose a software transactional memory,
SI-STM, that integrates several important features to
ease the development of transactional applications and maximize
their efficiency. We improve the throughput of workloads
with both short transactions and long read transactions
by eliminating or reducing read/write contention, using
a novel multi-version concurrency control algorithm
that implements a variant of snapshot isolation. We use a
variant because instead of always letting the first committer
win, we let a contention manager decide which transaction
wins a write/write conflict. We have developed an original
algorithm to implement a multi-version isolation level based
on snapshot isolation that can, if so requested, ensure linearizability
of transactions. This algorithm is implemented
without using any locks, which are known to severely limit
scalability on multi-processor architectures and introduce
the risk of deadlocks and software bugs.
Our experimental evaluation of a prototype implementa- 
tion demonstrates the benefits of our architecture. The per- 
formance of our prototype is competitive with lock-based 
implementations and it scales well in our benchmarks. 
The rest of the paper is organized as follows: Section 2 
discusses related work and Section 3 introduces the principle 
of snapshot isolation more precisely and describes efficient 
algorithms to implement it, with or without additional lin- 
earizability of individual transactions. Section 4 presents 
our STM implementation and Section 5 describes its seam- 
less integration in the Java language using only standard 
Java mechanisms. We evaluate the efficiency of our archi- 
tecture and algorithms in Section 6. Finally, Section 7 con- 
cludes the paper. 
2. RELATED WORK 
2.1 Software Transactional Memory 
Software transactional memory is not a new concept [20],
but it has recently attracted much attention because of the rise
of multi-processor and multi-core systems. There are
word-based [11] and object-based [13] STM implementations. The
design of the latter, Herlihy's DSTM, is used by several
current STM implementations. Our SI-STM is object-based
and thus uses some of DSTM's concepts. However, SI-STM
is a multi-version STM, whereas in DSTM objects have only
a single version. Furthermore, existing STM implementations
only provide strict transactional consistency, whereas
SI-STM additionally provides support for snapshot isolation,
which can increase the performance of suitable applications.
In the original STM implementations, reads by a transaction
are invisible to other transactions: to ensure that
consistent data is read, one must validate that all previously
opened objects have not been updated in the meantime. If
reads are to be visible, transactions must add themselves
to a list of readers at every transactional object they read
from. Reader lists enable update transactions to detect
conflicts with read transactions. However, the respective checks
can be costly because readers on other CPUs update the list,
which in turn increases the contention of the memory
interconnect. Scherer and Scott [19, 18] investigated the
trade-off between invisible and visible reads. They showed that
visible reads perform much better in several benchmarks
but, ultimately, the decision remains application-specific.
Marathe et al. [17] present an STM implementation that
adapts between eager and lazy acquisition of objects (i.e.,
at access or commit time) based on the execution of
previous transactions. However, they do not explore the trade-off
between visible and invisible reads but suggest that
adaptation in this dimension could increase performance. Cole
and Herlihy propose a snapshot access mode [4] that can be
roughly described as application-controlled invisible reads
for selected transactional objects with explicit validation by
the application. The only STM that we are aware of having
a design similar to ours is [3]. However, in their STM
design, every commit operation, including the upgrade of
transaction-private data to data accessible by other threads,
synchronizes on a single global lock. Thus, this design is
not fault-tolerant because there is no roll-back mechanism
for commits. Additionally, even in cases where write
operations do not conflict, only a single thread can be used for
updating memory. No performance benchmark results are
provided.
Read accesses in our SI-STM are invisible to other trans- 
actions but do not require revalidation of previously read 
objects on every new read access. The multi-version
information available to each transactional object provides
inexpensive validation by inspection of the timestamps of each
version (without having to access previously read objects).
We thus get the benefits of invisible reads but at a much 
lower cost. 
Most STM implementations support explicit transaction
demarcation and read and write operations, whereas only
a few provide more convenient language integration. Harris
and Fraser propose adding guarded code blocks to the
Java language [11], which are executed as transactions as
soon as the guard condition becomes true. SXM [9] is an
object-based STM implementation in C#, which uses
attributes (similar to Java annotations) for the declaration of
transaction boundaries but requires additional code to call a
transaction (i.e., the call is different from a normal method
call). Its authors suggest extending the C# post-processor to
implicitly start transactions. In contrast, our SI-STM employs
widely used aspect weavers and Java's annotations to
transparently add transaction support. It does not require any
changes to the programming language.
Most STM implementations are obstruction-free and use
contention managers [13] to ensure progress. Scherer and
Scott presented several contention managers [19, 18],
including the Karma manager used in Section 6. Guerraoui et
al. investigated how to mix different managers [9] and
presented the Greedy [10] and FTGreedy [8] managers, which
respectively guarantee a bound on response time and achieve
fault-tolerance.
2.2 Snapshot Isolation 
Snapshot isolation was first proposed by Berenson et al. [1]
and is used by several database systems. Elnikety et al.
present a variant [5] of snapshot isolation in which
transactions are allowed to read versions of data that are older
than the start timestamp of the transaction. They use this
weaker notion for database replication but require
conventional snapshot isolation for transactions running on the
same database node.
Conditions under which non-serializable executions can
occur under snapshot isolation are analyzed by Fekete et
al. [6]. They show how to modify applications to execute
correctly under snapshot isolation and show that the TPC-C
benchmark, an important database benchmark that is
representative of real-world applications, runs correctly under
snapshot isolation.
Lu et al. formalize in [16] the conditions under which
transactions can be safely executed with snapshot isolation.
They use a notion of semantic correctness instead of strict
serializability. This way, the checks that have to be
performed to ensure correctness are reduced to the
combinations between the postcondition of the set of all read
operations of a transaction and the write operations of other
transactions. No further intermediate states have to be
considered. We have used their conditions to construct SI-safe
implementations of a linked list and a skip list.
3. SNAPSHOT ISOLATION 
The idea of snapshot isolation [1] is to take a consistent
snapshot $S_T$ of the data at the time $\mathit{start}_T$ when a
transaction T starts, and have T perform all read and write
operations on $S_T$. When an update transaction T tries to commit,
it has to get a unique timestamp $\mathit{commit}_T$ that is larger
than any existing start or commit timestamp. Snapshot isolation
avoids write/write conflicts based on the first-committer-wins
principle: if another transaction T2 commits before T tries to
commit and T2's updates are not in T's snapshot $S_T$, i.e.,
$\mathit{commit}_{T2} > \mathit{start}_T$, then T has to be aborted.
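The first-committer-wins rule can be illustrated with a short Java sketch. This is only an illustration under stated assumptions, not code from SI-STM: the class name `FirstCommitterWins`, the method `mustAbort`, and the idea of passing in the commit timestamps of conflicting writers as a list are all hypothetical, and timestamps are modeled as plain `long` values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the first-committer-wins check; not the paper's code.
final class FirstCommitterWins {

    /**
     * Returns true if T must abort under first-committer-wins.
     *
     * @param startT             timestamp at which T took its snapshot
     * @param conflictingCommits commit timestamps of transactions that wrote
     *                           an object T also wants to write (assumed input)
     */
    static boolean mustAbort(long startT, List<Long> conflictingCommits) {
        for (long commitT2 : conflictingCommits) {
            if (commitT2 > startT) {
                return true; // T2's update is not part of T's snapshot
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Both conflicting writers committed before T's snapshot: T may commit.
        System.out.println(mustAbort(10, Arrays.asList(8L, 9L)));  // false
        // A conflicting writer committed at 12 > 10: T has to abort.
        System.out.println(mustAbort(10, Arrays.asList(8L, 12L))); // true
    }
}
```

The check only compares timestamps; how the set of conflicting writers is found is left open here.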
Snapshot isolation does not guarantee serializability but
avoids common isolation anomalies like dirty reads, dirty
writes, lost updates, and fuzzy reads [1]. Snapshot isolation
is an optimistic approach that is expected to perform well
for workloads with short update transactions that conflict
minimally and long read-only transactions. This matches
many important application domains, and slight variations of
snapshot isolation are used in common databases like Oracle
and Microsoft SQL Server [6]. Hence, we are investigating
whether snapshot isolation could be a good foundation for STMs
too.
3.1 Design and Semantics 
Our SI-STM provides the same properties as standard
snapshot isolation except that we do not enforce the
first-committer-wins principle. Instead, as in other
obstruction-free STM implementations, we use contention managers to
arbitrate write/write conflicts. We also provide the option to
enforce linearizability for transactions: at commit time, we
check for read/write conflicts and only permit transactions
to commit if they have neither write/write nor read/write
conflicts.
Our major goal was to develop a lightweight snapshot
algorithm that can both decrease the overhead of snapshot
isolation and maximize the freshness of the objects used in
a transaction. The motivation behind the freshness
requirement is twofold: first, to address the often-heard critique
that snapshot isolation is difficult to use because it
accesses old data; second, to reduce the number of write/write
conflicts and the memory footprint of the system (by
allowing old versions to be discarded earlier). Indeed, the
fresher the data in the snapshot, the lower the probability
of a write/write conflict, because the snapshot is more likely
to contain the newest data written by other transactions.
The main feature of our design is a lazy interval snapshot
algorithm. Instead of taking a snapshot at the start of a
transaction T, we lazily acquire a snapshot: we add a copy
of an object o to the snapshot just before T accesses o for
the first time. Preferably, we would like to add o's latest
version, i.e., a copy taken after the most recent committed
transaction that updated o. However, this might not
guarantee that the snapshot remains consistent. We say that
a snapshot $S_T$ is consistent iff there exists a time t such
that each copy $c_i$ of object $o_i$ in $S_T$ corresponds to the most
recent version of $o_i$ at time t.
To keep a snapshot consistent, one could perform a
validation of the snapshot whenever adding a new object to the
read set. A naive validation would be quadratic in the size
of the read set. This would be unacceptable for large
transactions. To address this issue, we designed a new algorithm
to determine the consistency more efficiently.
Each transaction T lazily acquires a consistent interval
snapshot $S_T$ that is valid within a non-empty validity
interval $V_T = [\mathit{min}_T, \mathit{max}_T]$: each copy $c_i$ of object $o_i$ in $S_T$
is the most recent version of $o_i$ for any time in $V_T$ and no
other transaction can commit a newer version of $o_i$ in
interval $(\mathit{min}_T, \mathit{max}_T]$. The validity interval is computed on
the fly according to the objects read by the transaction and
their available versions. Of course, different transactions will
share a copy $c_i$ as long as these transactions only perform
read accesses.
Let $\mathit{first}_T$ be the time when transaction T accesses its
first object. Our algorithm constructs a snapshot $S_T$ with
validity interval $V_T = [\mathit{min}_T, \mathit{max}_T]$, where $\mathit{max}_T \ge \mathit{min}_T$.
We guarantee that the snapshot is valid at some point in
time that follows, or coincides with, the first access, i.e.,
$\mathit{max}_T \ge \mathit{first}_T$. The validity interval of the snapshot can
be such that $\mathit{min}_T > \mathit{first}_T$. This means that, unlike other
optimizations of snapshot isolation that use snapshots of
the past, we can actually take a snapshot of the future, i.e.,
one not yet valid at the time the transaction starts processing.
To simplify matters, we define the effective start time of
transaction T as $\max(\mathit{first}_T, \mathit{min}_T)$. In that way, a snapshot
is conceptually taken at the start of a transaction, just as
expected by snapshot isolation.
3.2 Algorithm 
Each update transaction T has a unique commit timestamp
$\mathit{commit}_T$. The timestamps used in our implementation
are all based on unique and monotonically increasing
integer values for commit times. This allows us to associate
each object o with a history of object versions $o^{v_1}, o^{v_2}, \ldots$
with $v_{i+1} > v_i$ and object version $o^{v_i}$ being valid in the time
range $[v_i, v_{i+1} - 1]$. We call this range the validity interval
of object version $o^{v_i}$. It indicates that o was updated by a
transaction that committed at time $v_i$ and no other transaction
has committed a new version of o within $(v_i, v_{i+1} - 1]$.
The validity interval of object versions allows us to
associate the snapshot $S_T$, constructed lazily by a transaction T,
with a validity interval $V_T = [\mathit{min}_T, \mathit{max}_T]$. $V_T$ is the
intersection of the validity intervals of all object versions in $S_T$.
Hence, each object version in $S_T$ was committed no later
than $\mathit{min}_T$ and no transaction committed another version
within $V_T$.
Read access: When a transaction T reads an object o that
is not yet in $S_T$, we look for the most recent version $o^{v_i}$ with
a validity interval V that overlaps $V_T$. We compute the new
validity interval of the transaction as the intersection of V
and $V_T$.
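The read rule can be sketched in Java as follows. This is a minimal illustration under assumptions, not the SI-STM implementation: the names `LazySnapshotRead`, `Interval`, `History`, and `read` are hypothetical, an object's history is reduced to the list of its commit times, and the still-unknown upper bound of the latest version is capped at a caller-supplied "most recent commit time".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the lazy-snapshot read rule; not the paper's code.
final class LazySnapshotRead {

    /** Closed interval [min, max] of integer timestamps; empty if max < min. */
    static final class Interval {
        final long min, max;
        Interval(long min, long max) { this.min = min; this.max = max; }
        boolean isEmpty() { return max < min; }
        Interval intersect(Interval o) {
            return new Interval(Math.max(min, o.min), Math.min(max, o.max));
        }
    }

    /** Commit times v_1 < v_2 < ... of one object's versions, oldest first. */
    static final class History {
        final List<Long> commitTimes = new ArrayList<>();
        History(long... times) { for (long t : times) commitTimes.add(t); }

        /** Validity of version i: [v_i, v_{i+1} - 1]; the latest version is capped at 'now'. */
        Interval validity(int i, long now) {
            long upper = (i + 1 < commitTimes.size()) ? commitTimes.get(i + 1) - 1 : now;
            return new Interval(commitTimes.get(i), upper);
        }
    }

    /**
     * Reads an object inside a transaction whose validity interval is vT.
     * Picks the most recent version whose validity overlaps vT and returns
     * the narrowed interval, or null if no stored version overlaps (the
     * caller would then try to extend vT or abort).
     */
    static Interval read(History h, Interval vT, long now) {
        for (int i = h.commitTimes.size() - 1; i >= 0; i--) {
            Interval narrowed = vT.intersect(h.validity(i, now));
            if (!narrowed.isEmpty()) return narrowed;
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed scenario: T already holds V_T = [12, 13]; the object has
        // versions committed at times 7 and 14; the latest commit time is 15.
        Interval vT = read(new History(7L, 14L), new Interval(12, 13), 15);
        System.out.println("[" + vT.min + ", " + vT.max + "]"); // [12, 13]
    }
}
```

In the usage example the version committed at time 14 is skipped because its validity [14, 15] does not overlap [12, 13], so the older version with validity [7, 13] is selected.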
Write access: When a transaction T tries to update an
object o for the first time, a private copy of this object is
created. We only permit one transaction at a time to acquire a
private copy of an object. If a second transaction T2 attempts to
update o before T has committed its changes, we have a write/write
conflict. In this case, the contention manager is called to
determine which of the two transactions needs to be aborted
(or delayed). In that way, we perform a forward validation
of update transactions.
Commit: A transaction can commit as long as its validity
interval $V_T = [\mathit{min}_T, \mathit{max}_T]$ is non-empty, i.e.,
$\mathit{max}_T \ge \mathit{min}_T$. If we keep a sufficiently long history of objects, the
validity interval will never become empty. When an update
transaction commits, it receives a unique timestamp
$\mathit{commit}_T$. Read-only transactions do not have a unique
commit timestamp as they do not update objects.
Memory Overhead: In our measurements we keep a small
number k of old variants for each object. In the future we will
change this and use a fixed number of weak references
to old variants of an object instead. In this way, the Java
garbage collector will be able to automatically reclaim old
variants in case more memory is needed. The memory
overhead will then depend on the available memory, i.e., no
additional copies are kept in case no memory is available and
up to k variants are kept if the Java virtual machine has
sufficient memory available.
Figure 3: A transaction reading three objects.
Extension of validity intervals: When a transaction T
adds the most recent object version $o^{v_i}$ to its snapshot $S_T$,
the time $v_{i+1}$ at which $o^{v_i}$ expires is not yet known
(otherwise, $o^{v_i}$ would not be the most recent version). Thus,
we temporarily set the upper bound on the validity of $o^{v_i}$ to
the most recent commit time, i.e., the commit timestamp of the
most recently committed transaction.
To extend the validity range of transaction T, we check if
any temporary upper bound on the validity of the objects in
$S_T$ can be shifted to a later time. Our system tries to extend
the validity interval $V_T$ if $V_T$ becomes empty. The goal of
this extension is to decrease the abort frequency. Additional
proactive extensions could be useful in some cases. However,
deciding whether extension costs are justified by possible
throughput gains is nontrivial and remains a task for future
work.
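One way to picture the extension step is as a recomputation of the upper bound of the validity interval over the read set. The Java sketch below is an assumption-laden illustration, not the SI-STM code: `ValidityExtension`, `ReadEntry`, `NONE`, and `extendedMax` are hypothetical names, and each read-set entry only records whether (and when) a newer version of the object has been committed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of validity-interval extension; not the paper's code.
final class ValidityExtension {
    /** Marker for "no newer version has been committed yet". */
    static final long NONE = -1;

    /** Read-set entry: what is currently known about the version that was read. */
    static final class ReadEntry {
        final long nextVersionCommit; // v_{i+1}, or NONE if still the most recent version
        ReadEntry(long nextVersionCommit) { this.nextVersionCommit = nextVersionCommit; }
    }

    /**
     * Recomputes the upper bound of the validity interval: versions that are
     * still the most recent are valid up to the latest commit time, whereas
     * superseded versions end at v_{i+1} - 1.
     */
    static long extendedMax(List<ReadEntry> readSet, long latestCommit) {
        long maxT = latestCommit;
        for (ReadEntry e : readSet) {
            long upper = (e.nextVersionCommit == NONE) ? latestCommit : e.nextVersionCommit - 1;
            maxT = Math.min(maxT, upper);
        }
        return maxT;
    }

    public static void main(String[] args) {
        List<ReadEntry> unchanged = Arrays.asList(new ReadEntry(NONE), new ReadEntry(NONE));
        System.out.println(extendedMax(unchanged, 20));   // 20: nothing read was overwritten
        List<ReadEntry> overwritten = Arrays.asList(new ReadEntry(14), new ReadEntry(NONE));
        System.out.println(extendedMax(overwritten, 20)); // 13: one version expired at 14
    }
}
```

The sketch walks the whole read set, which is the cost that makes the decision about proactive extensions nontrivial.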
EXAMPLE 2. To illustrate the concepts of lazy snapshot
isolation, consider a transaction T that reads objects $o_1$, $o_2$,
and $o_3$ (see Figure 3). When T accesses $o_2$ for the first
time at time 13, T reads the most recent version $o_2^{12}$ of $o_2$
even though this version did not yet exist when T read $o_1$ at
time 11. When accessing $o_3$ at time 15, T cannot use the most
recent version $o_3^{14}$ of $o_3$ because the validity intervals of
$o_1^{10}$ and $o_3^{14}$ do not overlap. Therefore, the snapshot $S_T$ of T
consists of object versions $o_1^{10}$, $o_2^{12}$, and an older version
of $o_3$, with a validity interval $V_T = [12, 13]$.
3.3 Linearizability 
We have implemented an optimistic approach that can
enforce linearizability [2] of transactions. If a programmer
requests linearizability, a transaction T can only commit
at time $\mathit{commit}_T$ if its validity interval contains time
$\mathit{commit}_T - 1$, i.e., all objects read by T are still valid at
the time T commits. The intuition is that all object
versions in T's snapshot are valid up to T's commit time and,
hence, there are neither read/write nor write/write conflicts
affecting T.
To minimize aborts, a transaction T will try to extend
its validity interval before committing. If there are no
read/write conflicts, i.e., no objects of T's read-set have been
updated, T will be able to extend the validity interval to
the current time and consequently commit.
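The two commit conditions (a non-empty validity interval for snapshot isolation, and additionally an interval containing $\mathit{commit}_T - 1$ when linearizability is requested) can be written down as a small predicate. The Java sketch below is illustrative only: `CommitCheck` and `canCommit` are hypothetical names, and the sketch assumes that write/write conflicts have already been resolved at write time.

```java
// Hypothetical sketch of the commit-time check; not the paper's code.
final class CommitCheck {

    /**
     * @param minT         lower bound of the validity interval V_T
     * @param maxT         upper bound of V_T (after any extension attempt)
     * @param commitT      commit timestamp obtained by the transaction
     * @param linearizable whether the programmer requested linearizability
     */
    static boolean canCommit(long minT, long maxT, long commitT, boolean linearizable) {
        if (maxT < minT) {
            return false;                 // empty validity interval: abort
        }
        if (!linearizable) {
            return true;                  // plain snapshot isolation
        }
        long t = commitT - 1;             // V_T must contain commit_T - 1
        return minT <= t && t <= maxT;
    }

    public static void main(String[] args) {
        System.out.println(canCommit(12, 13, 20, false)); // true
        System.out.println(canCommit(12, 13, 20, true));  // false: snapshot is stale
        System.out.println(canCommit(12, 19, 20, true));  // true: extended up to 19
    }
}
```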
4. STM IMPLEMENTATION 
We now describe the architecture developed to support 
lightweight transactions in Java. Our transactional mem-
ory is implemented as a software library. The main compo- 
nents exposed to the application developer are transactions 
and transactional objects. In addition, it features a modu- 
lar architecture for dealing with contention and transaction 
management. 
4.1 Transactions 
Transactions are implemented as thread-local objects, i.e., 
the scope of a transaction is confined inside the current 
thread of control. The application developer can program- 
matically start a transaction, try to commit it, or force it to 
abort. 
As in [13], transaction objects (see Figure 4) contain a
status field, initially set to ACTIVE, that can be atomically
changed to either COMMITTED or ABORTED using a
compare-and-swap (CAS) operation*, depending on whether the
transaction successfully completes or not. A transaction object
can additionally keep track of the objects being read and
updated (read-set and write-set) and maintains timestamps
indicating the transaction's start and commit times.
Timestamps are discrete values generated by a global lock-free
counter that can be atomically incremented and read.
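The status field and the timestamp counter map naturally onto the atomic classes of `java.util.concurrent.atomic`. The sketch below is a minimal illustration, not the SI-STM source: the class `TxDescriptor` and its members are hypothetical, and read-set, write-set, and commit-timestamp handling are left out.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a transaction descriptor; not the paper's code.
final class TxDescriptor {
    enum Status { ACTIVE, COMMITTED, ABORTED }

    // Global lock-free counter that can be atomically incremented and read.
    private static final AtomicLong CLOCK = new AtomicLong();

    static long nextTimestamp() { return CLOCK.incrementAndGet(); }
    static long currentTime()   { return CLOCK.get(); }

    // Status starts as ACTIVE and changes exactly once, via CAS.
    private final AtomicReference<Status> status = new AtomicReference<>(Status.ACTIVE);
    final long startTs = currentTime();

    /** ACTIVE -> COMMITTED; fails if the transaction is no longer active. */
    boolean tryCommit() { return status.compareAndSet(Status.ACTIVE, Status.COMMITTED); }

    /** ACTIVE -> ABORTED; may also be called by a competing transaction. */
    boolean abort() { return status.compareAndSet(Status.ACTIVE, Status.ABORTED); }

    Status status() { return status.get(); }

    public static void main(String[] args) {
        TxDescriptor tx = new TxDescriptor();
        System.out.println(tx.abort());     // true: the CAS from ACTIVE succeeds
        System.out.println(tx.tryCommit()); // false: already aborted
        System.out.println(tx.status());    // ABORTED
    }
}
```

Because both transitions start from ACTIVE, a commit and a concurrent abort cannot both succeed.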
4.2 Transactional Objects 
Transactional objects are STM-specific wrappers that
control accesses to application objects. They manage multiple
versions of the object's state on behalf of active transactions.
Regular objects being wrapped must be able to duplicate
their state, i.e., clone themselves, as transactional wrappers
need to create new versions.
Before being used by the application, a transactional ob- 
ject must be "opened", i.e., a reference to the current state 
of the application object must be acquired. A transactional 
object can be opened for reading or for writing. If a transac- 
tion opens the same object multiple times, the same state is 
returned. An object opened for reading can be subsequently 
opened for writing (similar to lock promotion in databases). 
Opening a transactional object may fail and force the cur- 
rent transaction to abort. 
4.3 Contention Management 
Conflicts are handled in a modular way by means of
contention managers, as in [13]. Contention managers are
invoked when a conflict occurs between two transactions and
they must take actions to resolve the conflict, e.g., by
aborting or delaying one of the conflicting transactions.
Contention managers can make decisions based on information
stored in transaction objects (read- and write-set,
timestamps), as well as historical data maintained over time. In
particular, contention managers can request to be notified of
transactional events (start, commit, abort, read, write) and
use this information to implement sophisticated conflict
resolution strategies.
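A contention manager of this kind can be modeled as a small strategy interface. The Java sketch below is purely illustrative: `ContentionPolicy`, `Decision`, and the `OlderWins` policy are hypothetical names and do not reproduce the interface of SI-STM, DSTM, or the Karma manager; the policy shown simply favors the transaction with the earlier start timestamp.

```java
// Hypothetical sketch of a pluggable contention policy; not the paper's code.
interface ContentionPolicy {
    enum Decision { ABORT_OTHER, WAIT }

    /**
     * Called when 'attacker' wants to write an object currently owned by
     * another active transaction ('owner').
     */
    Decision resolve(long attackerStartTs, long ownerStartTs);
}

/** Example policy (an assumption for illustration): the older transaction wins. */
final class OlderWins implements ContentionPolicy {
    @Override
    public Decision resolve(long attackerStartTs, long ownerStartTs) {
        // An attacker that started earlier may abort the owner; otherwise it backs off.
        return attackerStartTs < ownerStartTs ? Decision.ABORT_OTHER : Decision.WAIT;
    }

    public static void main(String[] args) {
        ContentionPolicy cm = new OlderWins();
        System.out.println(cm.resolve(5, 9)); // ABORT_OTHER
        System.out.println(cm.resolve(9, 5)); // WAIT
    }
}
```

Keeping the policy behind an interface is what allows different managers to be swapped in without touching the transaction logic.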
4.4 Transaction Management 
Our STM implementation currently supports two transaction
management models. The first one is very similar
to the SXM of Herlihy et al. [9], which is in turn similar
to DSTM [13] but uses visible reads. It allows multiple
readers or a single writer, but not both, to access a
given object. Updates to a shared object are performed on
a transaction-local copy, which becomes the current version
when the transaction commits. A single consistent version
of each shared object is maintained at a given time. Support
for SXM has been implemented essentially for comparison
purposes and we shall not describe it further.
*A CAS operation on a variable takes as arguments a new
value v and an expected value e. It atomically sets the value
of the variable to v if the current value of the variable is
equal to e. It returns the value of the variable that was read.
The second transaction management model, termed SI- 
STM, implements multi-version concurrency control and snap- 
shot isolation as described in Section 3. Shared objects are 
accessed indirectly via transactional wrappers that can be 
invoked concurrently by multiple threads and effectively be- 
have as transactional objects. 
Transactional objects maintain a reference to a descriptor,
called locator [13], that keeps track of several versions
of the object's state (see Figure 4): a tentative version being
written to by an update transaction (tentative); a
committed version (state) together with its commit timestamp
(commit_ts); and the n previous committed versions of the
object (old_versions) together with their commit
timestamps. n is a small value that is typically between 1 and 8.
A locator additionally stores a reference to the writer, i.e.,
the transaction that updates the tentative version, if any
(transaction). Note that the locator does not keep track
of transactions that read the object.
References to a locator can be read atomically and
updated using a CAS operation. Once a locator has been
registered by a transactional object, it becomes immutable and
is never modified. When a transactional object is created,
its locator is initialized with the state of the object being
wrapped as committed version, and 0 as commit timestamp;
other fields are set to null.
We define the current version of the object as follows: if
the transaction field of the locator is null, or if the last
writer has aborted, then the current version corresponds to
the committed version of the object (state) with its
associated commit timestamp (commit_ts); if the last writer
has committed, then the current version corresponds to the
tentative version of the object (tentative) with a commit
timestamp equal to that of the writer; finally, if the writer
is still active, the current version is undefined.
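The three cases of this definition translate directly into code. The following Java sketch is a hedged illustration rather than the actual SI-STM classes: `LocatorSketch`, `Tx`, `Locator`, `Version`, and `currentVersion` are hypothetical names, the list of old versions is omitted, and "undefined" is represented by returning `null`.

```java
// Hypothetical sketch of resolving the current version; not the paper's code.
final class LocatorSketch {
    enum Status { ACTIVE, COMMITTED, ABORTED }

    static final class Tx {
        volatile long commitTs = -1;            // set before status becomes COMMITTED
        volatile Status status = Status.ACTIVE;
    }

    /** Immutable descriptor (old versions omitted in this sketch). */
    static final class Locator<T> {
        final Tx transaction;  // writer of the tentative version, or null
        final T tentative;     // version being written by 'transaction'
        final T state;         // committed version
        final long commitTs;   // commit timestamp of 'state'
        Locator(Tx transaction, T tentative, T state, long commitTs) {
            this.transaction = transaction;
            this.tentative = tentative;
            this.state = state;
            this.commitTs = commitTs;
        }
    }

    static final class Version<T> {
        final T data;
        final long commitTs;
        Version(T data, long commitTs) { this.data = data; this.commitTs = commitTs; }
    }

    /** Returns the current version, or null if it is undefined (active writer). */
    static <T> Version<T> currentVersion(Locator<T> loc) {
        Tx writer = loc.transaction;
        if (writer == null) return new Version<>(loc.state, loc.commitTs);
        Status s = writer.status;               // read the status exactly once
        switch (s) {
            case ABORTED:   return new Version<>(loc.state, loc.commitTs);
            case COMMITTED: return new Version<>(loc.tentative, writer.commitTs);
            default:        return null;        // writer still active
        }
    }

    public static void main(String[] args) {
        Tx t1 = new Tx();
        Locator<String> loc = new Locator<>(t1, "tentative", "committed", 45);
        System.out.println(currentVersion(loc) == null);  // true: writer is active
        t1.commitTs = 53;
        t1.status = Status.COMMITTED;
        System.out.println(currentVersion(loc).commitTs); // 53
    }
}
```

Note that the locator itself never changes; only the writer's status decides which of its two versions is current.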
When a transaction accesses an object in write mode for
the first time, we check in the current locator whether there
is already an active writer. If that is the case, there is a
conflict and we ask the contention manager to arbitrate between
both transactions before retrying. Otherwise, if a validity
condition to be described shortly is met, we create a new
locator and register the current transaction as writer. We
store references to the current and previous versions in the
new locator and we create a new tentative version by
duplicating the state of the current version. Finally, we try to
update the reference to the locator in the transactional
object using a CAS operation. If this fails, then a concurrent
transaction has updated the reference in the meantime and
we retry the whole procedure. Otherwise, the current
transaction continues its execution by accessing its local tentative
version.
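The write-open procedure is essentially a CAS retry loop over the locator reference. The Java sketch below illustrates that loop under simplifying assumptions and is not the SI-STM implementation: all names (`WriteOpenSketch`, `TObject`, `openForWrite`) are hypothetical, the old-versions list and the validity condition are omitted, and a conflict with an active writer is reported by returning `null` instead of calling a contention manager.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical sketch of opening an object in write mode; not the paper's code.
final class WriteOpenSketch {
    enum Status { ACTIVE, COMMITTED, ABORTED }

    static final class Tx {
        volatile long commitTs = -1;            // set before status becomes COMMITTED
        volatile Status status = Status.ACTIVE;
    }

    /** Immutable descriptor; a fresh one is installed on every write-open. */
    static final class Locator<T> {
        final Tx transaction; final T tentative; final T state; final long commitTs;
        Locator(Tx transaction, T tentative, T state, long commitTs) {
            this.transaction = transaction; this.tentative = tentative;
            this.state = state; this.commitTs = commitTs;
        }
    }

    static final class TObject<T> {
        final AtomicReference<Locator<T>> locator;
        TObject(T initial) {
            locator = new AtomicReference<>(new Locator<T>(null, null, initial, 0));
        }
    }

    /** Returns the caller's private tentative copy, or null on a write/write conflict. */
    static <T> T openForWrite(TObject<T> obj, Tx me, UnaryOperator<T> copy) {
        while (true) {
            Locator<T> old = obj.locator.get();
            Tx writer = old.transaction;
            if (writer == me) return old.tentative;   // already opened by this transaction
            boolean committed = false;
            if (writer != null) {
                Status s = writer.status;             // read once
                if (s == Status.ACTIVE) return null;  // conflict: contention manager decides
                committed = (s == Status.COMMITTED);
            }
            // Current version: tentative if the last writer committed, else the committed state.
            T current = committed ? old.tentative : old.state;
            long ts = committed ? writer.commitTs : old.commitTs;
            Locator<T> fresh = new Locator<>(me, copy.apply(current), current, ts);
            if (obj.locator.compareAndSet(old, fresh)) return fresh.tentative;
            // CAS failed: another transaction installed a locator in the meantime; retry.
        }
    }

    public static void main(String[] args) {
        TObject<int[]> obj = new TObject<>(new int[] {1});
        Tx t1 = new Tx();
        int[] priv = openForWrite(obj, t1, a -> a.clone());
        priv[0] = 2;                                  // update the private copy
        Tx t2 = new Tx();
        System.out.println(openForWrite(obj, t2, a -> a.clone()) == null); // true: T1 is active
        t1.commitTs = 53;
        t1.status = Status.COMMITTED;
        System.out.println(openForWrite(obj, t2, a -> a.clone())[0]);      // 2
    }
}
```

After T1 commits, T2's new locator carries T1's tentative version as its committed state, which mirrors the "shift by one position" of the versions.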
EXAMPLE 3. Consider the example in  Figure 5. Trans- 
action TI is registered as writer in the locator of the trans- 
actional object. As TI has committed, the tentative version 
corresponds to the current state of the object, with a commit 
timestamp of 53. Transaction Tz accesses the transactional 
r is i l t s s ft r li r r . i -
e ts e se t t a licati e el er are tr s cti s
a tr s cti l jects. I a iti , it feat res a -
l r r it t r f r li it t ti tr s ti
t.
. i
r s ti s r i l t s t r -l l j ts, i. .,
t f tr ti i fi i i t rr t
t r f tr l. li ti l r r r -
ti ll t t t ti , t t it it, f r it t
rt.
s i [ 3], tr s ti j ts (s i r ) t i
t t s i l , i iti ll t t I , t i ll
I i
S) ti -depending t t t
ti f ll l t r t. tr ti j t
it
i i ti ' .
t i t l t l l l
t
. t
t l t .
t f
l













t ti a er , 3].
i




t s), s l
rti lar,
t t t rt, t,









pie readers or a single riter-but not oth-to access a
given object. pdates to a shared object are perfor ed on
a transaction-local copy, hich beco es the current version
e t tra sacti c its. si le c siste t ersi
of each shared object is aintained at a given ti e. upport
f r s i l t ss ti ll f r ris
r ses a e s all t escri e it f rt er.
e second tra sacti a a e e t el, ter e I-
, i le e ts lti- ersi c c rre c c tr l s a -
s t is lati as escri e i ecti . r j ts r
i ir tl i t ti l t t
i rr tl lti l t r ff ti l -
t ti l j t .
ti l j t i t i t i
t , ll l t [ ], f l i
f t j t' t t (s i r ): t t tiv rsi i
) t
L i f t
_ i
l
l t iti ll t t it , i. .,












mit_t f it r
i s
f
















. 1 , tiv i
t t t t t f j t, it t
f . 2 ti l
Figure  4: Sample  locator  for  a transac-  F igure  5: Sample  locator  for  a transact ional  ob jec t  wi th  a 
t ional  ob jec t  w i t h  a n  act ive wri ter .  T h e  commit ted  wr i te r  T I  (left). A n o t h e r  t ransac t ion  T2 o p e n s  
la test  consistent version is Data3 w i t h  va- t h e  t ransact ional  ob jec t  in  wr i te  m o d e  a n d  creates  a new 
lidity s ta r t ing  at t i m e  45. locator  (right).  
object in write mode and creates a new Locator, with versions 5. LANGUAGE INTEGRATION 
shifted by one position with respect to the old locator (the Most of the STM implementations we know of provide ex- 
old tentative version becomes the new committed version). plicit constructs for transaction demarcation and accesses to 
Then, T2 creates a COPY of the current state as tentative transactional objects. The programmer uses special opera- 
version and uses a CAS operation to update the reference to tions to start, abort, or commit the transaction associated 
the locator in the transactional object. ' with the current thread, as well as retry transactions that 
fail to commit. Further, the programmer needs to explic- 
One can note that the algorithm for accessing trans=- itly instantiate transactional objects and provide support 
tional objects in write mode follows the same general prin- for creating copies of the wrapped objects, 
ciple as in DSTM, with variations resulting principally from Our STM implementation is no exception and features 
versioning and timestamp management. In contrast, read such a programmatic interface. It  features a declarative ap- 
operations are handled in a very different manner. As a
matter of fact, the key to the efficiency of our SI-STM model
is that no modification to the locator nor validation of
previously read objects is necessary when accessing a
transactional object in read mode.
Each version has a validity range, i.e., an interval between
two timestamps during which the version was representing
the current state. This range starts with the commit timestamp
of the version and ends one time unit before the commit
timestamp of the next version. For instance, in Figure 4,
Data1 and Data2 have validity ranges of [31,38) and
[38,45), respectively; Data3 has a validity range starting at
45 with an upper bound still unknown. For each transaction,
we also maintain a validity range that corresponds to
the intersection of the validity ranges of all the objects in
its read-set. A necessary condition for the transaction to be
able to commit is that this range remains non-empty.
When opening a transactional object in read mode, the
transaction searches through the committed versions of the
object starting by the most recent and selects the first that
intersects with its validity range. If there is no such version,
we try to extend the validity range of the transaction by
recomputing the unknown upper bounds of the objects in
the read set, as described in Section 3. If the intersection
remains empty after the extend, the transaction needs to
abort. In all other cases, we simply update the validity
range of the transaction and return the selected version.
We can now describe the missing validity condition on
write accesses. Tentative versions also have an open-ended
validity range, which starts with the commit timestamp of
the cloned state and must also intersect with the validity
range of the transaction. Therefore, a write access will fail
if the commit timestamp of the current version is posterior to

proach for seamless integration of lightweight transaction
in Java applications. To that end, we use a combination of
standard techniques: the annotation feature of Java 1.5
together with aspect-oriented programming (AOP) [15].
Annotations are metadata that can be associated with types,
methods, and fields and allow programmers to decorate Java
code with their own attributes. Aspect-oriented programming
is an approach to writing software, which allows developers
to easily capture and integrate cross-cutting concerns,
or aspects, in their applications.
5.1 Declarative STM Support
Our language integration mechanisms provide implicit
transaction demarcation and transparent access to transactional
objects. The programmer only needs to add annotations
to relevant classes and methods. He is freed from the burden
of dealing programmatically with the STM, which in
turn limits the risk of introducing software bugs in complex
transactional constructs.
5.1.1 Transactional objects
Transactional objects to be accessed in the context of
concurrent transactions must have the annotation @Transactional.
All accesses to their methods and fields are managed by the
transactional library so as to guarantee isolation. Specific
methods can be additionally annotated by @ReadOnly to
indicate that they do not modify the state of the target object;
the transaction manager relies on this information to
distinguish reads from writes.
As mentioned in Section 4, transactional objects should
be able to clone their state. Support for object duplication is
added transparently to transactional objects, provided that
all their instance fields are either (1) of primitive type, or





























































not the case, the transactional object should define a public
method duplicate() that performs a deep copy of the
object's state.
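As an illustration only, the annotations and the duplicate() convention described above might look as follows. The annotation declarations, the Polyline class, and the AccessKind helper are assumptions made for this sketch (the paper names the annotations but does not show their declarations), and we simply assume that Polyline does not qualify for automatic duplication.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical declarations: only the annotation names come from the text.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Transactional {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ReadOnly {}

@Transactional
class Polyline {
    private int[] xs; // assumed not to qualify for automatic duplication

    Polyline(int[] xs) { this.xs = xs.clone(); }

    @ReadOnly
    public int getX(int i) { return xs[i]; }

    public void setX(int i, int v) { xs[i] = v; }

    // Hand-written deep copy of the object's state.
    public Polyline duplicate() { return new Polyline(xs); }
}

final class AccessKind {
    // Assumed helper: a call counts as a read only if the method carries @ReadOnly.
    static boolean isRead(Class<?> type, String name, Class<?>... params) {
        try {
            Method m = type.getMethod(name, params);
            return type.isAnnotationPresent(Transactional.class)
                && m.isAnnotationPresent(ReadOnly.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

A transaction manager built along these lines would treat getX as a read and setX as a write, and would call duplicate() to obtain the private copy that a writer modifies.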
5.1.2 Specifying transaction demarcation 
Our language integration mechanisms also feature implicit 
transaction demarcation: methods that have the annotation 
@Atomic will always execute in the context of a new transaction.
Such atomic methods are transparently reinvoked if
the enclosing transaction fails to commit due to conflicting
accesses to transactional objects. Transactions that span
arbitrary blocks of code must use explicit demarcation.
Alternatively, a method can be declared with the @Isolated 
annotation. The difference between atomic and isolated is 
subtle: if an exception is raised by an atomic method, the 
enclosing transaction is aborted before propagating the ex- 
ception to the caller; in contrast, isolated methods always 
commit the partial effects of the transaction before propa- 
gating the exception. The choice between atomic and iso- 
lated methods depends on the application semantics. 
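The difference between the two exception policies can be made concrete with a toy model. The sketch below is not the SI-STM implementation: ToyStore, its write buffer, and the Body interface are hypothetical, conflicts and retries are left out, and lambdas (Java 8) are used for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Toy store: a transaction buffers its writes and applies them on commit.
final class ToyStore {
    final Map<String, Integer> committed = new HashMap<>();

    interface Body { void run(Map<String, Integer> writes); }

    // @Atomic-like policy: an exception aborts the transaction and is rethrown.
    void atomic(Body body) {
        Map<String, Integer> writes = new HashMap<>();
        body.run(writes);          // if this throws, the buffered writes are dropped
        committed.putAll(writes);  // commit
    }

    // @Isolated-like policy: partial effects are committed before the exception propagates.
    void isolated(Body body) {
        Map<String, Integer> writes = new HashMap<>();
        try {
            body.run(writes);
        } finally {
            committed.putAll(writes); // commit even when the body throws
        }
    }
}
```

With this model, a body that writes a value and then throws leaves no trace under atomic, whereas under isolated the value is visible to later transactions even though the caller still sees the exception.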
EXAMPLE 4. Figure 6 presents an implementation of the
integer set introduced in Example 1. Observe that the code
makes no reference to STM, with the exception of the
annotations. Transactional constructs are transparently woven
into the application by AOP. Compare this code with the
explicit approach presented in [13].
5.2 AOP Implementation 
Our STM implementation uses AOP to transparently add
transactional support to the application based on the
annotations inserted by the developer. Each object declared
as transactional is extended with a reference to a
transactional wrapper, methods to open the object in read and
write mode, and support for state duplication.
We use AOP around advices to transparently create a new
transaction for each call to an atomic or isolated method.
Transactions that fail to commit are automatically retried.
Similar advices are defined on transactional objects to
intercept and redirect method calls and field accesses to the
appropriate version.
The AOP weaver integrates the aspects in the application
at compile-time or at load-time. In comparison with explicit
transaction management, an application that uses declarative
STM incurs a small performance penalty, mostly due to
the additional runtime overhead of advices and the extra
indirection for every access to a transactional object (instead
of the first access only). Overall, the efficiency loss remains
very small and is easily compensated by the many benefits
of implicit transaction demarcation and transparent access
to transactional objects. Note finally that declarative and
programmatic constructs can be mixed within the same
application.
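The retry behaviour of such an around advice can be approximated without an AOP weaver. The sketch below uses a JDK dynamic proxy as a stand-in for weaving; this is a substitution made for illustration, not the mechanism used by the authors. The Atomic declaration, ConflictException, Counter, and RetryAdvice are all hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Atomic {}

// Hypothetical signal that a commit failed because of a conflicting access.
class ConflictException extends RuntimeException {}

interface Counter {
    @Atomic int increment();
}

final class RetryAdvice {
    // Wraps 'target' so that calls to @Atomic interface methods are retried on conflict.
    @SuppressWarnings("unchecked")
    static <T> T weave(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface },
            (proxy, method, args) -> {
                boolean atomic = method.isAnnotationPresent(Atomic.class);
                while (true) {
                    try {
                        // A real STM would begin a transaction here and commit after the call.
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        if (atomic && e.getCause() instanceof ConflictException) {
                            continue; // transparently reinvoke the method
                        }
                        throw e.getCause();
                    }
                }
            });
    }
}
```

A dynamic proxy only intercepts calls made through an interface, whereas a weaver can also intercept field accesses and calls on concrete classes, which is what the declarative approach relies on.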
6. PERFORMANCE EVALUATION 
To evaluate the performance of our STM with snapshot 
isolation, we compared it with two other implementations. 
The first one follows the design of SXM by Herlihy et al. [9], 
an object-based STM with visible reads, with a few minor 
extensions. The second follows the design of Eager ASTM 
by Marathe et al. as described in [17]. Henceforth, we 
shall call these STM implementations SI-STM, SXM, and 
ASTM. Read operations in SXM are visible to other threads, 
whereas they are invisible in ASTM and SI-STM. Where 
appropriate, we show results for another variant of ASTM 
that only validates the read objects at the end of a trans- 
action (single-validate ASTM). All other STM implementa- 
tions guarantee that all objects read in a transaction always 
represent a consistent view. Note that we compare SI-STM
with similarly designed STMs so as to determine the
performance of snapshot isolation and SI-STM's inexpensive
validation.
We use five micro-benchmarks: a simple bank application;
two micro-benchmarks that measure the CPU time required
for the read and write operations of an STM; an integer
set implemented as a sorted linked list; and an integer set
implemented as a skip list.
The bank micro-benchmark consists of two transaction 
types: (1) transfers, i.e., a withdrawal from one account fol- 
lowed by a deposit on another account, and (2) computation 
of the aggregate balance of all accounts. Whereas the former 
transaction is small and contains 2 read/write accesses, the
latter is a long transaction consisting only of read accesses 
(one per account). To highlight the advantages of STMs, 
we additionally present results for fine-granular and coarse- 
granular lock-based implementations of these transactions, 
in which locks are explicitly acquired and released. The for- 
mer uses one lock (standard monitor implementation) per 
account while the latter uses a single lock for all accounts. 
Note that the lock-based implementation has lower runtime 
overhead as it uses programmatic constructs instead of the 
declarative transactions of SI-STM; hence, comparison of 
absolute performance figures is not exactly fair. 
We executed all benchmarks on a system with four Xeon 
CPUs, hyperthreading enabled (resulting in eight logical 
CPUs), 8GB of RAM, and Sun's Java Virtual Machine ver- 
sion 1.5.0. We used the virtual machine's default configu- 
ration for our system: a server-mode virtual machine, the 
Parallel garbage collector, and a maximum heap size of 1GB. 
We set the start size of the heap to its maximum size. Re- 
sults were obtained by executing five runs of 10 seconds for 
every tested configuration and computing the 20% trimmed 
mean, i.e., the mean of the three median values. All STMs 
use the Karma [19] contention manager.
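For reference, the 20% trimmed mean over five runs amounts to sorting the samples, discarding the lowest and highest value, and averaging the remaining three. A minimal sketch, with Stats and trimmedMean as hypothetical names:

```java
import java.util.Arrays;

final class Stats {
    // Mean after discarding the lowest and highest 'fraction' of the sorted samples.
    static double trimmedMean(double[] samples, double fraction) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int drop = (int) Math.floor(sorted.length * fraction); // values cut from each end
        double sum = 0;
        for (int i = drop; i < sorted.length - drop; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2 * drop);
    }
}
```

Trimming makes the reported throughput less sensitive to a single unusually slow or fast run, for example one disturbed by garbage collection.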
Figure 7 shows the throughput results for the bank ap- 
plication with 50 and 1024 accounts, and with 0% and 10% 
read transactions (other transactions are money transfers). 
Note that throughput is the total throughput of all threads 
and that the number of threads is shown with a logarithmic 
scale. 
Under high write contention workloads (50 accounts) and 
without long read-only transactions, SI-STM has slightly 
higher overhead than SXM and ASTM. For larger numbers 
of accounts (not shown), throughput increases for the STMs 
and fine-grained locks because of less contention. 
SI-STM also scales well when there are long read-all trans- 
actions, whereas SXM suffers from a high conflict rate be- 
cause of visible reads and cannot take advantage of addi- 
tional CPUs. Although both SI-STM and ASTM use invisi- 
ble reads, the throughput of the ASTM version that always 
guarantees consistent reads is very low because of the valida- 
tion overhead. When ASTM only performs validation at the 
end of a read-only transaction (single-validate), the through- 
put is significantly higher. However, the transactions might 
read inconsistent data. For example, if a transaction needs 
to read all elements of a linked-list-based queue, it needs
to validate its read set during the transaction to guarantee
that it terminates even when the queue is being modified by
other transactions.

@Transactional  // required for transactional objects (Section 5.1)
public class Node {
  private int value;
  private Node next;
  public Node(int v) { value = v; }
  public void setValue(int v) { value = v; }
  public void setNext(Node n) { next = n; }
  @ReadOnly
  public int getValue() { return value; }
  @ReadOnly
  public Node getNext() { return next; }
}
public class IntSet {
  private Node head;
  public IntSet() {
    Node min = new Node(Integer.MIN_VALUE);
    Node max = new Node(Integer.MAX_VALUE);
    min.setNext(max);  // link the two sentinel nodes
    head = min;
  }
  @Atomic
  public boolean add(int v) {
    Node prev = head;
    Node next = prev.getNext();
    while (next.getValue() < v) {
      prev = next;
      next = prev.getNext();
    }
    if (next.getValue() == v)
      return false;
    Node n = new Node(v);
    n.setNext(prev.getNext());
    prev.setNext(n);
    return true;
  }
  @Atomic
  public boolean contains(int v) {
    Node prev = head;
    Node next = prev.getNext();
    while (next.getValue() < v) {
      prev = next;
      next = prev.getNext();
    }
    return (next.getValue() == v);
  }
}
Figure 6: Integer set implementation using declarative STM support.

[Plots: panels "50 accounts, 0% read-all", "50 accounts, 10% read-all", and "1024 accounts, 10% read-all"; x-axis "Threads" (1 to 32, logarithmic); series: Small locks, Large locks, SI-STM, Eager ASTM, Eager ASTM single validate, SXM]
Figure 7: Throughput results for the bank application.

If the number of accounts is large (1024) and, as a result,
write contention and the chance that an object gets updated
is low, SI-STM and single-validate ASTM outperform the
other STM variants. However, if there is more than one
thread per CPU, the throughput of the STMs using invisible
reads decreases because preemption of threads decreases the
chance of optimistically obtaining a consistent view.
To highlight the differences between STM designs that use
visible and invisible reads, Figure 8 shows the CPU time
required for one read operation for read-only transactions of
different sizes. In this micro-benchmark, 8 threads read the
given number of objects. All transactions read the same
objects (with the exception of the SXM benchmark run with
disjoint accesses) and there are no concurrent updates to
these objects. The fixed overhead of a transaction becomes
negligible when the number of objects read during the
transaction is high. SXM's visible reads have a higher overhead
than SI-STM's invisible reads. This overhead consists of the
costs of the CAS operation and possible cache misses and
CAS failures if transactions on different CPUs add themselves
to the reader list of the same object. ASTM has to
guarantee the consistency of reads by validating all objects
previously read in the transaction, which increases the
overhead of read operations when transactions get larger. Note
that, although not shown here, ASTM transactions with
only a single validate at the end of each transaction perform
very similarly to SI-STM.

[Plot: x-axis "Number of objects read by a transaction" (20 to 140); series: SI-STM, Eager ASTM, SXM, SXM with disjoint accesses]
Figure 8: SI-STM read overhead

SI-STM requires a central counter for the timestamps that
it needs for update transactions. SXM and ASTM do not
need such a counter, which is a source of contention if the
rate of commits is high. Figure 9 shows the overhead of
write operations in SI-STM by means of a micro-benchmark
similar to the one used for Figure 8. However, now the 8
threads write to disjoint, thread-local objects. Acquiring
timestamps induces a small overhead, which, however, becomes
negligible when at least 10 objects are written by a
transaction. Furthermore, the overhead is smaller than the costs
of a single write operation. However, the results in Figure 8
and Figure 9 are of course hardware-specific.

[Plot: x-axis "Number of objects written by a transaction" (5 to 30); series: SI-STM, SXM]
Figure 9: SI-STM write overhead
Figure 10 shows throughput results for two micro-bench- 
marks that are often used to evaluate STM implementations, 
namely integer sets implemented via sorted linked lists and 
skip lists. Each benchmark consists of read transactions, 
which determine whether an element is in the set, and up- 
date transactions, which either add or remove an element. 
For SI-STM, we present two results. First, modified imple- 
mentations of the integer sets that operate correctly when 
the STM provides snapshot-isolation, labeled as SI-safe; 
these variants were obtained by adding some write accesses 
and using the correctness conditions given in [16]. Second, 
the original (sequential) implementations (see Figure 6) that 
require strict transactional consistency and for which SI- 
STM is configured to ensure linearizability. Distinguishing 
between these variants allows us to show the performance 
impact of snapshot isolation and inexpensive validation
separately. We do not release objects early. Although early
release decreases the number of conflicts, it can mainly
be used in cases in which the access path to an object is
known. We use the linked list to conveniently model
transactions in which a modification takes place, which depends
on a large amount of data that might be modified by other
transactions. Note that, for this type of transaction, lazily
acquiring updated objects does not make much of a difference
because the update operations are near the end of the
transaction. Thus, using Eager ASTM should give representative
results.
For the skip list, STMs using invisible reads (ASTM and
SI-STM) show good scalability and outperform SXM, which
suffers from the contention on the reader lists. However, the
transactions in the linked list benchmark are quite large (the
integer sets contain 250 elements) and ASTM's validation is
expensive. SI-STM, on the contrary, uses version information
to compute the validity range much faster and scales
well up to the number of available CPUs.
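The validity-range computation referred to here follows the read rule given earlier: scan the committed versions from the most recent one and take the first whose range intersects the transaction's range. The sketch below assumes half-open ranges, as in the [31,38) example, and uses hypothetical names (Range, SnapshotReader); the range-extension step is left out.

```java
import java.util.List;

final class Range {
    static final long OPEN = Long.MAX_VALUE; // upper bound not yet known
    final long from, to;                     // half-open interval [from, to)

    Range(long from, long to) { this.from = from; this.to = to; }

    boolean isEmpty() { return from >= to; }

    Range intersect(Range o) {
        return new Range(Math.max(from, o.from), Math.min(to, o.to));
    }
}

final class SnapshotReader {
    Range validity = new Range(0, Range.OPEN); // validity range of the transaction

    // 'versions' is ordered from most recent to oldest; returns the index of the version read.
    int openRead(List<Range> versions) {
        for (int i = 0; i < versions.size(); i++) {
            Range r = validity.intersect(versions.get(i));
            if (!r.isEmpty()) {
                validity = r; // narrow the transaction's range
                return i;
            }
        }
        // The paper first tries to extend the range; this sketch aborts directly.
        throw new IllegalStateException("abort: no version intersects the validity range");
    }
}
```

Each read costs one interval intersection per inspected version, independent of how many objects the transaction has already read, which is the contrast with revalidating the whole read set on every access.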
The SI-safe variants perform better than the original im- 
plementations if the number of objects read by a transaction 
is large, as in the linked list benchmark. On the other hand, 
the overhead of the validation phase required to ensure lin- 
earizability is negligible in the skip list benchmark, where 
the number of read objects is smaller. Furthermore, transac- 
tions are shorter, which decreases the probability of concur- 
rent updates resulting in a failed validation. SI-STM enables 
the user to choose between both alternatives depending on 
application specifics and performance requirements. Note 
that SI-STM with linearizability still outperforms SXM and 
ASTM in most cases: applications can benefit from SI-STM 
even without using snapshot isolation and its additional en- 
gineering costs. 
For all benchmark results for SI-STM shown here, the
maximum number of versions kept per object was 8. During
several tests with these benchmarks, we have noticed that
the maximum number of versions often had only a small
influence on the throughput. Keeping one or two versions was
sufficient to achieve similar and sometimes even better
results than with 8 versions. We also found that, in our
benchmarks, single-version STMs and SI-STM are throughput-wise
similarly affected by garbage collection overheads when
the heap size is small. We are currently investigating how
weak references and proactively extending the validity range
affect the properties of SI-STM.
7. CONCLUSION 
We have designed, implemented, and evaluated a software
transactional memory architecture (SI-STM) based on a
variant of snapshot isolation. In this variant we use a
contention manager to support the first-committer-wins
principle. The performance of SI-STM is competitive even with
manual lock-based implementations that do not have the
overhead of AOP. Our benchmarks show that SI-STM
performs well, in particular for workloads with long
transactions. Our novel lazy snapshot algorithm
can reduce the validation cost in comparison to other
STMs with invisible reads like ASTM.
[Plots: recoverable panel titles "Linked List, 0% writes", "Skip List, 0% writes", "Skip List, 20% writes", "Skip List, 100% writes"; x-axis "Threads" (1 to 32); series: SXM, Eager ASTM, SI-STM, SI-STM SI-safe]
Figure 10: Throughput results for the linked list (top) and skip list (bottom) benchmarks.

8. REFERENCES
[1] H. Berenson, P. Bernstein, J. Gray, J. Melton,
E. O'Neil, and P. O'Neil. A critique of ANSI SQL
isolation levels. In Proceedings of SIGMOD, pages
1-10, 1995.
[2] P. A. Bernstein, V. Hadzilacos, and N. Goodman.
Concurrency Control and Recovery in Database
Systems. Addison-Wesley, 1987.
[3] J. Cachopo and A. Rito-Silva. Versioned boxes as the
basis for memory transactions. In Proceedings of
SCOOL, 2005.
[4] C. Cole and M. Herlihy. Snapshots and software
transactional memory. Science of Computer
Programming, 2005. To appear.
[5] S. Elnikety, W. Zwaenepoel, and F. Pedone. Database
replication using generalized snapshot isolation. In
Proceedings of SRDS, pages 73-84, Oct 2005.
[6] A. Fekete, D. Liarokapis, E. O'Neil, P. O'Neil, and
D. Shasha. Making snapshot isolation serializable.
ACM Transactions on Database Systems, 30(2), 2005.
[7] P. Felber and M. Reiter. Advanced concurrency
control in Java. Concurrency and Computation:
Practice & Experience, 14(4):261-285, 2002.
[8] R. Guerraoui, M. Herlihy, M. Kapalka, and
B. Pochon. Robust contention management in
software transactional memory. In Proceedings of
SCOOL, 2005.
[9] R. Guerraoui, M. Herlihy, and B. Pochon.
Polymorphic contention management. In Proceedings
of DISC, Sep 2005.
[10] R. Guerraoui, M. Herlihy, and S. Pochon. Toward a
theory of transactional contention managers. In
Proceedings of PODC, Jul 2005.
[11] T. Harris and K. Fraser. Language support for
lightweight transactions. In Proceedings of OOPSLA,
pages 388-402, Oct 2003.
[12] M. Herlihy. The transactional manifesto: software
engineering and non-blocking synchronization. In
Proceedings of PLDI, 2005.
[13] M. Herlihy, V. Luchangco, M. Moir, and W. Scherer
III. Software transactional memory for dynamic-sized
data structures. In Proceedings of PODC, pages
92-101, Jul 2003.
[14] M. P. Herlihy and J. M. Wing. Linearizability: a
correctness condition for concurrent objects. ACM
Trans. Program. Lang. Syst., 12(3):463-492, 1990.
[15] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda,
C. Lopes, J.-M. Loingtier, and J. Irwin.
Aspect-oriented programming. In Proceedings of
ECOOP, 1997.
[16] S. Lu, A. Bernstein, and P. Lewis. Correct execution
of transactions at different isolation levels. IEEE
Transactions on Knowledge and Data Engineering,
16(9):1070-1081, 2004.
[17] V. J. Marathe, W. N. Scherer III, and M. L. Scott.
Adaptive software transactional memory. In
Proceedings of DISC, pages 354-368, 2005.
[18] W. Scherer III and M. Scott. Advanced contention
management for dynamic software transactional
memory. In Proceedings of PODC, pages 240-248, Jul
2005.
[19] W. N. Scherer III and M. L. Scott. Contention
management in dynamic software transactional
memory. In Proceedings of the PODC Workshop on
Concurrency and Synchronization in Java Programs,
Jul 2004.
[20] N. Shavit and D. Touitou. Software transactional
memory. In Proceedings of PODC, Aug 1995.
















What Really Makes Transactions Faster?
Dave Dice
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803-0903
Nir Shavit
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803-0903
ABSTRACT
There has been a flurry of recent work on the design of high
performance software and hybrid hardware/software
transactional memories (STMs and HyTMs). This paper
reexamines the design decisions behind several of these
state-of-the-art algorithms, adopting some ideas, rejecting others,
all in an attempt to make STMs faster.
The results of our evaluation led us to the design of a
transactional locking (TL) algorithm which we believe to be
the simplest, most flexible, and best performing STM/HyTM
to date. It combines seamlessly with hardware transactions
and with any system's memory life-cycle, making it an ideal
candidate for multi-language deployment today, long before
hardware transactional support becomes commonly
available.
Most important of all, however, were the results we derived
from a comprehensive comparison of the performance of
non-blocking, lock-based, and hybrid STM algorithms versus
fine-grained hand-crafted ones. Contrary to our intuitions,
concurrent code generated in a mechanical fashion using our
TL algorithm and several other STMs scaled better than
the hand-crafted fine-grained lock-based and lock-free data
structures, even though their throughput was lower. We
found that it was the lower latency of the hand-crafted data
structures that made them faster than STMs, and not better
contention management or optimizations based on the
programmer's understanding of the particulars of the structure.
This holds great promise for future mechanical generation
of concurrent code using hardware transactional support.
1. INTRODUCTION
A goal of current multiprocessor software design is to
introduce parallelism into software applications by allowing
operations that do not conflict in accessing memory to
proceed concurrently. The key tool in designing concurrent
data structures has been the use of locks. Unfortunately,
coarse-grained locking is easy to program with, but
provides very poor performance because of limited parallelism.
Fine-grained lock-based concurrent data structures perform
exceptionally well, but designing them has long been
recognized as a difficult task better left to experts. If
concurrent programming is to become ubiquitous, researchers agree
that one must develop alternative approaches that simplify
code design and verification. This paper is interested in
"mechanical" methods for transforming sequential code or
coarse-grained lock-based code into concurrent code. By
mechanical we mean that the transformation, whether done
by hand, by a preprocessor, or by a compiler, does not
require any program specific information (such as the
programmer's understanding of the data flow relationships).
Moreover, we wish to focus on techniques that can be
deployed to deliver reasonable performance across a wide range
of systems today, yet combine easily with specialized
hardware support as it becomes available.
1.1 Transactional Programming 
The transactional memory programming paradigm [19] is 
gaining momentum as the approach of choice for replac- 
ing locks in concurrent programming. Combining sequences 
of concurrent operations into atomic transactions seems to 
promise a great reduction in the complexity of both pro- 
gramming and verification, by making parts of the code 
appear to be sequential without the need to program fine- 
grained locks. Transactions will hopefully remove from the 
programmer the burden of figuring out the interaction among 
concurrent operations that happen to conflict when access- 
ing the same locations in memory. Transactions that do 
not conflict in accessing memory will run uninterrupted in 
parallel, and those that do will be aborted and retried with- 
out the programmer having to worry about issues such as 
deadlock. There are currently proposals for hardware
implementations of transactional memory (HTM) [3, 11, 19,
30], purely software based ones, i.e. software transactional
memories (STM) [9, 13, 16, 18, 22, 23, 27, 31, 32, 33, 34],
and hybrid schemes (HyTM) that combine hardware and
software [4, 21, 27].¹
The dominant trend among transactional memory designs
seems to be that the transactions provided to the
programmer, in either hardware or software, should be "large scale",
that is, unbounded, and dynamic. Unbounded means that
there is no limit on the number of locations accessed by the
transaction. Dynamic (as opposed to static) means that the
set of locations accessed by the transaction is not known in
advance and is determined during its execution.
Providing large scale transactions in hardware tends to
introduce large degrees of complexity into the design [19,
30, 3, 11]. Providing them efficiently in software is a
difficult task, and there seem to be numerous design parameters
and approaches in the literature [9, 13, 16, 18, 23, 27, 31,
32], as well as requirements to combine well with hardware
transactions once those become available [4, 21, 27].
¹ A broad survey of prior art can be found in [13, 22, 29].
1.2 Software Transactional Memory 
The first STM design by Shavit and Touitou [33] provided 
a non-blocking implementation of static transactions. They 
had transactions maintain transaction records with read- 
write information, access locations in address order, and had 
transactions help those ahead of them in order to guarantee
progress. The first non-blocking dynamic schemes were
proposed by Herlihy et al. [18] in their dynamic STM (DSTM)
and by Fraser and Harris in their object-based STM [14]
(OSTM). The original DSTM was an excellent proof-of-concept,
and the first obstruction-free [17] STM, but
involved two levels of indirection in accessing data, and had
a costly Java™-based implementation. This Java-based
implementation was improved on later by the ASTM of
Marathe et al. [23]. The OSTM of Fraser and Harris took a
slightly different programming approach than DSTM, allow- 
ing programmers to open and close objects within a trans- 
action in order to improve performance based on the pro- 
grammer's understanding of the data structure being im- 
plemented. We found that the latest C-based versions of 
OSTM, which involve one level of indirection in accessing 
data, are the most efficient non-blocking STMs available to 
date [13]. A key element of being non-blocking is the main- 
tenance of publicly shared transaction records with undo or 
copy-back information. This tends to make the structures 
more susceptible to cache behavior, hurting overall perfor- 
mance. As our empirical data will show however, OSTM 
performs reasonably well across the concurrency range. 
A recent paper by Ennals [9] suggested that on modern
operating systems, deadlock avoidance is the only compelling
reason for making transactions non-blocking, and that there
is no reason to provide it for transactions at the user level.
We second this claim, noting that mechanisms already exist
whereby threads might yield their quanta to other threads
and that Solaris' schedctl allows threads to transiently
defer preemption while holding locks. Ennals [9] proposed an
all-software lock-based implementation of software
transactional memory using the object-based approach of [15]. His
idea was to have transactions acquire write locks as they
encounter locations to be written, writing the new values in
place and having pointers to an undo set that is not shared
with other threads (we call this approach encounter order;
it is typically used in conjunction with an undo set [31]). A
transaction collects a read-set which it validates before
committing and releasing the locks. If a transaction must abort,
its executing thread can restore the old values before
releasing the locks on the locations being written. The use of
locks eliminates the need for indirection and shared
transaction records as in the non-blocking STMs; it still requires,
however, a closed memory system. Deadlocks and livelocks
are dealt with using timeouts and the ability of transactions
to request other transactions to abort.
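The write path of such an encounter-order scheme can be sketched as follows. This is a simplified illustration and not Ennals's implementation: Cell and EncounterTx are hypothetical names, tryLock stands in for the timeout mechanism, and read-set collection and validation are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// One lockable memory location.
final class Cell {
    final ReentrantLock lock = new ReentrantLock();
    int value;
    Cell(int value) { this.value = value; }
}

final class EncounterTx {
    private final Set<Cell> locked = new LinkedHashSet<>();  // write locks held
    private final Deque<Runnable> undo = new ArrayDeque<>(); // private undo set

    // Acquire the write lock when the location is first encountered, then write in place.
    boolean write(Cell c, int v) {
        if (!locked.contains(c)) {
            if (!c.lock.tryLock()) {
                return false; // lock held elsewhere: the caller should abort
            }
            locked.add(c);
        }
        int old = c.value;
        undo.push(() -> c.value = old); // remember how to restore the old value
        c.value = v;
        return true;
    }

    void commit() {
        undo.clear();
        release();
    }

    // Restore old values, most recent write first, before releasing the locks.
    void abort() {
        while (!undo.isEmpty()) {
            undo.pop().run();
        }
        release();
    }

    private void release() {
        for (Cell c : locked) {
            c.lock.unlock();
        }
        locked.clear();
    }
}
```

Because new values are written in place, a reader that ignored the lock could observe uncommitted data, which is why a full design pairs this write path with read-set validation at commit time.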
As we show, Ennals's algorithm exhibits impressive
performance on several benchmarks. It is not clear why his work
has not gained more recognition. A recent paper by Saha et
al. [31], concurrent and independent of our own work, uses
a version of Ennals's lock-based algorithm within a
run-time system. It uses encounter order, but also keeps shared
undo sets to allow transactions to actively abort others.
Moir [27] has suggested that the pointers to transaction 
records in non-blocking transactions can be used to coor- 
dinate hardware and software transactions to form hybrid 
transactional schemes. His HybridTM scheme has an im- 
plementation that acquires locks in encounter order. 
Our paper reexamines the design decisions behind these 
state-of-the-art STM algorithms. Building on the body of 
prior art together with our new understanding of what makes 
software transactions fast, we introduce the transactional 
locking (TL) algorithm which we believe to be the simplest, 
most flexible, and best performing STM/HyTM to date. 
1.3 Our Findings 
The following are some of the results and conclusions pre- 
sented in this paper: 
• Ennals [9] suggested building deadlock-free lock-based 
STMs rather than non-blocking ones [13, 27]. Our empirical 
findings support Ennals's claims: non-blocking 
transactions [13, 27] were less efficient than our TL 
lock-based ones on a variety of data structures and 
across concurrency ranges, even when they used a more 
complex yet advantageous non-mechanical program- 
ming interface [13]. Given that, as we show, locks 
provide a simple interface to hardware transactions, 
we recommend that the design of HyTMs shift from 
non-blocking to lock-based algorithms. 
• Both Ennals and Saha et al. [9, 31] have transactions 
acquire write locks as they encounter them (an "undo-set" 
algorithm). Saha et al. [31] claim that this is a conscious 
design choice. Both of the above papers failed 
to observe that encounter order transactions perform 
well on uncontended data structures but degrade on 
contended ones. We use variations of our TL algorithm 
to show that this degradation is inherent to encounter 
order lock acquisition. 
• In its default operational mode, our new TL algorithm 
acquires locks only at commit time, using a Bloom filter 
[5] for fast look-aside into the write-buffer to allow 
reads to always view a consistent state of the transaction's 
own modified locations. Slow look-aside was cited by Saha et 
al. [31] as a reason for choosing encounter order locking 
and undo writing in their algorithm (one should 
note though that we do not support nesting in our 
STM). As we explain, unlike encounter order locking 
which seems to require type-stable memory or special- 
ized malloc/free implementations, commit time lock- 
ing fits well with the memory lifecycle in languages like 
C and C++, allowing transactionally accessed mem- 
ory to be moved in and out of the general memory pool 
using regular malloc and free operations. 
• Of all the algorithms we tested, lock-free or lock-based, 
the TL algorithm that acquires locks at commit 
time is the only one that exhibits scalability across 
all contention ranges. Moreover, we found the advantage 
of encounter order algorithms, when they do exhibit 
better performance, to be small enough so as to 
bring us to conclude that even from a pure performance 
standpoint, one should always default to using 
commit time locking. 
• Both Ennals and Saha et al. [9, 31] provide mechanisms 
for one transaction to abort another to allow progress. 
In the case of Saha et al. this mechanism might add a 
significant cost to the implementation because write- 
sets must be shared so one transaction can completely 
undo another. We claim these mechanisms are unnec- 
essary, and show that they can be effectively replaced 
by time-outs. 
• Perhaps most importantly, we show that concurrent 
code generated mechanically using our new TL algo- 
rithm has scalability curves that are superior to those 
of all fine-grained hand-crafted data structures even 
when varying size and contention level. This implies 
that contrary to our belief, it is the overhead of the 
STM implementations (measured, for example, by sin- 
gle thread performance cost) that limits their perfor- 
mance, not the superior contention management hand- 
crafted structures can deliver based on the program- 
mer's understanding of the data structures (This is not 
to say that there aren't structures where hand-crafting 
will increase scalability to a point where it dominates 
performance). Lower overheads benefit transactions 
in two ways: (1) shorter transactions are less exposed 
to interference and (2) shorter transactions imply a 
higher rate of arrival at the commit point. We are 
in the process of collecting more data to support this 
claim. 
• Finally, our findings bode well for HTM support, which 
we expect will suffer from the same abort rates as our 
TL algorithm, yet will reduce the overhead of opera- 
tions significantly. For HTM designers, our findings 
suggest that hardware transactional design should fo- 
cus on overhead reduction. 
In summary, TL's superior performance together with the 
fact that it combines seamlessly with hardware transactions 
and with any system's memory life-cycle, make it an ideal 
candidate for multi-language deployment today, long before 
hardware transactional support becomes commonly avail- 
able. 
2. TRANSACTIONAL LOCKING 
The transactional locking approach is thus that rather 
than trying to improve on hand-crafted lock-based imple- 
mentations by being non-blocking, we try and build lock- 
based STMs that will get us as close to their performance 
as one can with a completely mechanical approach, that is, 
one that simplifies the job of the concurrent programmer. 
Our algorithm operates in two modes which we will call 
encounter mode and commit mode. These modes indicate 
how locks are acquired and how transactions are committed 
or aborted. We will begin by describing our commit mode 
algorithm, later explaining how TL operates in encounter 
mode similar to algorithms by Ennals [9] and Saha et al. 
[31]. The availability of both modes will allow us to show 
the performance differences between them. 
We associate a special versioned-write-lock with every trans- 
acted memory location. A versioned-write-lock is a simple 
single-word spinlock that uses a compare-and-swap (CAS) 
operation to acquire the lock and a store to release it. Since 
one only needs a single bit to indicate that the lock is taken, 
we use the rest of the lock word to hold a version number. 
This number is incremented by every successful lock-release. 
In encounter mode the version number is displaced and a 
pointer into a thread's private undo log is installed. 
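The lock word described above can be modeled in a few lines. The following is a minimal sketch in Python, not the authors' implementation. It assumes the lock bit is the low-order bit and the version occupies the remaining bits (the text does not fix the layout). The names VersionedWriteLock, try_acquire, and release are illustrative, and the compare-and-swap is emulated with a threading.Lock rather than a hardware instruction.

```python
import threading

LOCK_BIT = 1  # assumed layout: bit 0 = lock bit, remaining bits = version


class VersionedWriteLock:
    """Single-word lock: low bit marks 'held', upper bits hold a version."""

    def __init__(self):
        self.word = 0                    # version 0, unlocked
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def cas(self, expected, new):
        """Emulated compare-and-swap on the lock word."""
        with self._guard:
            if self.word == expected:
                self.word = new
                return True
            return False

    def is_locked(self):
        return bool(self.word & LOCK_BIT)

    def version(self):
        return self.word >> 1

    def try_acquire(self):
        """One CAS attempt: succeeds only if the lock bit is clear."""
        seen = self.word
        if seen & LOCK_BIT:
            return False
        return self.cas(seen, seen | LOCK_BIT)

    def release(self):
        """Plain store: clear the lock bit and advance the version by one."""
        self.word = (self.version() + 1) << 1
```

Because the version shares the word with the lock bit, a reader can sample the word once to learn both whether the location is locked and which version it observed.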
We allocate a collection of versioned-write-locks. We use 
various schemes for associating locks with shared memory: 
per object (PO), where a lock is assigned per shared object, 
per stripe (PS), where we allocate a separate large array of 
locks and memory is striped (divided up) using some hash 
function to map each location to a separate stripe, and per 
word (PW) where each transactionally referenced variable 
(word) is collocated adjacent to a lock. Other mappings 
between transactional shared variables and locks are possible. 
The PW and PO schemes require either manual or 
compiler-assisted automatic placement of lock fields whereas PS 
can be used with unmodified data structures. Since in general 
PO showed better performance than PW we will focus 
on PO and do not discuss PW further. PO might be implemented, 
for instance, by leveraging the header words of 
Java™ objects [2, 8]. A single PS stripe-lock array may be 
shared and used for different TL data structures within a 
single address-space. For instance an application with two 
distinct TL red-black trees and three TL hash-tables could 
use a single PS array for all TL locks. As our default mapping 
we chose an array of 2^20 entries of 32-bit lock words 
with the mapping function masking the variable address 
with "0x3FFFFC" and then adding in the base address of 
the lock array to derive the lock address. 
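The default PS mapping can be illustrated with a short sketch. It assumes addresses are plain integers; the function names lock_address and lock_index and the base address used in the usage note are hypothetical. Only the mask, the 32-bit lock width, and the add-the-base step come from the text.

```python
STRIPE_MASK = 0x3FFFFC   # mask from the text: keeps address bits 2..21
LOCK_BYTES = 4           # each lock word is 32 bits
NUM_LOCKS = 1 << 20      # 20 index bits survive the mask


def lock_address(var_addr, lock_array_base):
    """Byte address of the stripe lock covering var_addr."""
    return lock_array_base + (var_addr & STRIPE_MASK)


def lock_index(var_addr):
    """Index of that lock within the array of 32-bit lock words."""
    return (var_addr & STRIPE_MASK) // LOCK_BYTES
```

For example, with a hypothetical base of 0x10000000, the variables at 0x1234 and 0x401234 map to the same lock word, which shows that unrelated locations can alias onto one stripe.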
The following is a description of the PS algorithm al- 
though most of the details carry through verbatim for PO 
and PW as well. We maintain thread local read- and write- 
sets as linked lists. A read-set entry contains the address of 
the lock and the observed version number of the lock asso- 
ciated with the transactionally loaded variable. A write-set 
entry contains the address of the variable, the value to be 
written to the variable, and the address of the associated 
lock. The write-set is kept in chronological order to avoid 
write-after-write hazards. 
2.1 Commit Mode 
We now describe how TL executes a sequential code frag- 
ment that was placed within a TL transaction. We use our 
preferred commit mode algorithm. As we explain, this mode 
does not require type-stable garbage collection, and works 
seamlessly with the memory life-cycle of languages like C 
and C++. 
1. Run the transactional code, reading the locks of all 
fetched-from shared locations and building a local read- 
set and write-set (use a safe load operation to avoid 
de-referencing invalid pointers as a result of reading an 
inconsistent view of memory). 
A transactional load first checks (using a Bloom filter 
[5]) to see if the load address appears in the write-set. 
If so the transactional load returns the last value writ- 
ten to the address. This provides the illusion of pro- 
cessor consistency and avoids so-called read-after-write 
hazards. If the address is not found in the write-set the 
load operation then fetches the lock value associated 
with the variable, saving the version in the read-set, 
and then fetches from the actual shared variable. If the 
transactional load operation finds the variable locked 
the load may either spin until the lock is released or 
abort the operation. 
Transactional stores to shared locations are handled 
by saving the address and value into the thread's local 
write-set. The shared variables are not modified 
during this step. That is, transactional stores are de- 
ferred and contingent upon successfully completing the 
transaction. During the operation of the transaction 
we periodically validate the read-set. If the read-set 
is found to be invalid we abort the transaction. This 
prevents a doomed transaction (a transaction 
that has read inconsistent global state) from 
becoming trapped in an infinite loop. 
2. Attempt to commit the transaction. Acquire the locks 
of locations to be written. If a lock in the write-set 
(or more precisely a lock associated with a location 
in the write-set) also appears in the read-set then the 
acquire operation must atomically (a) acquire the lock 
and, (b) validate that the current lock version subfield 
agrees with the version found in the earliest read-entry 
associated with that same lock. An atomic CAS can 
accomplish both (a) and (b). Acquire the locks in 
any convenient order using bounded spinning to avoid 
indefinite deadlock. 
3. Re-read the locks of all read-only locations to make 
sure version numbers haven't changed. If a version 
does not match, roll-back (release) the locks, abort 
the transaction, and retry. 
4. The prior observed reads in step (1) have been validated 
as forming an atomic snapshot of memory [1]. 
The transaction is now committed. Write-back all the 
entries from the local write-set to the appropriate shared 
variables. 
5. Release all the locks identified in the write-set by atom- 
ically incrementing the version and clearing the write- 
lock bit (using a simple store). 
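The five steps above can be sketched as a single-threaded Python model. This is an illustration under stated assumptions, not the authors' code. It assumes one lock word per address (closer to PW than PS) and the same lock-bit layout as before. A dict stands in for the chronologically ordered write-set list, a locked location causes an abort rather than a bounded spin, and the CAS of step 2 is emulated by a compare followed by a store. Memory, CommitModeTx, and Abort are hypothetical names.

```python
LOCK_BIT = 1  # assumed layout: bit 0 = lock bit, remaining bits = version


class Abort(Exception):
    """Raised when a transaction must roll back and retry."""


class Memory:
    """Toy shared memory: values and one versioned lock word per address."""

    def __init__(self):
        self.values = {}
        self.locks = {}

    def lock_word(self, addr):
        return self.locks.get(addr, 0)


class CommitModeTx:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = []    # (address, observed lock word)
        self.write_set = {}   # address -> deferred value (insertion ordered)

    def load(self, addr):
        if addr in self.write_set:            # look-aside: read our own write
            return self.write_set[addr]
        word = self.mem.lock_word(addr)
        if word & LOCK_BIT:                   # locked: abort (could also spin)
            raise Abort()
        self.read_set.append((addr, word))
        return self.mem.values.get(addr)

    def store(self, addr, value):
        self.write_set[addr] = value          # deferred until commit

    def commit(self):
        seen = dict(reversed(self.read_set))  # earliest observation per address
        held = []
        try:
            # Step 2: acquire write locks. For locations that were also read,
            # the expected word is the one observed earlier, so the emulated
            # CAS both acquires the lock and validates the version.
            for addr in self.write_set:
                word = self.mem.lock_word(addr)
                if word & LOCK_BIT or word != seen.get(addr, word):
                    raise Abort()
                self.mem.locks[addr] = word | LOCK_BIT
                held.append((addr, word))
            # Step 3: revalidate read-only locations.
            for addr, word in self.read_set:
                if addr not in self.write_set and self.mem.lock_word(addr) != word:
                    raise Abort()
        except Abort:
            for addr, word in held:           # roll back: restore lock words
                self.mem.locks[addr] = word
            raise
        # Step 4: write back. Step 5: release with an incremented version.
        for addr, value in self.write_set.items():
            self.mem.values[addr] = value
        for addr, word in held:
            self.mem.locks[addr] = ((word >> 1) + 1) << 1
```

In this model a transaction that read a location later overwritten by another committed transaction fails at step 3 and leaves both the data and the lock words untouched.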
A few things to note. The write-locks have been held for 
a brief time when attempting to commit the transaction. 
This helps improve performance under high contention. The 
Bloom filter allows us to determine if a value is not in the 
write-set and need not be searched for by reading the sin- 
gle filter word. Though locks could have been acquired in 
ascending address order to avoid deadlock, we found that 
sorting the addresses in the write-set was not worth the ef- 
fort. 
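The single-word filter check mentioned above might look like the following sketch. The text does not give the hash function or the filter width, so the 32-bit width, the single hash, and the name WriteSetFilter are assumptions made for illustration.

```python
FILTER_BITS = 32  # assumed width of the single filter word


class WriteSetFilter:
    """One-word Bloom filter over write-set addresses (one hash function)."""

    def __init__(self):
        self.word = 0

    @staticmethod
    def _bit(addr):
        # Hypothetical hash: drop the two alignment bits, fold into 32 bits.
        return 1 << ((addr >> 2) % FILTER_BITS)

    def add(self, addr):
        self.word |= self._bit(addr)

    def may_contain(self, addr):
        """False: definitely absent. True: the write-set must be searched."""
        return bool(self.word & self._bit(addr))
```

A negative answer lets a transactional load skip the write-set search entirely; a positive answer may be a false positive, so the write-set is still consulted.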
2.2 Encounter Mode 
The following is the TL encounter mode transaction. For 
reasons we explain later, this mode assumes a type-stable 
closed memory pool or garbage collection. 
1. Run the transactional code, reading the locks of all 
fetched-from shared locations and building a local read- 
set and write-set (the write-set is an undo set of the 
values before the transactional writes). 
Transactional stores to shared locations are handled 
by acquiring locks as they are encountered, saving the 
address and current value into the thread's local write- 
set, and pointing from the lock to the write-set entry. 
The shared variables are written with the new value 
during this step. 
A transactional load checks to see if the lock is free or 
is held by the current transaction and if so reads the 
value from the location. There is thus no need to look 
for the value in the write-set. If the transactional load 
operation finds that the lock is held it will spin. During 
the operation of the transaction we periodically vali- 
date the read-set. If the read-set is found to be invalid 
we abort the transaction. This prevents 
a doomed transaction (a transaction that has read 
inconsistent global state) from becoming trapped in an 
infinite loop. 
2. Attempt to commit the transaction. Acquire the locks 
associated with the write-set in any convenient order, 
using bounded spinning to avoid deadlock. 
3. Re-read the locks of all read-only locations to make 
sure version numbers haven't changed. If a version 
does not match, restore the values using the write-set, 
roll-back (release) the locks, abort the transaction, and 
retry. 
4. The prior observed reads in step (1) have been vali- 
dated as forming an atomic snapshot of memory. The 
transaction is now committed. 
5. Release all the locks identified in the write-set by atomically 
incrementing the version and clearing the 
write-lock bit. 
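For contrast with commit mode, the encounter mode steps can be sketched as a single-threaded Python model. It rests on several assumptions: one lock word per address, an abort instead of a bounded spin on a held lock, and an undo log that also remembers the displaced lock word in place of the pointer the real lock word would hold. EncounterModeTx and Abort are hypothetical names.

```python
LOCK_BIT = 1  # assumed layout: bit 0 = lock bit, remaining bits = version


class Abort(Exception):
    """Raised when a transaction must roll back and retry."""


class EncounterModeTx:
    def __init__(self, values, locks):
        self.values = values      # shared: address -> value
        self.locks = locks        # shared: address -> lock word
        self.read_set = []        # (address, observed lock word)
        self.undo_log = {}        # address -> (old value, old lock word)

    def store(self, addr, value):
        if addr not in self.undo_log:
            word = self.locks.get(addr, 0)
            if word & LOCK_BIT:                   # held by another transaction
                raise Abort()
            self.undo_log[addr] = (self.values.get(addr), word)
            self.locks[addr] = word | LOCK_BIT    # acquire at encounter time
        self.values[addr] = value                 # write in place

    def load(self, addr):
        if addr in self.undo_log:                 # we hold the lock: read in place
            return self.values[addr]
        word = self.locks.get(addr, 0)
        if word & LOCK_BIT:
            raise Abort()
        self.read_set.append((addr, word))
        return self.values.get(addr)

    def abort(self):
        for addr, (old_value, old_word) in self.undo_log.items():
            self.values[addr] = old_value         # restore the value...
            self.locks[addr] = old_word           # ...then release the lock
        self.undo_log.clear()

    def commit(self):
        for addr, word in self.read_set:
            if addr in self.undo_log:
                current = self.undo_log[addr][1]  # word seen when we locked it
            else:
                current = self.locks.get(addr, 0)
            if current != word:
                self.abort()
                raise Abort()
        for addr, (_, old_word) in self.undo_log.items():
            self.locks[addr] = ((old_word >> 1) + 1) << 1
        self.undo_log.clear()
```

Note that loads never search a write-set here, but the lock on each written location is held from the first store until commit or abort.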
We note that the locks in encounter mode are held for a 
longer duration than in commit mode, which accounts for 
weaker performance under contention. However, one does 
not need to look-aside and search through the write-set for 
every read. 
2.3 Contention Management 
As described above TL admits live-lock failure. Consider 
the case where thread T1's read-set is A and its write-set is B. T2's 
read-set is B and its write-set is A. T1 tries to commit and locks 
B. T2 tries to commit and acquires A. T1 validates A, in its 
read-set, and aborts as A is locked by T2. T2 validates B 
in its read-set and aborts as B was locked by T1. We have 
mutual abort with no progress. To provide liveness we use 
bounded spin and a back-off delay at abort-time, similar in 
spirit to that found in CSMA-CD MAC protocols. The delay 
interval is a function of (a) a random number generated at 
abort-time, (b) the length of the prior (aborted) write-set, 
and (c) the number of prior aborts by the current thread for 
this transactional attempt. 
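The text names the three inputs to the delay but not the function that combines them. The sketch below shows one plausible combination and should be read as an assumption throughout: the exponential growth in the abort count, the linear scaling by write-set length, the time unit, and the cap are all illustrative choices, as is the name backoff_delay.

```python
import random


def backoff_delay(write_set_len, prior_aborts, rng=random, unit=1e-6, cap=1e-2):
    """Randomized, growing delay (seconds) before retrying an aborted transaction."""
    # The window grows exponentially with the number of aborts and linearly
    # with the length of the aborted write-set; both choices are assumptions.
    window = unit * max(1, write_set_len) * (2 ** min(prior_aborts, 16))
    return rng.uniform(0.0, min(window, cap))
```

The random component is what breaks the symmetry in the mutual-abort example: two threads that abort together are unlikely to retry at the same instant.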
2.4 The Pathology of Transactional Memory 
Management 
For type-safe garbage collected managed runtime environ- 
ments such as Java any of the TL lock-mapping policies (PS, 
PO, or PW) and modes (Commit or Encounter) are safe, as 
the GC assures that transactionally accessed memory will 
only be released once no references remain to the object. In 
C or C++ TL preferentially uses the PS/Commit locking 
scheme to allow the C programmer to use normal malloc() 
and free() operations to manage the lifecycle of structures 
containing transactionally accessed shared variables. Using 

Concurrent mixed-mode transactional and non-transactional 
accesses are proscribed. When a particular object is be- 
ing accessed with transactional load and store operations it 
must not be accessed with normal non-transactional load 
and store operations. (When any accesses to an object are 
transactional, all accesses must be transactional.) In PS/Commit 
mode an object can exit the transactional domain 
and subsequently be accessed with normal non-transactional 
loads and stores, but we must wait for the object to quiesce 
before it leaves. There can be at most one transaction holding 
the transactional lock, and quiescing means waiting for 
that lock to be released, implying that all pending transactional 
stores to the location have been "drained", before 
allowing the object to exit the transactional domain and 
subsequently to be accessed with normal load and store operations. 
Once it has quiesced, the memory can be freed and 
recycled in a normal fashion, because any transaction that 
may acquire the lock and reach the disconnected location 
will fail its read-set validation. 
To motivate the need for quiescing, consider the following 
scenario with PS/Commit. We have a linked list of 3 nodes 
identified by addresses A, B and C. A node contains Key, 
Value and Next fields. The data structure implements a traditional 
key-value mapping. The key-value map (the linked 
list) is protected by TL using PS. Node A's Key field contains 
1, its Value field contains 1001 and its Next field refers 
to B. B's Key field contains 2, its Value field contains 1002 
and its Next field refers to C. C's Key field contains 3, its 
Value field 1003 and its Next field is NULL. Thread T1 calls 
put(2, 2002). The TL-based put() operator traverses the 
linked list using transactional loads and finds node B with 
a key value of 2. T1 then executes a transactional store 
into B.Value to change 1002 to 2002. T1's read-set consists 
of A.Key, A.Next, B.Key and the write-set consists of 
B.Value. T1 attempts to commit; it acquires the lock covering 
B.Value and then validates that the previously fetched 
read-set is consistent by checking the version numbers in 
the locks covering the read-set. Thread T1 stalls. Thread 
T2 executes delete(2). The delete() operator traverses the 
linked list and attempts to splice out node B by setting 
A.Next to C. T2 successfully commits. The commit operator 
stores C into A.Next. T2's transaction completes. T2 
then calls free(B). T1 resumes in the midst of its commit 
and stores into B.Value. We have a classic modify-after-free 
pathology. To avoid such problems T2 calls quiesce(B) after 
the commit finishes but before free()ing B. This allows T1's 
latent transactional store to drain into B before B is free()ed 
and potentially reused. Note, however, that TL (using quiescing) 
did not admit any outcomes that were not already 
possible under a simple coarse-grained lock. Any thread 
that attempts to write into B will, at commit-time, acquire 
the lock covering B, validate A.Next and then store into B. 
Once B has been unlinked there can be at most one thread 
that has successfully committed and is in the process of writing 
into B. Other transactions attempting to write into B 
will fail read-set validation at commit-time as A.Next has 
changed. 
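The quiesce(B) call in this scenario can be sketched as a wait on the lock words that cover the unlinked object. The text gives no signature for quiesce, so the one below is hypothetical: lock words are modeled as a dict of integers with the low bit as the lock bit, and the wait sleeps briefly where a C implementation would spin or yield.

```python
import threading
import time

LOCK_BIT = 1  # assumed layout: bit 0 = lock bit, remaining bits = version


def quiesce(lock_words, lock_ids, pause=1e-4):
    """Wait until none of the locks covering an unlinked object is held."""
    for lock_id in lock_ids:
        while lock_words[lock_id] & LOCK_BIT:
            time.sleep(pause)   # a C implementation would spin or yield here
```

Only after quiesce returns would the deleting thread hand the memory back to free(), since at that point any latent write-back by a stalled committer has completed.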
Consider another problematic lifecycle scenario 
based on the A, B, C linked list above. Let's say we're using 
TL in the C language to moderate concurrent access to 
the list, but with either PO or PW mode where the lock 
word(s) are embedded in the node. Thread T1 calls put(2, 
2002). The TL-based put() method traverses the list and 
locates node B having a key value of 2. Thread T2 then 
calls delete(2). The delete() operator commits successfully. 
T2 waits for B to quiesce and then calls free(B). The memory 
underlying B is recycled and used by some other thread 
T3. T1 attempts to commit by acquiring the lock covering 
B.Value. The lock-word is collocated with B.Value, so 
the CAS operation transiently changes the lock-word contents. 
T1 then validates the read-set, recognizes that A.Next 
changed (because of T2's delete()) and aborts, restoring the 
original lock-word value. T1 has caused the memory word 
underlying the lock for B.Value to "flicker", however. Such 
modifications are unacceptable; we have a classic modify-after-free 
error. 
Finally, consider the following pathological scenario admitted 
by PS/Encounter. T1 calls put(2, 2002). put() traverses 
the list and locates node B. T2 then calls delete(2), 
commits successfully, calls quiesce(B) and free(B). T1 acquires 
the lock covering B.Value, saves the original B.Value 
(1002) into its private write undo log, and then stores 2002 
into B.Value. Later, during read-set validation at commit-time, 
T1 will discover that its read-set is invalid and abort, 
rolling back B.Value from 2002 to 1002. As above, this constitutes 
a modify-after-free pathology where B was recycled, but 
B.Value transiently "flickered" from 1002 to 2002 to 1002. 
We can avoid this problem by enhancing the encounter protocol 
to validate the read-set after each lock acquisition but 
before storing into the shared variable. This confers safety, 
but at a cost in performance. 
As such, we advocate using PS/Commit for normal C code 
as the lock-words (metadata) are stored separately in type- 
stable memory distinct from the data protected by the locks. 
This provision can be relaxed if the C-code uses some type 
of garbage collection (such as Boehm-style [6] conservative 
garbage collection for C, Michael-style hazard pointers [25] 
or Fraser-style Epoch-Based Reclamation [10]) or type-stable 
storage for the nodes. 
2.5 Mechanical Transformation of Sequential 
Code 
As we discussed earlier, the algorithm we describe can be 
added to code in a mechanical fashion, that is, without un- 
derstanding anything about how the code works or what the 
program itself does. In our benchmarks, we performed the 
transformation by hand. We do however believe that it may 
be feasible to automate this process and allow a compiler to 
perform the transformation given a few rather simple limi- 
tations on the code structure within a transaction. 
We note that hand-crafted data structures can always 
have an advantage over TL, as TL has no way of know- 
ing that prior loads executed within a transaction might no 
longer have any bearing on results produced by the transaction. 
Consider the following scenario where we have a TL-protected 
hashtable. Thread T1 traverses a long hash bucket chain 
searching for the value associated with a certain key, iterating 
over "next" fields. We'll say that T1 locates the 
appropriate node at or near the end of the linked list. T2 
concurrently deletes an unrelated node earlier in the same 
linked list. T2 commits. At commit-time T1 will abort because 
the linked-list "next" field written to by T2 is in T1's 
read-set. T1 must retry the lookup operation (ostensibly 
locating the same node). Given our domain-specific knowl- 
edge of the linked list we understand that the lookup and 

allowed to operate concurrently with no aborts. A clever 
"hand over hand" hand-coded locking scheme would have 
the advantage of allowing this desired concurrency. Never- 
theless, as our empirical analysis later in the paper shows, 
in the data structure we tested, the beneficial effect of this 
added concurrency on overall application scalability does not 
seem to be as profound as one would think. 
2.6 Software-Hardware Inter-Operability 
Though we have described TL as a software based scheme, 
it can be made inter-operable with HTM systems. 
On a machine supporting dynamic hardware, transactions 
executed in hardware need only verify for each location that 
they read or write that the associated versioned-write-lock is 
free. There is no need for the hardware transaction to store 
an intermediate locked state into the lock word(s). For ev- 
ery write they also need to update the version number of 
the associated stripe lock upon completion. This suffices 
to provide inter-operability between hardware and software 
transactions. Any software read will detect concurrent modifications 
of locations by hardware writes because the version 
number of the associated lock will have changed. Any 
hardware transaction will fail if a concurrent software trans- 
action is holding the lock to write. Software transactions 
attempting to write will also fail in acquiring a lock on a 
location since lock acquisition is done using an atomic hard- 
ware synchronization operation (such as CAS or a single 
location transaction) which will fail if the version number of 
the location was modified by the hardware transaction. 
3. AN EMPIRICAL EVALUATION OF STM 
PERFORMANCE 
We present here a comparison of algorithms representing 
state-of-the-art non-blocking [13] and lock-based [9] STMs 
on a set of microbenchmarks that include the now standard 
concurrent red-black tree structure [18], as well as concurrent 
skiplists [13] and a concurrent shared queue [26]. 
The red-black tree tested with transactional locking was 
derived from the java.util.TreeMap implementation found 
in the Java 6.0 JDK. That implementation was written by 
Doug Lea and Josh Bloch. In turn, parts of the Java TreeMap 
were derived from Cormen et al. [7]. The skiplist was derived 
from Pugh [28]. We would have preferred to use the 
exact Fraser-Harris red-black tree but that code was written 
to their specific transactional interface and could not 
readily be converted to a simple form. We use large and 
small versions of the data structures, with 20,000 keys or 
200 keys. We found little difference when we further in- 
creased the size of the trees a hundred-fold. 
The skiplist and red-black tree implementations expose a 
key-value pair interface of put, delete, and get operations. 
The put operation installs a key-value pair. If the key is not 
present in the data structure put will insert a new element 
describing the key-value pair. If the key is already present 
in the data structure put will simply update the value as- 
sociated with the existing key. The get operation queries 
the value for a given key, returning an indication if the key 
was present in the data structure. Finally, delete removes 
a key from the data structure, returning an indication if 
the key was found to be present in the data structure. The 
benchmark harness calls put, get and delete to operate on 
the underlying data structure. The harness allows for the 
proportion of put, get and delete operations to be varied by 
way of command line arguments, as well as the number of 
threads, trial duration, initial number of key-value pairs to 
be installed in the data structure, and the key-range. The 
key range describes the maximum possible size (capacity) of 
the data structure. 
The harness spawns the specified number of threads. Each 
of the threads loops, and in each iteration the thread first 
computes a uniformly chosen random number used to select, 
in proportion to the command line arguments mentioned above, 
if the operation to be performed will be a put, get or delete. 
The thread then generates a uniformly selected random key 
within the key range, and, if the operation is a put, a random 
value. The thread then calls put, get or delete accordingly. 
All threads operate on a single shared data structure. At 
the end of the timing interval specified on the command 
line the harness reports the aggregate number of operations 
(iterations) completed by the set of threads. 
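The harness loop described above can be sketched as follows. This is an illustrative Python reconstruction from the prose, not the benchmark's actual code. The function name run_harness, its parameters, and the LockedDict stand-in (a dictionary behind one coarse lock) are all assumptions.

```python
import random
import threading
import time


def run_harness(table, threads, seconds, key_range, put_pct, delete_pct, seed=0):
    """Return the aggregate number of operations completed by all threads."""
    counts = [0] * threads
    deadline = time.monotonic() + seconds

    def worker(i):
        rng = random.Random(seed + i)
        while time.monotonic() < deadline:
            pick = rng.uniform(0, 100)        # selects the operation
            key = rng.randrange(key_range)    # uniform key within the key range
            if pick < put_pct:
                table.put(key, rng.random())  # random value for puts
            elif pick < put_pct + delete_pct:
                table.delete(key)
            else:
                table.get(key)
            counts[i] += 1

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(counts)


class LockedDict:
    """Stand-in data structure guarded by one coarse lock."""

    def __init__(self):
        self._d = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._d[key] = value

    def get(self, key):
        with self._lock:
            return self._d.get(key)

    def delete(self, key):
        with self._lock:
            return self._d.pop(key, None) is not None
```

A call such as run_harness(LockedDict(), threads=4, seconds=1.0, key_range=200, put_pct=20, delete_pct=20) would correspond to the small-structure 20%/20%/60% mix, with the returned count playing the role of the reported throughput.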
For our experiments we used a 16-processor Sun Fire™ 
V890, which is a cache coherent multiprocessor with 1.35 GHz 
UltraSPARC-IV® processors running Solaris™ 10. 
Our benchmarked algorithms included: 
Mutex, SpinLock, MCSLock We implemented three variations 
of mutual exclusion locks. Mutex is a Solaris 
Pthreads mutex, SpinLock is a lock implemented with 
a CAS-based test-and-test-and-set [20], and MCSLock 
is the queue lock of Mellor-Crummey and Scott [24]. 
stm-fraser This is the state-of-the-art non-blocking STM 
of Harris and Fraser [13]. We use the name originally 
given to the program by its authors. It has a spe- 
cial record per object with a pointer to a transaction 
record. The transformation of sequential to transac- 
tional code is not mechanical: the programmer speci- 
fies when objects are transactionally opened and closed 
to improve performance. 
stm-ennals This is the lock-based encounter order object- 
based STM algorithm of Ennals taken from [9] and 
provided in LibLTX [13]. Note that LibLTX includes 
the original Fraser and Harris lockfree-lib package. It 
uses a lock per object and the non-mechanical object-based 
interface of [13]. Though we did not have access 
to code for the Saha et al. algorithm [31], we believe the 
Ennals algorithm to be a good representative of this class 
of algorithms, with the possible benefit that the Ennals 
structures were written using the non-mechanical 
object-based interface of [13] and because unlike Saha 
et al., Ennals's write-set is not shared among threads. 
TL Our new transactional locking algorithm. We use the 
notation TL/Enc/PO for example to denote a version 
of the algorithm that uses encounter mode lock acqui- 
sition and per-object locking. We alternately also use 
commit mode (CMT) or per-stripe locking (PS). 
hanke This is the hand-crafted lock-based concurrent re- 
laxed red-black tree implementation of Hanke [12] as 
coded by Fraser [13]. The idea of relaxed balancing 
is to uncouple the re-balancing from the updating in 
order to speed up the update operations and to allow 
a high degree of concurrency. The algorithm also uses 
an understanding of the structure's data relationships 
to allow traversals of the data structure to ignore the fact 

[Figure 1, panels (c) Small Skiplist 20%/20%/60% and (d) Large Skiplist 20%/20%/60%; x-axis: threads (0 to 16); series: TL:CMT:PS, TL:CMT:PO, TL:Enc:PS, stm-fraser, stm-ennals, fraser CAS-based] 
Figure 1: Throughput of Skip Lists with 20% puts, 20% deletes, and 60% gets 
fraser CAS-based This is a lock-free skiplist due to Fraser
[13]. (A Java variant of this algorithm by Lea is included in
JDK 1.6.)
MS2Lock, SimpleLock Using the Mutex, Spinlock, and MCSLock
locking algorithms to implement locks, we show three variants
of Michael and Scott's concurrent queue [26], implemented
using two separate locks for the head and tail pointers, and
three additional variants of a simple implementation using a
single lock for both the head and tail.
3.1 Locking vs Non-Blocking
In our first benchmark we tested a skiplist data structure in
various configurations, varying the fraction of put, delete,
and get operations (method calls). We only show the case of
20% puts, 20% deletes, and 60% gets because all other cases
were very similar. As can be seen in Figure 1, Fraser's
hand-crafted lock-free CAS-based implementation has twice the
throughput or more of the best STMs. Of the STM methods, the
lock-based TL and Ennals STMs outperform all others. They are
twice as fast as Fraser and Harris's lock-free STM, and more
than five times faster than coarse-grained locks. Though the
single thread performance of STMs is inferior to that of
locks, the crossover point is two threads, implying that with
any concurrency one should choose the STM. This benchmark
indicates that improving both latency and single thread
performance should be a goal of future STM design. The TL
implementation with encounter order and PO locks is the best
performer on large data structures but is the first to
deteriorate as the size of the structure decreases, increasing
contention.
3.2 Encounter vs Commit and PO vs PS
In our second benchmark we tested a red-black tree data
structure in various configurations considered to be common
application usage patterns. As can be seen in Figure 2, the TL
lock-based algorithm outperforms Ennals's lock-based and
Fraser's non-blocking STMs. On large data structures under
contention (part (d)) it even outperforms Hanke's hand-crafted
implementation.
There are several interesting points to notice about these
graphs.
• Overall the TL algorithm in commit (CMT) mode using PO
locking does as well as the Ennals and TL encounter order
(ENC) algorithms.
• The performance of the Ennals encounter order algorithm
deteriorates as the data structure becomes smaller (or as the
number of modifying operations increases). Part (c) of
Figure 2 shows that this is not a fluke. The encounter order
TL algorithm exhibits the same performance drop.
• If one looks at the high contention benchmark in Figure 3,
where 80% of the operations modify the data structure and
where 72% of all transactional references are loads, one can
see that this continues to the extreme. Under high contention,
Ennals's algorithm degrades to become worse than any of the
locks, the TL in encounter order and the lock-free Harris and
Fraser STM stop scaling, the hand-crafted Hanke algorithm
starts to flatten out, and the two commit mode TL STMs
continue to scale. The scalability of the two commit mode TL
algorithms gets further support if one looks at the normalized
throughput graphs of Figure 5. It is quite clear that commit
mode TL STMs are the only ones that show overall scalability.
Our conclusion is that one should clearly not settle on
encounter order locking as the default as suggested by Saha et
al. [31], and pending investigation with a larger set of
benchmarks, it may well be that one could settle on always
using commit time lock acquisition.
• Perhaps surprisingly, abort rates seem to have little effect
on overall scalability and performance. We present sample
abort rate graphs in Figure 5 that correspond to the
normalized scalability graphs above them. As can be seen, PO
does better than PS, a conclusion that agrees with that of
Saha et al. [31]. This is true even though, as seen in the
large data structure abort rate graphs, PO introduces up to
50% more transaction failures than PS, yet the scalability of
PO is better. Moreover, as can be seen in small red-black
trees in which the failure rates increase tenfold when
compared to large ones, TL/CMT/PO and TL/ENC/PS have the same
abort rates yet TL/CMT/PO has twice the















scalability of TL/ENC/PS and twice the performance if one
looks at the graph in part (c) of Figure 2. In general, abort
rates seem to be overshadowed by the better locality of
reference (accessing the lock and object together) provided by
PO. Unfortunately, as we noted earlier, in languages like C
and C++ one must use PS mode to allow interoperability with
the normal malloc-free memory lifecycle.
[Figure 2 plots: (a) Small Red-Black Tree 5%/5%/90%; (b) Large Red-Black Tree 5%/5%/90%; (c) Small Red-Black Tree 20%/20%/60%; (d) Large Red-Black Tree 20%/20%/60%. x-axis: threads (0 to 16). Series include mutex, spinlock, mcslock, stm-fraser, stm-ennals, hanke, and TL:Enc:PS.]
Figure 2: Throughput of Red-Black Tree with 5% puts and 5% deletes and 20% puts, 20% deletes
Our third benchmark in Figure 4 shows the performance of
various locking and STM methods in implementing a shared queue
algorithm. A shared queue is a natural example of a small data
structure with high levels of contention. As we show, a TL
queue mechanically generated from sequential code delivers the
same performance as the hand-crafted Michael and Scott
two-lock algorithm (MS2Lock).
[Figure 3 plot: Red-Black Tree 40%/40%/20%, Size=50 (18% stores, 72% loads). x-axis: threads (0 to 16).]
Figure 3: Throughput of Red-Black Tree under high contention
3.3 What Makes Transactions Faster?
The graphs in Figure 5 possibly contain our most telling data.
These are graphs that depict the scalability of the various
methods by recasting the data we presented earlier in
Figures 1 and 2 at 20%/20%/60%, normalizing the graphs based
on the single thread performance. Contrary to all of our
conjectures, the STMs, and in particular TL using commit
order, have the best overall scalability, outperforming the
hand-crafted red-black tree structures (results for skiplists
were similar). As can be seen, this scalability is supported
by the fact that the overall abort rates for TL are low. This
is rather surprising, since we thought the great advantage of
hand-crafted data structures, as opposed to mechanically
generated STM code, was the programmer's ability to control
contention based on his knowledge of the data flow
relationships. For example, both the Hanke lock-based
red-black tree and the Fraser lock-free skiplist allow
traversals to ignore ongoing modifications to the data
structure. However, as seen in Figure 5, and as we found out
in similar benchmarks with 95% get operations (not presented
here), TL, as well as other STMs, scaled better than these
structures.
[Figure 4 plot: Shared Queue Results. x-axis: threads (0 to 16).]
Figure 4: Throughput of Concurrent Queue
One interesting data point we found was that our TL algorithm
in commit mode scaled, for example, three times more than
Hanke's algorithm at 16 processors, and yet both algorithms
had the same throughput. On the red-black tree, TL commit mode
scaled well in both PO and PS mode.
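To make the normalization concrete, the following minimal sketch divides each throughput by the single-thread value, which is how we read "normalizing the graphs based on the single thread performance". The throughput numbers are hypothetical, chosen only to reproduce the situation described above: two algorithms with the same throughput at 16 threads, one of which scales three times more. They are not measurements from the paper.

```python
def normalize(throughput_by_threads):
    """Return speedup relative to the single-thread throughput."""
    base = throughput_by_threads[1]
    return {threads: value / base
            for threads, value in throughput_by_threads.items()}


# Hypothetical throughputs (operations per unit time), keyed by thread count.
hand_crafted = {1: 3.0, 4: 4.5, 16: 6.0}
tl_commit = {1: 1.0, 4: 3.0, 16: 6.0}

hand_crafted_speedup = normalize(hand_crafted)
tl_commit_speedup = normalize(tl_commit)

# Same absolute throughput at 16 threads, but a threefold gap in speedup,
# which can only come from a threefold gap in single-thread throughput.
scaling_ratio = tl_commit_speedup[16] / hand_crafted_speedup[16]
```

Under these assumed numbers the better-scaling algorithm starts from a single-thread throughput one third as large, which is why single-thread overhead matters so much when comparing absolute performance.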
In conclusion, it is really the relative overheads, as can be
seen from the single thread performance numbers in Figures 1
and 2, that determine which algorithm will perform better on a
given benchmark. Our TL algorithm in commit mode is in fact
algorithmically very similar to suggested hardware transaction
schemes, implying that hardware transactions "in general" will
fail in the same cases that software ones fail. Given that
hardware transactions will lower the overheads of
transactional execution, this holds great hope that HTM-based
mechanically transformed sequential code can be as fast as, or
even faster than, hand-crafted data structures.
3.4 Summarizing the Comparison Among Approaches
Table 1 summarizes our comparison of the different methods of
constructing lock-based STMs. There are three algorithmic
elements being compared: encounter order locking of written
locations (ENC) versus commit time locking (CMT), per stripe
locking (PS) versus per object locking (PO), and validation of
the read-set on every write (VOW) or only before committing
(VBC). We compare the different methods in terms of their
compatibility with the memory lifecycle of garbage collected
languages like Java, or C programs that use a closed memory
pool, versus C programs that use only malloc and free style
allocation. The table shows which techniques work safely only
with GC or a closed pool such as Fraser's Epoch-based
reclamation scheme. The discussion from which these table
entries were derived appears in Section 2.4. We rank
performance using a scale which includes very poor, poor,
good, better, and best for any given category of data
structure and load, based on the benchmarks presented earlier
in this section. We do not show entries for the combination of
commit time locking (CMT) and validation on every write (VOW)
since VBC is significantly less costly than VOW and it
suffices for commit time locking.
We note that TL uses a versioned write-lock, but if we were to
instead use a RW lock (with so-called visible readers) then
all the VBC forms ({ENC,CMT} x {PO,PS}) will work safely with
malloc and free. In addition, RW locks don't admit so-called
zombie transactions, ones that may dereference invalid
pointers or enter infinite loops because they read an
inconsistent state. We decided against RW locks early on in
our algorithm design because they generate excessive cache
coherency traffic on traditional SMP systems.
The following is a summary of the findings the table reveals.
• A quick glance at the table reveals that the performance of
VOW schemes is very poor. We based this data on benchmarking
we performed on Moir's HyTM [27], which uses a mechanism
similar to ENC/PS/VOW in order to allow programmers to freely
use malloc and free. It is not clear to us at this point how
to categorize the work of Saha et al. [31], who use, to the
best of our understanding, ENC/PS/VBC. They make some
assumptions on the runtime/memory system that keep it closed.
• As can be seen, it would seem that ENC locking is the best
approach only on large objects using PO locking. However, ENC
delivers very poor performance on small data structures. The
CMT locking approach, on the other hand, delivers
best-of-breed performance for all objects and all concurrency
levels, and even on large uncontended objects when ENC/PO
delivers better throughput than CMT/PO. It would thus be the
best choice for languages like Java or systems that have a
closed memory system to use CMT/PO as provided by the TL
algorithm.
• It would seem that the CMT/PS used in TL is the only scheme
to deliver good performance for systems in which programmers
wish to use malloc and free style allocation. ENC/PS/VOW is
non-viable because of the overhead of the repeated validation.
We note that the throughput of CMT/PS is not as good as
CMT/PO (or ENC/PO on large unloaded structures) because of
the extra cache traffic due to the separate lock locations,
but is reasonable.
3.5 Finer Analysis of Overhead
To better understand the sources of the overhead in the TL
design, we looked at the single thread performance of our TL
algorithm. We note that HTMs attempt to cut down the costs of
both reads and writes. We wanted to find out what the benefit
of using an HTM transaction to acquire all write locks at
commit time might be. We conducted a simple benchmark in which
the TL algorithm ran on a red-black tree of size 50 with 40%
put, 40% delete, and 20% get operations in single threaded
mode, replacing all expensive CAS-based lock acquisitions with
simple reads and writes. We found that in our benchmark, with
a 1:4 ratio of transactional writes to reads, the number of
operations per second with CAS was 5.2 million, and if we
converted CAS to non-atomic reads and writes it yielded 5.8
million operations per second, an improvement of 0.6 million,
or about 10%. Even here it turned out that speeding up lock
acquisition is simply not worth it.
We then asked ourselves if eliminating the construction of a
read-set might have a significant effect. We again ran the
red-black tree benchmark but did not construct a read-set and
made only one pass through the transactional code, as would be
done by a transaction that had hardware support for
determining if the read set was consistent. Our transactional
loads still had to look aside into the write-set. The
transactional load operation fetched the lock-word and then
the data. The result was an increase of the total number of
completed operations to 8.2 million per second.
[Figure 5 plots: (a) Speedup - Small Red-Black Tree 20%/20%/60%; (b) Speedup - Large Red-Black Tree 20%/20%/60%; (c) Small Red-Black Tree 20%/20%/60%; (d) Large Red-Black Tree 20%/20%/60%. x-axis: threads (0 to 16). Series include mutex, spinlock, mcslock, stm-fraser, stm-ennals, and hanke.]
Figure 5: Normalized throughput graphs of Red-Black Tree and below them the corresponding abort rates for TL/ENC versus TL/CMT. As can be seen, the dominant scalability factor is locality of reference (PO versus PS) and not the abort rate.
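The relative size of the two single-thread effects reported in Section 3.5 can be worked out directly from the three quoted throughputs. The short sketch below does only that arithmetic. The text does not state whether the 8.2 million figure was measured with or without CAS-based lock acquisition, so the gain is computed against both baselines as an assumption.

```python
# Throughputs quoted in Section 3.5, in millions of operations per second.
WITH_CAS = 5.2          # baseline: CAS-based lock acquisition
WITHOUT_CAS = 5.8       # CAS replaced by plain reads and writes
WITHOUT_READ_SET = 8.2  # no read-set construction, single pass


def relative_gain(new, old):
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old


cas_gain = relative_gain(WITHOUT_CAS, WITH_CAS)
# Assumption: compare the read-set experiment against both baselines,
# because the source does not say which one it was run on.
read_set_gain_vs_cas = relative_gain(WITHOUT_READ_SET, WITH_CAS)
read_set_gain_vs_no_cas = relative_gain(WITHOUT_READ_SET, WITHOUT_CAS)

for label, gain in [
    ("removing CAS", cas_gain),
    ("removing read-set (vs CAS baseline)", read_set_gain_vs_cas),
    ("removing read-set (vs non-CAS baseline)", read_set_gain_vs_no_cas),
]:
    print(f"{label}: {gain:.1%}")
```

Against either baseline, dropping read-set construction yields a gain several times larger than dropping CAS, which is consistent with the observation that speeding up lock acquisition alone is not worth it.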
4. CONCLUSION 
We presented an evaluation of the factors affecting the
performance of STM algorithms. Perhaps surprisingly, we found
that the determining performance factors were the "fixed"
costs/overheads associated with STM mechanisms (such as
read-set validation), and not factors associated with
scalability (such as transaction abort rates). This led us to
the design of the transactional locking (TL) algorithm, which
tries to minimize these costs.
5. ACKNOWLEDGMENTS 
We thank Mark Moir and the anonymous Transact'06 referees for
many helpful remarks.
6. REFERENCES 
[1] AFEK, Y., ATTIYA, H., DOLEV, D., GAFNI, E., MERRITT,
M., AND SHAVIT, N. Atomic snapshots of shared memory.
J. ACM 40, 4 (1993), 873-890.
[2] AGESEN, O., DETLEFS, D., GARTHWAITE, A., KNIPPEL, R.,
RAMAKRISHNA, Y. S., AND WHITE, D. An efficient
meta-lock for implementing ubiquitous synchronization.
ACM SIGPLAN Notices 34, 10 (1999), 207-222.
[3] ANANIAN, C. S., ASANOVIC, K., KUSZMAUL, B. C.,
LEISERSON, C. E., AND LIE, S. Unbounded transactional
memory. In HPCA '05: Proceedings of the 11th
International Symposium on High-Performance Computer
Architecture (Washington, DC, USA, 2005), IEEE
Computer Society, pp. 316-327.
[4] ANANIAN, C. S., AND RINARD, M. Efficient software
transactions for object-oriented languages. In Proceedings
of Synchronization and Concurrency in Object-Oriented
Languages (SCOOL) (2005), ACM.
[5] BLOOM, B. H. Space/time trade-offs in hash coding with
allowable errors. Commun. ACM 13, 7 (1970), 422-426.
[6] BOEHM, H.-J. Space efficient conservative garbage
collection. In SIGPLAN Conference on Programming
Language Design and Implementation (1993), pp. 197-206.













































































Table 1: Comparison Table 
[7] CORMEN, T. H., LEISERSON, C. E., AND RIVEST,
R. L. Introduction to Algorithms. MIT Press, 1990.
[8] DICE, D. Implementing fast Java monitors with
relaxed-locks. In Java Virtual Machine Research and
Technology Symposium (2001), USENIX, pp. 79-90.
[9] ENNALS, R. Software transactional memory should not be
obstruction-free.
www.cambridge.intel-research.net/~rennals/notlockfree.pdf.
[10] FRASER, K. Practical lock freedom. PhD thesis, Cambridge
University Computer Laboratory, 2003. Also available as
Technical Report UCAM-CL-TR-579.
[11] HAMMOND, L., WONG, V., CHEN, M., CARLSTROM, B. D.,
DAVIS, J. D., HERTZBERG, B., PRABHU, M. K., WIJAYA,
H., KOZYRAKIS, C., AND OLUKOTUN, K. Transactional
memory coherence and consistency. In ISCA '04:
Proceedings of the 31st Annual International Symposium on
Computer Architecture (Washington, DC, USA, 2004),
IEEE Computer Society, p. 102.
[12] HANKE, S. The performance of concurrent red-black tree
algorithms. In WAE '99: Proceedings of the 3rd
International Workshop on Algorithm Engineering
(London, UK, 1999), Springer-Verlag, pp. 286-300.
[13] HARRIS, T., AND FRASER, K. Concurrent programming
without locks.
[14] HARRIS, T., AND FRASER, K. Language support for
lightweight transactions. In Proceedings of the 18th ACM
SIGPLAN Conference on Object-Oriented Programming,
Systems, Languages, and Applications (2003), ACM Press,
pp. 388-402.
[15] HARRIS, T., AND FRASER, K. Language support for
lightweight transactions. SIGPLAN Not. 38, 11 (2003),
388-402.
[16] HERLIHY, M. The SXM software package,
http://www.cs.brown.edu/~mph/sxm/readme.doc.
[17] HERLIHY, M., LUCHANGCO, V., AND MOIR, M.
Obstruction-free software transactional memory for
supporting dynamic data structures. Tech. rep.,
Sun Microsystems, May 2002.
[18] HERLIHY, M., LUCHANGCO, V., MOIR, M., AND SCHERER
III, W. N. Software transactional memory for
dynamic-sized data structures. In Proceedings of the
Twenty-Second Annual Symposium on Principles of
Distributed Computing (2003), ACM Press, pp. 92-101.
[19] HERLIHY, M., AND MOSS, E. Transactional memory:
Architectural support for lock-free data structures. In
Proceedings of the Twentieth Annual International
Symposium on Computer Architecture (1993).
[20] KRUSKAL, C., RUDOLPH, L., AND SNIR, M. Efficient
synchronization of multiprocessors with shared memory.
ACM Transactions on Programming Languages and
Systems 10, 4 (1988), 579-601.
[21] KUMAR, S., CHU, M., HUGHES, C., KUNDU, P., AND
NGUYEN, A. Hybrid transactional memory. To appear in
PPoPP 2006 (2006).
[22] MARATHE, V. J., SCHERER, W. N., AND SCOTT, M. L.
Design tradeoffs in modern software transactional memory
systems. In Proceedings of the 7th Workshop on
Languages, Compilers, and Run-time Support for Scalable
Systems (LCR '04) (2004).
[23] MARATHE, V. J., SCHERER, W. N., AND SCOTT, M. L.
Adaptive software transactional memory. To appear in
the Proceedings of the 19th International Symposium on
Distributed Computing (DISC '05) (2005).
[24] MELLOR-CRUMMEY, J., AND SCOTT, M. Algorithms for
scalable synchronization on shared-memory
multiprocessors. ACM Transactions on Computer Systems
9, 1 (1991), 21-65.
[25] MICHAEL, M. M. Hazard pointers: Safe memory
reclamation for lock-free objects. IEEE Trans. Parallel
Distrib. Syst. 15, 6 (2004), 491-504.
[26] MICHAEL, M. M., AND SCOTT, M. L. Simple, fast, and
practical non-blocking and blocking concurrent queue
algorithms. In Symposium on Principles of Distributed
Computing (1996), pp. 267-275.
[27] MOIR, M. HybridTM: Integrating hardware and software
transactional memory. Tech. Rep. Archivist 2004-0661, Sun
Microsystems Research, August 2004.
[28] PUGH, W. A skip list cookbook. Tech. rep., College Park,
MD, USA, 1990.
[29] RAJWAR, R. Transactional memory online,
http://www.cs.wisc.edu/trans-memory.
[30] RAJWAR, R., HERLIHY, M., AND LAI, K. Virtualizing
transactional memory. In ISCA '05: Proceedings of the
32nd Annual International Symposium on Computer
Architecture (Washington, DC, USA, 2005), IEEE
Computer Society, pp. 494-505.
[31] SAHA, B., ADL-TABATABAI, A.-R., HUDSON, R. L., MINH,
C. C., AND HERTZBERG, B. A high performance software
transactional memory system for a multi-core runtime.
To appear in PPoPP 2006 (2006).
[32] SHALEV, O., AND SHAVIT, N. Predictive
log-synchronization. In EuroSys 2006, to appear (2006).
[33] SHAVIT, N., AND TOUITOU, D. Software transactional
memory. Distributed Computing 10, 2 (February 1997),
99-116.
[34] WELC, A., JAGANNATHAN, S., AND HOSKING, A. L.
Transactional monitors for concurrent objects. In
Proceedings of the European Conference on
Object-Oriented Programming (2004), vol. 3086 of Lecture
Notes in Computer Science, Springer-Verlag, pp. 519-542.















































Debugging with Transactional Memory 
Yossi Lev 
Brown University & Sun Microsystems Laboratories 
Mark Moir 
Sun Microsystems Laboratories 
ABSTRACT 
Transactional programming promises to substantially simplify
the development of correct, scalable, and efficient concurrent
programs. Designs for supporting transactional programming
using transactional memory implemented in hardware, software,
and a mixture of the two have emerged recently. To our
knowledge, nobody has yet addressed issues involved with
debugging programs executed using transactional memory.
Because transactional memory implementations provide the
"illusion" of multiple memory locations changing value
atomically, while in fact they do not, there are challenges
involved with integrating debuggers with such programs to
provide the user with a coherent view of program execution.
This paper shows how to overcome these problems by making the
debugger interact with transactional memory implementations in
a meaningful way. In addition to describing how "standard"
debugging functionality can be integrated with transactional
memory implementations, we also describe some powerful new
debugging mechanisms that are enabled by transactional memory
infrastructure. Our description focuses on how to enable
debugging in software and hybrid software-hardware
transactional memory systems.
(c) Sun Microsystems, Inc. 2006. All rights reserved.
1. INTRODUCTION
In concurrent software it is often important to guarantee that
one thread cannot observe partial results of an operation
being executed by another thread. These guarantees are
necessary for practical and productive software development
because, without them, it is extremely difficult to reason
about the interactions of concurrent threads. In today's
software practice, these guarantees are almost always provided
by using locks to prevent other threads from accessing the
data affected by an ongoing operation. Such use of locks gives
rise to a number of well known problems, both in terms of
software engineering and in terms of performance.
Transactional memory (TM) [7, 16] allows the programmer to
think as if multiple memory locations can be accessed and/or
modified in a single atomic step. Thus, in many cases, it is
possible to complete an operation with no possibility of
another thread observing partial results, even without holding
any locks. This significantly simplifies the design of
concurrent programs.
Transactional memory can be implemented in hardware [7], with
the hardware directly ensuring that a transaction is atomic,
or in software [16] that provides the "illusion" that the
transaction is atomic, even though in fact it is executed in
smaller atomic steps by the underlying hardware. Substantial
progress has been made in making software transactional memory
(STM) practical recently [2, 3, 6, 10]. Nonetheless, there is
a growing consensus that at least some hardware support for
transactional memory is desirable, and several proposals for
supporting TM in hardware have emerged recently [1, 4, 13].
All existing proposals for implementing TM in hardware either
impose severe limitations on programmers or are too
complicated and inflexible to be considered in the near
future, and also leave a number of issues unresolved. To
address this situation, we have proposed Hybrid TM (HyTM)
[11], which provides a fully functional STM implementation
that can exploit best-effort HTM support to boost performance
if it is available and when it is effective. Kumar et al. [8]
have recently made a similar proposal.
To our knowledge, none of the TM designs (HTM, STM, or HyTM)
proposed to date addresses the issue of debugging programs
that use them. While TM promises to substantially simplify the
development of correct concurrent programs, programmers will
still need to debug code while it is under development, and
therefore it is crucial that we develop robust TM-compatible
debugging mechanisms.
Debugging poses challenges for all forms of TM. If HTM is to
provide support for debugging, it will be even more
complicated than current proposals. STM on the other hand
provides the "illusion" that transactions are executed
atomically, while in fact they are implemented by a series of
smaller steps. If a standard debugger were used with an STM
implementation, it would expose this illusion, creating
significant confusion for programmers. HyTM is potentially
susceptible to both problems. In this paper, we describe a
series of mechanisms for supporting debugging in STM and HyTM
systems. In keeping with the HyTM philosophy, we do not impose
any requirement on HTM support for debugging.
For concreteness we describe the debugging techniques in the
context of a simple word-based HyTM system, such as described
in [11]. In Section 2 we give a brief overview of this HyTM
system. In Section 3, we describe several debug modes which
will aid in the description of our debugging techniques.
Section 4 presents debugging techniques in the following
topics:
• Breakpoints in atomic blocks.





2. A WORD-BASED HYTM SCHEME
2.1 Overview
The HyTM system [11] comprises a compiler, a library for
supporting transactions in software, and (optionally) HTM
support. Programmers express blocks of code that should
(appear to) be executed atomically in some language-specific
notation. For concreteness, we assume the following simple
notation:
atomic {
    ...
    code to be executed atomically
}
For each such atomic block, the compiler produces code to
execute the code block atomically using transactional support.
A typical HyTM approach is to produce code that attempts to
execute the block one or more times using HTM, and if that
does not succeed, to repeatedly attempt to do so using the STM
library.
The compiler also produces "glue" code that hides this
retrying from the programmer, and invokes "contention
management" mechanisms [6, 15] when necessary to facilitate
progress. Such contention management mechanisms may be
implemented, for example, using special methods in the HyTM
software library. These methods may make decisions such as
whether a transaction that encounters a potential conflict
with a concurrent transaction should a) abort itself, b) abort
the other transaction, or c) wait for a short time to give the
other transaction an opportunity to complete. As we will see,
debuggers may need to interact with contention control
mechanisms to provide a meaningful experience for users.
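The retry behaviour of the glue code described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the HyTM compiler's actual output: `run_atomic`, `try_in_hardware`, `try_in_software`, `Decision`, and `simple_contention_manager` are hypothetical names, the two back ends are passed in as plain callables, and the toy policy is only one of many a real contention manager might implement.

```python
from enum import Enum


class Decision(Enum):
    """The three choices a contention manager may make on a conflict."""
    ABORT_SELF = 1
    ABORT_OTHER = 2
    WAIT = 3


def simple_contention_manager(conflicts_seen, wait_limit=2):
    """Toy policy (an assumption): wait a few times, then abort the other."""
    if conflicts_seen <= wait_limit:
        return Decision.WAIT
    return Decision.ABORT_OTHER


def run_atomic(block, try_in_hardware, try_in_software, htm_attempts=2):
    """Glue code: retry `block` until one attempt commits.

    `try_in_hardware` and `try_in_software` are caller-supplied callables
    that run `block` once and return a pair (committed, result).
    """
    for _ in range(htm_attempts):
        committed, result = try_in_hardware(block)
        if committed:
            return result
    while True:  # fall back to the software path and keep retrying
        committed, result = try_in_software(block)
        if committed:
            return result


# Usage with stand-in back ends that only record what was attempted.
log = []


def failing_htm(block):
    log.append("htm")
    return False, None          # pretend the hardware attempt aborted


def flaky_stm(block):
    log.append("stm")
    if log.count("stm") < 2:
        return False, None      # first software attempt aborts
    return True, block()        # second software attempt commits


answer = run_atomic(lambda: 42, failing_htm, flaky_stm)
```

In a fuller sketch the software path would consult the contention manager each time it detects a conflict. It is kept separate here so that the retry structure stays visible.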
Because the above-described approach may result in 
the concurrent execution of transactions in hardware 
and in software, we must ensure correct interaction of 
these transactions. The HyTM approach is to have the 
compiler emit additional code in the hardware trans- 
action that looks up structures maintained by software 
transactions in order to detect any potential conflict. 
In case such a conflict is detected, the hardware trans- 
action is aborted, and is subsequently retried, either in 
hardware or in software. Below we explain how software 
transactions provide the illusion of atomicity, and how 
hardware transactions are augmented to detect poten- 
tial conflicts with software ones. 
2.2 Transactional Execution 
As a software transaction executes, it acquires "ownership" of
each memory location that it accesses: exclusive ownership in
the case of locations modified, and possibly shared ownership
in the case of locations read but not modified. This ownership
cannot be revoked while the owning transaction is in the
active state: a second transaction that wishes to acquire
exclusive ownership of a location already owned by the first
transaction must first abort the transaction by changing its
status to aborted. Furthermore, a location can be modified
only by a transaction that owns it. However, rather than
modifying the locations directly while executing, the
transaction "buffers" its modifications in a "write set".
Thus, if a transaction reaches its end without being aborted,
then all of the locations it accessed have maintained the same
values since they were first accessed. The transaction
atomically switches its status from active to committed,
thereby logically applying the changes in its write set to the
respective memory locations it accessed. Before releasing
ownership of the modified locations, the transaction copies
back the values from its write set to the respective memory
locations so that subsequent transactions acquiring ownership
of these locations see the new values.
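The status switch and write-set buffering described in this section can be illustrated with a small single-process model. It is a sketch under several assumptions: the `Transaction` class and its method names are invented for illustration, ownership acquisition is omitted entirely, a `threading.Lock` stands in for the atomic status switch, and the rule that a transaction reads its own buffered writes is assumed rather than stated in the text.

```python
import threading

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"


class Transaction:
    """Toy software transaction that buffers writes until commit."""

    def __init__(self, memory):
        self.memory = memory            # shared dict: location -> value
        self.status = ACTIVE
        self.write_set = {}             # buffered modifications
        self._status_lock = threading.Lock()  # stand-in for an atomic switch

    def read(self, location):
        # Assumption: a transaction sees its own buffered writes first.
        if location in self.write_set:
            return self.write_set[location]
        return self.memory[location]

    def write(self, location, value):
        self.write_set[location] = value    # memory itself is untouched

    def _switch_status(self, expected, new):
        with self._status_lock:
            if self.status != expected:
                return False
            self.status = new
            return True

    def abort(self):
        """May be called by a conflicting transaction while we are active."""
        return self._switch_status(ACTIVE, ABORTED)

    def commit(self):
        # The status switch is the point at which the changes logically apply.
        if not self._switch_status(ACTIVE, COMMITTED):
            return False
        self.memory.update(self.write_set)  # copy-back before releasing ownership
        return True


memory = {"x": 1, "y": 2}

first = Transaction(memory)
first.write("x", 10)
seen_in_memory_before_commit = memory["x"]   # still the old value
first_committed = first.commit()

second = Transaction(memory)
second.write("y", 20)
second.abort()                               # e.g. aborted by a competitor
second_committed = second.commit()
```

Between the status switch and the end of the copy-back, memory holds a mix of old and new values. This is the window that ownership hides from other transactions and that a debugger unaware of the scheme would expose.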
2.3 Ownership 
In the word-based HyTM scheme described here, there 
is an ownership record (henceforth orec) associated with 
each transactional location (i.e., each memory location 
that can be accessed by a transaction). To avoid the ex- 
cessive space overhead that would result from dedicating 
one orec to each transactional location, we instead use 
a special orec table. Each transactional location maps 
to one orec in the orec table, but multiple locations 
can map to the same orec. To acquire ownership of a 
transactional location, a transaction acquires the cor- 
responding orec in the orec table. The details of how 
ownership is represented and maintained are mostly ir- 
relevant here. We do note, however, that the orec con- 
tains an indication of whether it is owned, and if so 
whether in "read" or "write" mode. These indications 
are the key to how hardware transactions are augmented 
to detect conflicts with software ones. For each memory 
access in an atomic block to be executed by a hardware 
transaction, the compiler emits additional code for the 
hardware transaction to look up the corresponding orec
and determine whether there is (potentially) a conflict- 
ing software transaction. If so, the hardware transac- 
tion simply aborts itself. By storing an indication of 
whether the orec is owned in read or write mode, we 
allow a hardware transaction to succeed even if it ac- 
cesses one or more memory locations in common with 
one or more concurrent software transactions, provided 










































As described above, the illusion of atomicity is provided
by considering the updates made by a transaction to
"logically" take effect at the point at which it commits,
known as the transaction's linearization point [5].
By preventing transactions from observing the values of
transactional locations that they do not own, we hide
the reality that the changes to these locations are in
fact made one by one after the transaction has already
committed.
If we use such an STM or HyTM package with a stan- 
dard debugger, the debugger will not respect these own- 
ership rules. Therefore, for example, it might display 
a pre-transaction value in one memory location and a 
post-transaction value in another location that is up- 
dated by the same transaction. This would "break" the 
illusion of atomicity, which would severely undermine 
the user's ability to reason about the program.
Furthermore, a standard debugger would not deal in
meaningful ways with the multiple code paths used to
execute transactions in hardware and in software, or 
library calls for supporting software transactions, con- 
tention management, etc. In this paper, we explain how 
to address all of these issues. We also explain how the 
infrastructure for STM and HyTM can support some 
powerful new debugging mechanisms. 
3. DEBUG MODES 
In this document we distinguish between three basic
debug modes:
Unsynchronized Debugging: In this mode, when a
thread stops (when hitting a breakpoint, for example),
the rest of the threads keep running.
Synchronized Debugging: In this mode, if a thread stops,
the rest of the threads also stop with it. There are two
synchronized debugging modes:
- Concurrent Stepping: In this mode, when the
user asks the debugger to run one step of a
thread, the rest of the threads also run while
this step is executed (and stop again when the
step is completed, as this is a synchronized
debugging mode).
- Isolated Stepping: In this mode, when the
user asks the debugger to run one step of a
thread, only that thread's step is executed.
For simplicity, we assume that the debugger is attached
to only one thread at a time, which we denote
as the debugged thread. If the debugged thread is in
the middle of executing a transaction, we denote this
transaction as the debugged transaction. When a thread
stops at a breakpoint, it automatically becomes the
debugged thread. Note that with the synchronized
debugging modes, after hitting a breakpoint the user can
choose to change the debugged thread, by switching to
debug another thread.
4. DEBUGGING TECHNIQUES 
4.1 Breakpoints in Atomic Blocks 
The ability to stop the execution of a program on a
breakpoint and to run a thread step by step is a fundamental
feature of any debugger. In a transactional program,
a breakpoint will sometimes reside in an atomic
block. In this section we describe a technique that enables
the debugger to stop and step through such a block
in the HyTM system, wherein an atomic block may have
at least two implementations, for example, one that uses
HTM and another that uses STM.
In keeping with the HyTM philosophy, we do not assume
that any special debugging capability is provided
by the HTM support. Therefore, if the user sets a
breakpoint inside an atomic block, in order to debug
that atomic block, we must disable the code path that
attempts to execute this particular atomic block using
HTM,¹ thereby forcing it to be executed using STM. If
we cannot determine whether a given atomic block contains
a breakpoint (for example, in the presence of indirect
function calls), we can simply abort the executing
hardware transaction when it reaches the breakpoint,
eventually causing the atomic block to be executed by
a software transaction.
One way to disable the HTM code path is to modify
the code for the transaction so that it branches unconditionally
to the software path, rather than attempting
the hardware transaction. In HyTM schemes in which
the decision about whether to try to execute a transaction
in hardware or in software is made by a method in
the software library, the code can be modified to omit
this call and branch directly to the software path. An
alternative approach is to provide the debugger with an
interface to the software library so that it can instruct
the software method to always choose the software path
for a given atomic block.
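The library-interface alternative might look like the following sketch. All names here (dbg_force_software_path, choose_path, the per-block identifier) and the retry threshold of three failed hardware attempts are hypothetical; the text above does not specify how the library makes its decision.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ATOMIC_BLOCKS 64
#define HW_RETRY_LIMIT 3   /* assumed policy, for illustration only */

/* One flag per atomic block, set by the debugger (hypothetical state). */
static bool force_software[MAX_ATOMIC_BLOCKS];

enum tx_path { PATH_HARDWARE, PATH_SOFTWARE };

/* Debugger-facing call: pin one atomic block to the software path. */
void dbg_force_software_path(int block_id, bool on)
{
    assert(block_id >= 0 && block_id < MAX_ATOMIC_BLOCKS);
    force_software[block_id] = on;
}

/* Library decision made before each attempt to run an atomic block.
 * hw_failures counts failed hardware attempts for this execution. */
enum tx_path choose_path(int block_id, int hw_failures)
{
    assert(block_id >= 0 && block_id < MAX_ATOMIC_BLOCKS);
    if (force_software[block_id])
        return PATH_SOFTWARE;   /* a breakpoint is set inside this block */
    return hw_failures < HW_RETRY_LIMIT ? PATH_HARDWARE : PATH_SOFTWARE;
}
```

Because the flag is kept per atomic block, other blocks continue to use the hardware path, which keeps the debugger's effect on program timing small.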
In addition to disabling the hardware path, we must
also enable the breakpoint in the software path. This
is done mostly in the same way that breakpoints
are implemented in standard debuggers. However, there are
some issues to note.
First, the correspondence between the source code
and the STM-based implementation of an atomic block
differs from the usual correspondence between source
and assembly code: the STM-based implementation uses
the STM library functions for read and write operations
in the block, and may also use other function calls to
correctly manage the atomic block execution. For example,
it is sometimes necessary to invoke the STM library
method STM_Validate in order to verify that the values
read by the transaction so far represent a consistent
state of the memory. Figure 1 shows an example of an
STM-based implementation of a simple atomic block.
The debug information generated by the compiler should
reflect this special correspondence to support a meaningful
debugging view for users. When the user is stepping
in source-level mode, all of these details will be
hidden, just as assembly-level instructions are hidden
from the user when debugging in source-level mode with
¹ We do not want to disable all use of HTM in the program,
because we wish to minimize the impact on program
timing in order to avoid masking bugs.
atomic {
    v = node->next->value;
}

while (true) {
    tid = STM_begin_tran();
    tmp = STM_read(tid, &node);
    if (STM_Validate(tid)) {
        tmp = STM_read(tid, &(tmp->next));
        if (STM_Validate(tid)) {
            tmp2 = STM_read(tid, &(tmp->value));
            STM_write(tid, &v, tmp2);
        }
    }
    if (STM_commit_tran(tid)) break;
}
Figure 1: An example of an atomic block and its
STM-based implementation.
a standard debugger. However, when the user is step- 
ping in assembly-level mode, all STM function calls are 
visible to the user, but should be regarded as atomic as- 
sembly operations: stepping into these functions should 
not be allowed. 
Another issue is that control may return to the begin- 
ning of an atomic block if the transaction implementing 
it is aborted. Without special care, this may be con- 
fusing for the user: it will look like "a step backward". 
In particular, in response to the user asking to execute 
a single step in the middle of an atomic block, control 
may be transferred to the beginning of the atomic block 
(which might reside in a different function or file). In 
such cases the debugger may prompt the user with a 
message indicating that the atomic block execution has 
been restarted due to an aborted transaction. 
Finally, it might be desirable for the debugger to call
STM_Validate right after it hits a breakpoint, to verify
that the transaction can still commit successfully.
This is because, with some HyTM implementations, a
transaction might continue executing even after it has
encountered a conflict that will prevent it from committing
successfully. While the HyTM must prevent
incorrect behavior (such as dereferencing a null pointer
or dividing by zero) in such cases, it does not necessarily
prevent a code path from being taken that would not
have been taken if the transaction were still "viable".
In such cases, it is probably not useful for the user to
believe that such a code path was taken, as the transaction
will fail and be retried anyway. The debugger can
avoid such "false positives" by calling STM_Validate after
hitting the breakpoint, and ignoring the breakpoint if
the transaction is no longer viable.
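A minimal sketch of this breakpoint filter follows. How validation works is not specified in the text; the version-counter read set used here is an assumption chosen to keep the example small, and tx_validate is only a stand-in for the library's validation call. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RS_MAX 8

/* One read-set entry: which version counter was read, and what it held. */
struct read_entry { const unsigned *version; unsigned seen; };

struct tx {
    struct read_entry rs[RS_MAX];
    size_t rs_len;
};

/* Record the version observed when the transaction read a location. */
void tx_note_read(struct tx *t, const unsigned *version)
{
    assert(t->rs_len < RS_MAX);
    t->rs[t->rs_len].version = version;
    t->rs[t->rs_len].seen = *version;
    t->rs_len++;
}

/* Stand-in for validation: every location read must be unchanged. */
bool tx_validate(const struct tx *t)
{
    for (size_t i = 0; i < t->rs_len; i++)
        if (*t->rs[i].version != t->rs[i].seen)
            return false;
    return true;
}

/* Debugger hook run when a breakpoint inside an atomic block is hit.
 * Returns true if the debugger should actually stop there. */
bool should_stop_at_breakpoint(const struct tx *t)
{
    return tx_validate(t);   /* a doomed transaction is silently skipped */
}
```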
The debugger may also provide a feature that allows
the user to abort the debugged transaction, with the
option to either retry it from the beginning, or perhaps to
skip it altogether and resume execution after the atomic
block. Such functionality is straightforward to provide
because the compiler already includes code for transferring
control for retry or commit, and because most
TM implementations provide means for a transaction to
explicitly abort itself.
4.1.1 Contention Manager Support
When stepping through an atomic block, it might 
be useful to change the way in which conflicts are re- 
solved between transactions, for example by making the 
debugged transaction win any conflict it might have 
with other transactions. We call such a transaction a 
super-transaction. This feature is crucial for the iso- 
lated stepping synchronized debugging mode because 
the debugged thread takes steps while the rest of the 
threads are not executing, and therefore there is no 
point in waiting in case of a conflict with another thread, 
nor in aborting the debugged transaction. It may also 
be useful in other debugging modes, because it will 
avoid the debugged transaction being aborted, causing 
the "backward-step" phenomenon previously described. 
This is especially important because the debugged trans- 
action will probably run much slower than other trans- 
actions, and therefore is more likely to be aborted. 
In some STM and HyTM implementations, particu- 
larly those supporting read sharing, orecs indicate only 
that they are owned in read mode, and do not indi- 
cate which transactions own them in that mode (with 
these implementations, transactions record which loca- 
tions they have read, and recheck the orecs of all such 
locations before committing to ensure that none has 
changed). Supporting the super-transaction with these 
implementations might seem problematic, since when a 
transaction would like to get write ownership on an orec 
currently owned in read mode, it needs to know whether 
one of the readers owning this orec is a super-transaction.
One simple solution is to specially mark the orecs of 
all locations read so far by the debugged transaction 
upon hitting a breakpoint, and to continue marking 
orecs newly acquired in read mode as the transaction 
proceeds. The STM library and/or its contention man- 
ager component would then ensure that a transaction 
never acquires write ownership of an orec that is cur- 
rently owned by the super-transaction. 
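The marking scheme can be sketched as below. The orec layout, the function names, and the simplified conflict policy for ordinary owners are all assumptions made for illustration; a real contention manager would have more options than simply refusing.

```c
#include <assert.h>
#include <stdbool.h>

enum orec_mode { OREC_FREE, OREC_READ, OREC_WRITE };

struct orec {
    enum orec_mode mode;
    bool super_reader;   /* set by the debugger: read by the super-transaction */
};

/* Debugger: mark an orec already held in read mode by the debugged
 * transaction (called for each such orec when the breakpoint is hit,
 * and again for each orec acquired in read mode afterwards). */
void dbg_mark_super(struct orec *o)
{
    assert(o->mode == OREC_READ);
    o->super_reader = true;
}

/* Debugger: undo the marking, e.g. when the debugged thread changes. */
void dbg_unmark_super(struct orec *o)
{
    o->super_reader = false;
}

/* Library: another transaction asks for write ownership.
 * Returns true if ownership was granted. */
bool try_acquire_write(struct orec *o)
{
    if (o->super_reader)
        return false;        /* never displace the super-transaction */
    if (o->mode == OREC_WRITE)
        return false;        /* toy policy: existing writer keeps the orec */
    o->mode = OREC_WRITE;    /* free, or held only by ordinary readers */
    return true;
}
```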
4.1.2 Switching between Debugged Threads 
When stopping at a breakpoint, the thread that hit 
that breakpoint automatically becomes the debugged 
thread. In some cases though, the user would like to 
switch to debug another thread after the debugger has 
stopped on the breakpoint. This is particularly useful 
when using the isolated stepping synchronized debugging
mode, because in this case the user has total control over 
all the threads, and can therefore simulate complicated 
scenarios of interaction between the threads by taking 
a few steps with each thread separately. 
There are a few issues to consider when switching 
between debugged threads. The first has to do with 
hardware transactions when using HyTM: it might be 
that the new debugged thread is in the middle of ex- 
ecuting the HTM-based implementation of an atomic 
block. Depending on the HTM implementation, attaching
the debugger to such a thread may cause the hardware
transaction to abort. Moreover, as HTM is
not assumed to provide any specific support for debug- 
ging, we will often want to abort the hardware transac- 
tion anyway, and restart the atomic block's execution 
using the STM-based implementation. 
Again, depending on the HTM support available, var- 
ious alternatives may be available, and it may be useful 
to allow users to choose between such alternatives, ei- 
ther through configuration settings, or each time the 
decision is to be made. Possible actions include: 
1. Switch to the new thread, aborting its transaction.
2. Switch to the new thread but only after it has 
completed (successfully or otherwise) the transac- 
tion (this might be implemented for example by 
appropriate placement of additional breakpoints). 
3. Cancel and stay with the old debugged thread. 
Another issue to consider is the combination of the
super-transaction feature and the ability to switch the
debugged thread. Generally it makes sense to have only
one super-transaction at a time. If the user switches
between threads, it is probably desirable to change the
previously debugged transaction back to be a regular
transaction, and make the new debugged transaction a
super-transaction. As described above, this may require
unmarking all orecs owned in read mode by the old
debugged transaction, and marking those of the new one.
4.2 Viewing and Modifying Variables 
Another fundamental feature supported by all debug- 
gers is the ability to view and modify variables when 
the debugger stops execution of the program. The user 
provides a variable name or a memory address, and the 
debugger displays the value stored there and may also 
allow the user to change this value. As explained ear- 
lier, in various TM implementations, particularly those 
based on STM or HyTM approaches, the current logical 
value of the address or variable may differ from the value 
stored in it. In such cases, the debugger cannot deter- 
mine a variable's value by simply reading the value of 
the variable from memory. The situation is even worse 
with value modifications: in this case, simply writing 
a new value to the specified variable may violate the 
atomicity of transactions currently accessing it. In this 
section we explain how the debugger can view and mod- 
ify data in a TM-based system despite these challenges. 
The key idea is to access variables that may be ac- 
cessed by transactions using the TM implementation, 
rather than directly, in order to avoid the above-described
problems. However, there are several important issues 
to consider in deciding whether to access a variable us- 
ing a transaction, and if so, with which transaction. 
First, the debugged program may contain transactional
variables that should be accessed using TM and
nontransactional variables that can be accessed directly
using conventional techniques. A variety of techniques
for distinguishing these variables exist, including type-based
rules enforced by the compiler, as well as dynamic
techniques that determine and possibly change the status
of a variable (transactional or nontransactional) at
runtime (for example, [9]). Whichever technique is used
in a particular system, the debugger must be designed
to take the technique into account and access variables
using the appropriate method. In particular, the debugger
should always use transactions to access transactional
variables, and nontransactional variables can
be accessed as in a standard debugger.²
For transactional variables, one option is for the debugger
to get or set the variable value by executing
a "mini-transaction", that is, a transaction that consists
of the single variable access. The mini-transaction
might be executed as a hardware transaction or as a 
software transaction, or it may follow the HyTM ap- 
proach of attempting to execute it in hardware, but 
retrying as a software transaction if the hardware trans- 
action fails to commit or detects a conflict with a soft- 
ware transaction. 
If, however, the debugger has stopped in the mid- 
dle of an atomic block execution, and the variable to 
be accessed has already been accessed by the debugged 
transaction, then it is often desirable to access the spec- 
ified variable from the debugged transaction's "point of 
view". For example, if the debugged transaction has 
written a value to the variable, then the user may de- 
sire to see the value it has stored, even though the trans- 
action has not yet committed, and therefore this value 
is not (yet) the value of the variable being examined. 
Similarly, if the user requests to modify the value of a 
variable that has been accessed by the debugged trans- 
action, then it may be desirable for this modification to 
be part of the effect of the transaction when it commits. 
To support this behavior, the variable can be accessed in 
the context of the debugged transaction simply by call- 
ing the appropriate library function. (We note that it is 
straightforward to extend existing HyTM and STM im- 
plementations to support functionality that determines 
whether a particular variable has been modified by a 
particular transaction.) 
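The choice between the two access routes described in this section can be sketched as follows. This toy model is single-threaded, so the mini-transaction is reduced to a plain load, and only buffered writes are consulted (a variable that the debugged transaction has merely read has the same value either way here). The names dbg_view, tx_lookup_write, and mini_tx_read are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WS_MAX 8

struct ws_entry { int *addr; int value; };
struct tx { struct ws_entry ws[WS_MAX]; size_t ws_len; };

enum view_source { FROM_DEBUGGED_TX, FROM_MINI_TX };

/* Stand-in for a one-access transaction; a plain load in this toy. */
int mini_tx_read(const int *addr)
{
    return *addr;
}

/* Returns true and fills *out if t has a buffered store to addr. */
bool tx_lookup_write(const struct tx *t, const int *addr, int *out)
{
    for (size_t i = 0; i < t->ws_len; i++) {
        if (t->ws[i].addr == addr) {
            *out = t->ws[i].value;
            return true;
        }
    }
    return false;
}

/* Debugger: value of *addr as the user should see it. debugged may be
 * NULL when the debugged thread is not inside a transaction. */
int dbg_view(const struct tx *debugged, const int *addr,
             enum view_source *src)
{
    int v;
    assert(addr != NULL && src != NULL);
    if (debugged != NULL && tx_lookup_write(debugged, addr, &v)) {
        *src = FROM_DEBUGGED_TX;   /* the transaction's point of view */
        return v;
    }
    *src = FROM_MINI_TX;           /* do not grow the debugged tx's footprint */
    return mini_tx_read(addr);
}
```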
Note that it is still better to access variables that
were not accessed by the debugged transaction using 
mini-transactions and not the debugged transaction it- 
self. This is because accessing such variables using the 
debugged transaction increases the set of locations that 
the transaction is accessing, thereby making it more 
likely to abort due to a conflict with another transac- 
tion. 
In general, it is preferable that actions of the debug- 
ger have minimal impact on normal program execution. 
For example, we would prefer to avoid aborting transactions
of the debugged program in order to display
values of variables to the user. However, we must pre- 
serve the atomicity of program transactions. In some 
cases, it may be necessary to abort a program transac- 
tion in order to service the user's request. For example, 
if the user requests to modify a value that has been 
accessed by an existing program transaction, then the 
mini-transaction used to effect this modification may 
conflict with that program transaction. Furthermore, 
² In some TM systems, accessing a nontransactional
variable using a transaction will not result in incorrect 
behavior, in which case we can choose to access all vari- 












































some STM and HyTM implementations are susceptible 
to false conflicts in which two transactions conflict even 
though they do not access any variables in common. 
In case the mini-transaction used to implement a user 
request does conflict with a program transaction, sev- 
eral alternatives are possible. We might choose either 
to abort the program transaction, or to wait for it to 
complete (in appropriate debugging modes), or to aban- 
don the attempted modification. These choices may be 
controlled by preferences configured by the user, or by 
prompting the user to decide between them when the 
situation arises. In the latter case, various information 
may be provided to the user, such as which program 
transaction is involved, what variable is causing the con- 
flict (or an indication that it is a false conflict), etc. 
In some cases, the STM may provide special-purpose 
methods for supporting mini-transactions for debugging. 
For example, if all threads are stopped, then the debug- 
ger can modify a variable that is not being accessed 
by any transaction without acquiring ownership of its 
associated orec. Therefore in this case, if the STM im- 
plementation can tell the debugger whether a given vari- 
able is being accessed by a transaction, then the debug- 
ger can avoid acquiring ownership and aborting another 
transaction due to a false conflict. 
4.2.1 Adding and Removing a Variable from the 
Transaction's Access Set 
As described in the previous section, it is often prefer- 
able to access variables that do not conflict with the de- 
bugged transaction using independent mini-transactions. 
In some cases, however, it may be useful to allow the 
user to access a variable as part of the debugged trans- 
action even if the transaction did not previously access 
that variable. This way, the transaction would com- 
mit only if the variable viewed does not change before 
the transaction attempts to commit, and any modifica- 
tions requested by the user would commit only if the 
debugged transaction commits. This approach provides 
the user with the ability to "augment" the transaction
with additional memory locations. 
Moreover, some TM implementations support early- 
release functionality [6]: with early-release, the pro- 
grammer can decide to discard any previous accesses 
done to a variable by the transaction, thereby avoiding 
subsequent conflicts with other transactions that mod- 
ify the released variable. If early-release is supported by 
the TM implementation, the debugger can also support 
removing a variable from the debugged-transaction's ac- 
cess set. 
4.2.2 Displaying the pre-transaction value of the 
debugged transaction 
Although when debugging an atomic block the user 
would usually prefer to see variables as they would be 
seen by the debugged transaction, in some cases it might 
be useful to see the value as it was before the transac- 
tion began (note that since the debugged transaction 
has not committed yet, this pre-transaction value is the 
current logical value of the variable, as may be seen by 
other threads). Some STM implementations can easily 
provide such functionality because they record the value 
of all variables accessed by a transaction the first time 
they are accessed. In other STM implementations, the 
pre-transaction value is kept in the variable itself until 
the transaction commits, and can thus be read directly 
from the variable. In such systems, the debugger can 
display the pre-transaction value of a variable (as well 
as the regular value seen by the debugged transaction). 
4.2.3 Getting values from conflicting transactions
In some cases, it is possible to determine the logical 
value of a variable even if it is currently being modi- 
fied by another transaction. As described above, it may 
be possible for the debugger to get the pre-transaction 
value of a variable accessed by a transaction. If the de- 
bugger can determine that the conflicting transaction's 
linearization point has not passed, then it can display 
the pre-transaction value to the user. How such a deter- 
mination can be made depends on the particular STM 
implementation, but in many cases this is not difficult. 
Another potentially useful piece of information we can
get from the transaction that owns the variable the user
is trying to view is the tentative value of that variable,
that is, the value as seen by the transaction that owns
the variable. Specifically, the debugger can inform the
user that the variable is currently accessed by a software
transaction, and give the user both the current logical
value of the variable (that is, its pre-transaction value),
and its tentative value (which will be the variable's
value when and if the transaction commits successfully).
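The relationship between the owner's status and the two values can be sketched as follows. The record layout and the names (owned_var, var_report, dbg_inspect) are hypothetical; we assume an STM that can supply both the pre-transaction value and the owner's buffered value for a variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tx_status { TX_ACTIVE, TX_COMMITTED, TX_ABORTED };

/* What the debugger can learn about a variable owned by a transaction. */
struct owned_var {
    enum tx_status owner_status;  /* status of the owning transaction */
    int pre_value;                /* value before the owner accessed it */
    int tentative_value;          /* value buffered by the owner */
};

struct var_report {
    int logical;          /* current logical value */
    bool has_tentative;   /* true only while the owner is still active */
    int tentative;        /* meaningful only if has_tentative */
};

struct var_report dbg_inspect(const struct owned_var *v)
{
    struct var_report r = { 0, false, 0 };
    assert(v != NULL);
    switch (v->owner_status) {
    case TX_ACTIVE:       /* linearization point not yet passed */
        r.logical = v->pre_value;
        r.has_tentative = true;
        r.tentative = v->tentative_value;
        break;
    case TX_COMMITTED:    /* committed; copy-back may still be in progress */
        r.logical = v->tentative_value;
        break;
    case TX_ABORTED:      /* buffered value will never take effect */
        r.logical = v->pre_value;
        break;
    }
    return r;
}
```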
4.3 Atomic Snapshots 
The debugger can allow the user to define an atomic 
group of variables to be read and/or modified atom- 
ically. Such a feature provides a powerful debugging 
capability that is not available in standard debuggers: 
the ability to get a consistent view of multiple vari- 
ables even in unsynchronized debug mode, when threads 
are running and potentially modifying these variables. 
(It can also be used with synchronized debugging when 
combined with the delayed breakpoint feature; see Sec- 
tion 4.5.) 
Implementing atomic groups using TM is simply done 
by accessing all variables in the group using one transac- 
tion. The variables in the group are read using a single 
transaction. As for modifications, when the user modi- 
fies a variable in an atomic group, the modification does 
not take effect until the user asks to commit all modifi- 
cations to the group, at which point the debugger begins 
a transaction that executes these modifications atomi- 
cally. The transactions can be managed by HTM, STM 
or HyTM. 
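The debugger-side bookkeeping for an atomic group might be sketched as below. Only the staging logic is shown: txn_begin and txn_commit are empty placeholders for whatever transaction interface the TM provides, and in a real system the loads and stores between them would go through the TM's read and write operations. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GROUP_MAX 8

struct group_var {
    int *addr;     /* variable in the group */
    int pending;   /* staged new value, valid if dirty */
    bool dirty;    /* true if the user modified it since the last commit */
};

struct atomic_group {
    struct group_var vars[GROUP_MAX];
    size_t len;
};

/* Placeholders for the TM library's transaction brackets. */
void txn_begin(void) { }
void txn_commit(void) { }

/* Add a variable to the group; returns its index. */
size_t group_add(struct atomic_group *g, int *addr)
{
    assert(g->len < GROUP_MAX);
    g->vars[g->len].addr = addr;
    g->vars[g->len].pending = 0;
    g->vars[g->len].dirty = false;
    return g->len++;
}

/* User edits a variable: stage the value, do not touch memory yet. */
void group_stage(struct atomic_group *g, size_t i, int value)
{
    assert(i < g->len);
    g->vars[i].pending = value;
    g->vars[i].dirty = true;
}

/* Read every variable of the group inside one transaction. */
void group_snapshot(const struct atomic_group *g, int *out)
{
    txn_begin();
    for (size_t i = 0; i < g->len; i++)
        out[i] = *g->vars[i].addr;
    txn_commit();
}

/* Apply all staged modifications inside one transaction. */
void group_commit(struct atomic_group *g)
{
    txn_begin();
    for (size_t i = 0; i < g->len; i++) {
        if (g->vars[i].dirty) {
            *g->vars[i].addr = g->vars[i].pending;
            g->vars[i].dirty = false;
        }
    }
    txn_commit();
}
```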
Note that the displayed values of the group's vari- 
ables may not be their true value at the point the user 
tries to modify them. We can extend this feature with 
a compare-and-swap option, which modifies the values 
of the group's variables only if they contain the previ- 
ously displayed values. This can be done by beginning 
a transaction that first rereads all the group's variables 
and compares them to the previously presented values 
(saved by the debugger), and only if these values all 
match, applies the modifications using the same trans- 
































































can be displayed. 
Finally, the debugger may use a similar approach when
displaying a compound structure, to guarantee that it
displays a consistent view of that structure. Suppose,
for example, that the user views a linked list, starting
at the head node and expanding it node-by-node. Because
in unsynchronized debugging mode the list might
change while being viewed, reading it node-by-node might
display an inconsistent view of the list. The debugger
can use a transaction to reread the nodes leading to
the node the user has just expanded, thereby avoiding
such inconsistency.
4.4 Watchpoints 
Many debuggers support watchpoint functionality, allowing
a user to instruct the debugger to stop when
a particular memory location or variable is modified.
More sophisticated watchpoints, called conditional watchpoints,
can also specify that the debugger should stop
only when a certain predicate holds (for example, that
the variable value is bigger than some number).
Watchpoints are sometimes implemented using specific
hardware support, called hw-breakpoints. If no hw-breakpoint
support is available, some debuggers implement
watchpoints in software, by executing the program
step-by-step and checking the value of the watched
variable(s) after each step, which results in executing the
program hundreds of times slower than normal.
We describe here how to exploit TM infrastructure
to stop on any modification or even a read access to
a transactional variable. The idea is simple: because
the TM implementation needs to keep track of which
transactions access which memory locations, we can use
this tracking mechanism to detect accesses to specific
locations. In particular, with the HyTM implementation
described in Section 2, we can mark the orec that corresponds
to the memory location we would like to watch,
and invoke the debugger whenever a transaction gets
ownership of such an orec. In the hardware code path,
when checking an orec for a possible conflict with a software
transaction, we can also check for a watchpoint
indication on that orec. Depending on the particular
hardware TM support available, it may or may not be
possible to transfer control to the debugger while keeping
the transaction viable. If not, it may be necessary
to abort the hardware transaction and retry the transaction
in software.
The debugger can mark an orec with either a stop- 
on-read or stop-on-write marking. With the first mark- 
ing, the debugger is invoked whenever a transaction 
gets read ownership of that orec (note that some TM 
implementations allow multiple transactions to concur- 
rently own an orec in read mode), and with the latter, 
it is invoked only when a transaction gets write owner- 
ship of that orec. When invoked, the debugger should 
first check whether the accessed variable is one of the 
watchpoint's variables (multiple memory locations may 
be mapped to the same orec). If so, then the debugger 
should stop, or, in the case of a conditional watchpoint, 
evaluate a predicate to decide whether to stop.
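The orec marking and the address check can be sketched as follows. The table size, the address-to-orec mapping, and the restriction to one watched address per orec are simplifying assumptions; the names (dbg_set_watch, watch_hit, orec_for) are hypothetical and the ownership fields of the orec are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT 16   /* deliberately tiny so that collisions are easy to see */

enum watch_kind { WATCH_NONE, WATCH_READ, WATCH_WRITE };

struct orec {
    enum watch_kind watch;      /* stop-on-read or stop-on-write marking */
    const void *watched_addr;   /* the one location being watched */
};

static struct orec orec_table[OREC_COUNT];

/* Assumed mapping: word index modulo table size. Several locations
 * can therefore share one orec. */
struct orec *orec_for(const void *addr)
{
    return &orec_table[((uintptr_t)addr / sizeof(int)) % OREC_COUNT];
}

/* Debugger: mark the orec that covers addr. */
void dbg_set_watch(const void *addr, enum watch_kind kind)
{
    struct orec *o = orec_for(addr);
    assert(kind != WATCH_NONE);
    o->watch = kind;
    o->watched_addr = addr;
}

/* Library: called when a transaction acquires ownership of the orec
 * covering addr; is_write tells which kind of ownership. Returns true
 * if the debugger should be invoked for this access. */
bool watch_hit(const void *addr, bool is_write)
{
    const struct orec *o = orec_for(addr);
    if (o->watch == WATCH_NONE || o->watched_addr != addr)
        return false;   /* unmarked, or another location sharing the orec */
    return is_write ? o->watch == WATCH_WRITE : o->watch == WATCH_READ;
}
```

For a conditional watchpoint, the debugger would evaluate the user's predicate after watch_hit returns true and stop only if the predicate holds.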
Stopping the program upon access to a watchpoint
variable can be done in one of two ways:
1. Immediate-Stop: The debugger can be invoked
immediately when the variable is accessed. While
this gives the user control the first time the
variable is accessed, it has some disadvantages:
- The first value written by the transaction to
the variable may not be the actual value finally
written by the transaction: the transaction
may later change the value written to
this variable, or abort without modifying the
variable at all. In many cases, the user would
not care about these intermediate values of
the variable, or about accesses done by transactions
that do not eventually commit.
- Most STMs do not reacquire ownership of a
location if the transaction modifies it multiple
times. Therefore, if we stop execution only
when the orec is first acquired, we may miss
subsequent modifications that establish the
predicate we are attempting to detect.
2. Stop-on-Commit: This option overcomes the prob- 
lems of the immediate-stop approach, by delaying 
the stopping to the point when the transaction 
commits. That is, instead of invoking the debug- 
ger whenever a marked orec is acquired by a trans- 
action, we invoke it when a transaction that owns 
the orec commits; this can be achieved for example 
by recording an indication that the transaction has 
acquired a marked orec when it does so, and then 
invoking the debugger upon commit if this indica- 
tion is set. That way the user sees the value actu- 
ally written to  the variable, since a t  that point no 
other transaction can abort the triggering trans- 
action anymore. While this approach has many 
advantages over the immediate-stop approach, it 
also has the disadvantage that the debugger will 
never stop on an aborted transaction that tried 
to modify the variable, which in some cases might 
be desirable for example when chasing a slippery 
bug that rarely occurs. Therefore, it may be desir- 
able to support both options, and allow the user 
to  choose between them. Also, when using the 
stop-on-commit approach, the user cannot see how 
exactly the written value was calculated by the 
transaction, although this problem can be miti- 
gated by the replay debugging technique describes 
in Section 4.6. 
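The stop-on-commit option above amounts to recording a flag at acquisition time and testing it at commit time. The following is a minimal sketch of that idea only; the Transaction class, its fields, and the on_stop callback are hypothetical and stand in for whatever the STM and debugger actually provide.

```python
class Transaction:
    """Toy transaction with a deferred (stop-on-commit) watchpoint check."""

    def __init__(self, marked_orecs, on_stop):
        self.marked_orecs = marked_orecs  # ids of orecs carrying a watch mark
        self.on_stop = on_stop            # hypothetical debugger callback
        self.hit_marked_orec = False      # the recorded indication
        self.write_set = {}               # address -> tentative value

    def write(self, orec_id, address, value):
        if orec_id in self.marked_orecs:
            # Remember that a marked orec was acquired, but keep running.
            self.hit_marked_orec = True
        self.write_set[address] = value

    def abort(self):
        # An aborted transaction never reaches the debugger.
        self.write_set.clear()
        self.hit_marked_orec = False

    def commit(self, memory):
        if self.hit_marked_orec:
            # The debugger sees only the values that will actually be written.
            self.on_stop(dict(self.write_set))
        memory.update(self.write_set)
```

In this sketch a transaction that writes a watched location twice triggers a single stop that reports only the final value, and an aborted transaction triggers none, which matches both the advantage and the disadvantage noted above.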
While the above description assumes a TM
implementation that uses orecs, the techniques we propose
are also applicable to other TM approaches. For example,
in object-based TM implementations like the one
by Herlihy et al. [6], we can stop on any access to an
object, since any such access requires opening the object
first: we can change the method used for opening an
object to check whether a watchpoint was set on that
object. This might be optimized by recording an
indication in an object header or handle that a watchpoint
has been set on that object.

In some cases, the user may want to put a watchpoint
on a field whose location may change dynamically.
Suppose, for example, that the user is debugging a linked
list implementation, and wishes to stop whenever some
transaction accesses the value in the first node of the
list, or when some predicate involving this value is
satisfied. The challenge is that the address of the field
storing the value in the first node of the list is indicated
by head->value, and this address changes when head
is changed, for example when inserting or removing the
first node in the list. In this case, the address of the
variable being watched changes. We call this type of
watchpoint a dynamic watchpoint.
We can implement a dynamic watchpoint on head->value
as follows: when the user asks to put a watchpoint
on head->value, the debugger puts a regular watchpoint
on the current address of head->value, and a
special debugger-watchpoint on the address of head.
The debugger-watchpoint on head is special in the sense
that it does not give control to the user when head
is accessed: instead, the debugger cancels the previous
watchpoint on head->value at that point, and puts a
new watchpoint on the new location of head->value.
That is, the debugger uses the debugger-watchpoint on
head to detect when the address of the field the user
asked to watch is changed, and changes the watchpoint
on that field accordingly.
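The re-arming behaviour of a dynamic watchpoint can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: the Debugger class and its on_write hook are hypothetical, locations are identified by object and field name rather than by address, and only writes are tracked.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None


class Debugger:
    """Keeps a watchpoint on head->value as head itself changes."""

    def __init__(self, lst):
        self.lst = lst
        self.stops = []             # values reported to the user
        self.watched_node = None    # node whose value field is watched
        self._rearm()

    def _rearm(self):
        # Cancel the old watchpoint on head->value and place a new one
        # on the value field of whatever node head points to now.
        self.watched_node = self.lst.head

    def on_write(self, obj, field_name):
        """Hypothetical hook invoked by the TM after each committed write."""
        if obj is self.lst and field_name == "head":
            # Debugger-watchpoint: re-arm silently, do not stop.
            self._rearm()
        elif obj is self.watched_node and field_name == "value":
            # Regular watchpoint: hand control (here, the value) to the user.
            self.stops.append(obj.value)
```

After a new node is inserted at the front of the list, writes to the old first node no longer stop the program, while writes to the new first node do.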
4.4.2 Multi-Variable Conditional Watchpoints 
Watching multiple variables together may also be use- 
ful when the user would like to condition the watch- 
point on more than one variable: for example, to stop 
only if the sum of two variables is greater than some 
value. We denote such a watchpoint as a multi-variable 
conditional-watchpoint. With such a watchpoint, the 
user asks the debugger to stop on the first memory mod- 
ification that satisfies the predicate. 
To implement a multi-variable conditional watchpoint,
the debugger can place a watchpoint on each of the
variables, and evaluate the predicate whenever one of
these variables is modified. We call the transaction
that caused the predicate evaluation to be invoked the
triggering transaction. One issue to be considered
is that evaluating the predicate requires accessing the
other watched variables. This can be done as follows:
• The debugger uses the stop-on-commit approach,
so that when a transaction that modifies any of the
predicate variables commits, we stop execution
either before or after the transaction commits. In
either case, we ensure that the transaction still has
ownership of all of the orecs it accessed, and we
ensure that these ownerships are not revoked by
any other threads that continue to run, for example
by making the triggering transaction a
super-transaction.
• When evaluating the predicate, the debugger
distinguishes between two kinds of variables: ones
that were accessed by the triggering transaction,
which we call triggering variables, and the
rest, which we call external variables.
External variables might be accessed by using the
stopped transaction, or by using another transaction
initiated by the debugger. In the latter case,
because the triggering transaction is stopped and
retains ownership of the orecs it accessed while
the new transaction that evaluates the external
variables executes, the specified condition can be
evaluated atomically.
While reading the external variables, conflicts with
other transactions that access these variables may
occur. One option is to simply abort the conflicting
transaction. However, this may be undesirable,
because we may prefer that the debugger
has minimal impact on program execution. As
discussed in Section 4.2.2, it is possible in some
cases to determine the pre-transaction value for
the watched variable without aborting the
transaction that is accessing it.
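The distinction between triggering and external variables can be made concrete with a small sketch. The function name and its parameters are hypothetical, and the sketch simplifies the scheme above in two ways: triggering variables are taken to be only those the committing transaction wrote, and external variables are read directly from a dictionary standing in for memory rather than through a debugger-initiated transaction.

```python
def should_stop_on_commit(predicate, watched, write_set, memory):
    """Decide whether a committing transaction satisfies a watchpoint predicate.

    predicate: function from {variable: value} to bool
    watched:   set of variables named in the predicate
    write_set: {variable: value} written by the triggering transaction
    memory:    {variable: value} committed state seen for external variables
    """
    if watched.isdisjoint(write_set):
        return False  # the transaction modified none of the watched variables

    view = {}
    for var in watched:
        if var in write_set:
            view[var] = write_set[var]   # triggering variable: tentative value
        else:
            view[var] = memory[var]      # external variable: committed value
    return bool(predicate(view))
```

For the example in the text (stop only if the sum of two variables exceeds some value), the predicate would be a function such as `lambda v: v["x"] + v["y"] > 10` applied to this combined view.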
4.5 Delayed Breakpoints 
Stopping at a breakpoint and running the program
step-by-step affects the behavior of the program, and 
particularly the timing of interactions between the threads. 
Placing a breakpoint inside an atomic block may result 
in even more severe side-effects, because the behavior of 
atomic blocks may be very sensitive to timing modifica- 
tions since they may be aborted by concurrent conflict- 
ing transactions. These effects may make it difficult to 
reproduce a bug scenario. 
To exploit the benefits of breakpoint debugging while 
attempting to minimize such effects, we suggest the de- 
layed breakpoint mechanism. A delayed breakpoint is a 
breakpoint in an atomic block that does not stop the 
execution of the program until the transaction imple- 
menting the atomic block commits. To support delayed 
breakpoints, rather than stopping program execution 
when an instruction marked as a delayed breakpoint is 
executed, we merely set a flag that indicates that the 
transaction has hit a delayed breakpoint, and resume 
execution. Later, upon committing, we stop the pro- 
gram execution if this indication is set. Besides the 
advantage of impacting execution timing less, this tech- 
nique also avoids stopping execution in the case that a 
transaction executes a breakpoint instruction, but then 
aborts (either explicitly or due to a conflict with another 
transaction). In many cases, it will be preferable to only 
stop at a breakpoint in a transaction that subsequently
commits. 
One simple type of delayed breakpoint stops on the
instruction following the atomic block if the transaction
implementing the atomic block hit the breakpoint
instruction in the atomic block. This kind of delayed
breakpoint can be implemented even when the atomic
block is executed using HTM. The
debugger simply replaces the breakpoint instruction in
the HTM-based implementation with a branch to a piece of
code that executes that instruction and raises a flag
indicating that execution should stop on the
instruction following the atomic block. This simple
approach has the disadvantage that the values written by
the atomic block may have already been changed by

a state of the world that differs from the state when the 
breakpoint instruction was hit. Moreover, if the transaction
is executed in hardware, then unless there is specific
hardware support for this purpose, the user would
not be able to get any information about the transaction
execution (such as which values were read or written).
On the other hand, if the atomic block is executed
by a software transaction, we can have a more powerful
type of delayed breakpoint, which stops at the commit
point of the executing transaction. More precisely,
the debugger tries to stop at a point during the commit
operation of that transaction at which the transaction
is guaranteed to commit successfully, but no
other transaction has yet seen its effects on memory. This
can be done by having the commit operation check the
flag that indicates whether a delayed breakpoint placed in the
atomic block was hit by the transaction, and if so do
the following:
1. Make the transaction a super-transaction (see Sec- 
tion 4.1.1 for details). 
2. Validate the transaction. That is, make sure that 
the transaction can commit. If validation fails, 
abort the transaction, fail the commit operation, 
and resume execution. 
3. Give control to the user. 
4. When the user asks to continue execution, com- 
mit the transaction. Note that, depending on how 
super-transactions are supported, a lightweight com- 
mit may be applicable here if we can be sure that 
the transaction cannot be aborted after becoming 
a super-transaction. 
The idea behind the above procedure is simple: Guar- 
antee that all future conflicts will be resolved in favor 
of the transaction that hit the breakpoint, check that 
the transaction can still commit, and then give control 
to the user, who can subsequently decide to commit the
transaction. 
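The four-step commit procedure above can be sketched as follows. This is only an illustration: the Tx class, the is_super flag, and the give_control_to_user callback are hypothetical, and validation is modelled as a simple comparison of the values the transaction read against current memory, which is one possible validation scheme rather than the one any particular STM uses.

```python
class Tx:
    def __init__(self, read_set, write_set):
        self.read_set = read_set      # address -> value observed when read
        self.write_set = write_set    # address -> tentative value
        self.hit_delayed_breakpoint = False
        self.is_super = False


def validate(tx, memory):
    # Assumed value-based validation: every value read is still current.
    return all(memory.get(addr) == val for addr, val in tx.read_set.items())


def commit(tx, memory, give_control_to_user):
    """Commit `tx`, stopping first if it hit a delayed breakpoint.

    Returns True if the transaction committed, False if it aborted.
    """
    if tx.hit_delayed_breakpoint:
        tx.is_super = True                 # step 1: win all future conflicts
        if not validate(tx, memory):       # step 2: can it still commit?
            return False                   #         abort and resume execution
        give_control_to_user(tx)           # step 3: the user inspects state
    elif not validate(tx, memory):
        return False                       # ordinary commit path
    memory.update(tx.write_set)            # step 4: make the writes visible
    return True
```

The ordering matters: promotion to super-transaction precedes validation so that nothing can invalidate the transaction between the check and the moment the user resumes it.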
At Step 3 the debugger stops the execution of the
commit operation and gives control to the user. This is
the point where the user gets to know that a committed
execution of the atomic block has hit the delayed
breakpoint. At that point, the user can view various
variables, including those accessed by the transaction,
to try to understand the effect of that execution. In
Section 4.6, we describe other techniques that can give
the user more information about the committed
transaction's execution at that point.
4.5.1 Combining with Atomic Groups 
One disadvantage of using a delayed breakpoint is
that if the user views variables not accessed by the
transaction, the values seen are those at the time the debugger
stops rather than at the time the breakpoint instruction
was executed. Therefore, it may be useful to combine the
delayed breakpoint mechanism with the atomic group 
feature (Section 4.3): with this combination, the user 
can associate with the delayed breakpoint an atomic 
group of variables whose values should be recorded when 
the delayed breakpoint instruction is executed. When 
the delayed breakpoint instruction is hit, besides trig- 
gering a breakpoint at  the end of the transaction, the 
debugger gets the atomic group's value (as described in 
Section 4.3), and presents it to the user when it later 
stops in the transaction's commit phase. 
4.6 Replay Debugging for Atomic Blocks 
It is useful to be able to determine how the program
reached a breakpoint. Replay debugging has been
suggested in a variety of contexts to support such
functionality, and support ranging from special hardware to
user libraries has been proposed (see [12, 14] for two
recent examples). Replay debugging for multithreaded
concurrent applications generally requires logging that
can add significant overhead. In this section, we explain
how STM infrastructure can be exploited to support
replaying atomic blocks, without the need for additional
logging. We also explain how the user can experiment
with alternative executions of the atomic block by
modifying data, and even commit an alternative execution
instead of the original one. To our knowledge, previous
replay debugging proposals do not include such
functionality.
The idea behind our replay debugging technique is to
exploit the fact that the behavior of most atomic blocks
is uniquely determined by the values they read from
memory (see footnote 3). Some STM implementations record values read by
the transaction in a readset. Others preserve these
values in memory until the transaction commits, at which
point the values may be overwritten by new values
written by the transaction. In either case, if we modify
the STM to allow the debugger access to this
information, then the debugger can reconstruct the execution of the
transaction, as explained in more detail below:
• The debugger maintains its own write set for the
transaction. This is necessary to allow the
debugger to determine the values returned by reads
from locations that the transaction has previously
written. The replay begins with an empty write
set.
• The replay procedure starts from the beginning of
the debugged atomic block, and executes all
instructions that are not STM-library function calls
as usual.
• The replay procedure ignores all STM-library
function calls except the ones that implement the
transactional read/write operations.
• When the replay procedure reaches a transactional
write operation, it writes the value in the write set
maintained by the debugger.
• When the replay procedure reaches a transactional
read operation, it first searches the write set
maintained by the debugger. If a value for the address
being read is there, this is the value read by the
transactional read operation. Otherwise, the original
value read by the transaction is used (acquired
from the readset or from memory, depending on
the STM implementation).
Footnote 3: We call such atomic blocks transactionally
deterministic. While the techniques described in this section may
be useful even for blocks that the compiler cannot prove
are transactionally deterministic, in this case the user
should be informed that the displayed execution might
not be identical to the one that triggered the breakpoint.
Because the debugged transaction retains ownership
of orecs it acquired during the original execution,
memory locations it accesses cannot change during replay,
so the replayed execution is faithful to the original.
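The replay procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the atomic block is modelled as a function that receives its transactional read and write operations as arguments (tm_read and tm_write are hypothetical names), and the values originally read are supplied as a dictionary standing in for the readset or preserved memory.

```python
def replay(atomic_block, original_reads):
    """Re-execute a transactionally deterministic atomic block.

    atomic_block:   function taking (tm_read, tm_write)
    original_reads: {address: value} read by the original execution
    Returns the write set reconstructed by the debugger.
    """
    write_set = {}  # the debugger's own write set; the replay starts empty

    def tm_read(address):
        # A location written earlier in the block returns the written value;
        # otherwise the value the original execution read is used.
        if address in write_set:
            return write_set[address]
        return original_reads[address]

    def tm_write(address, value):
        # Writes go only to the debugger's write set, never to memory.
        write_set[address] = value

    atomic_block(tm_read, tm_write)
    return write_set


def transfer(tm_read, tm_write):
    # Example atomic block: move 10 units from "a" to "b", then record a total.
    a = tm_read("a")
    tm_write("a", a - 10)
    tm_write("b", tm_read("b") + 10)
    tm_write("total", tm_read("a") + tm_read("b"))
```

Because the replay touches no shared memory and takes no ownership, it can be repeated, or run with modified entries in original_reads to explore an alternative execution.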
Replay debugging functionality can be combined with
various other features we have described. For example,
by combining replay debugging with the delayed
breakpoint feature described in Section 4.5, we can create
the illusion that control has stopped inside an atomic
block, although it has actually already run to its commit
point. Then, the replay functionality allows the user to
step through the remainder of the atomic block before
committing it. It is even possible to allow
experimentation with alternative executions of a debugged atomic
block, for example by changing values it reads or writes.
In some cases, we may wish to do so without affecting
the actual program execution. In other cases, we may
prefer to change the actual execution, and subsequently
resume normal debugging. One way to handle the latter
case is to abort the current transaction without
releasing orecs, and replay it up to the point at which the
user wishes to change something. This way, we
guarantee that the transaction will reexecute up to this point
identically to how it did in the first place.
Combining replay debugging with other debugger fea- 
tures we have proposed can support a rather powerful 
debugging environment for transactional programs. 
Acknowledgements 
We thank Maurice Herlihy for suggesting the ability to 
see a transaction's tentative values (Section 4.2.3). 
5. REFERENCES 
[1] ANANIAN, C. S., ASANOVIĆ, K., KUSZMAUL,
B. C., LEISERSON, C. E., AND LIE, S.
Unbounded transactional memory. In Proceedings
of the 11th International Symposium on
High-Performance Computer Architecture
(HPCA '05) (San Francisco, California, Feb.
2005), pp. 316-327.
[2] ANANIAN, C. S., AND RINARD, M. Efficient
object-based software transactions. In Workshop
on Synchronization and Concurrency in
Object-Oriented Languages (SCOOL) (Oct. 2005).
[3] FRASER, K. Practical lock freedom. PhD thesis,
Cambridge University Computer Laboratory,
2003. Also available as Technical Report
UCAM-CL-TR-579.
[4] HAMMOND, L., WONG, V., CHEN, M.,
CARLSTROM, B. D., DAVIS, J. D., HERTZBERG,
B., PRABHU, M. K., WIJAYA, H., KOZYRAKIS,
C., AND OLUKOTUN, K. Transactional memory
coherence and consistency. In Proceedings of the
31st Annual International Symposium on
Computer Architecture. IEEE Computer Society,
Jun 2004, p. 102.
[5] HERLIHY, M. Wait-free synchronization. ACM
Transactions on Programming Languages and
Systems 13, 1 (January 1991), 124-149.
[6] HERLIHY, M., LUCHANGCO, V., MOIR, M., AND
SCHERER, W. N. Software transactional memory
for dynamic-sized data structures. In Proceedings
of the 22nd Annual ACM Symposium on
Principles of Distributed Computing (Jul 2003),
pp. 92-101.
[7] HERLIHY, M., AND MOSS, J. Transactional
memory: Architectural support for lock-free data
structures. Tech. Rep. CRL 92/07, Digital
Equipment Corporation, Cambridge Research
Lab, 1992.
[8] KUMAR, S., CHU, M., HUGHES, C., KUNDU, P.,
AND NGUYEN, A. Hybrid transactional memory.
In Proceedings of the 11th ACM SIGPLAN
Symposium on Principles and Practice of Parallel
Programming (2006).
[9] LEV, Y., AND MAESSEN, J. Towards a safer
interaction with transactional memory by tracking
object visibility. In Workshop on Synchronization
and Concurrency in Object-Oriented Languages
(SCOOL) (Oct. 2005).
[10] MARATHE, V. J., SCHERER III, W. N., AND
SCOTT, M. L. Adaptive software transactional
memory. Tech. rep., Cracow, Poland, Sep 2005.
Earlier but expanded version available as TR 868,
University of Rochester Computer Science Dept.,
May 2005.
[11] MOIR, M. Hybrid transactional memory, Jul
2005. http://www.cs.wisc.edu/trans-memory/misc-papers/moir:hybrid-tm:tr:2005.pdf.
[12] NARAYANASAMY, S., POKAM, G., AND CALDER,
B. Bugnet: Continuously recording program
execution for deterministic replay debugging. In
ISCA '05: Proceedings of the 32nd Annual
International Symposium on Computer
Architecture (Washington, DC, USA, 2005), IEEE
Computer Society, pp. 284-295.
[13] RAJWAR, R., HERLIHY, M., AND LAI, K.
Virtualizing transactional memory. In ISCA '05:
Proceedings of the 32nd Annual International
Symposium on Computer Architecture
(Washington, DC, USA, 2005), IEEE Computer
Society, pp. 494-505.
[14] SAITO, Y. Jockey: A user-space library for
record-replay debugging. Technical Report
HP-2006-46, HP Laboratories, Palo Alto, CA,
March 2005.
[15] SCHERER, W., AND SCOTT, M. Advanced
contention management for dynamic software
transactional memory. In Proc. 24th Annual ACM
Symposium on Principles of Distributed
Computing (2005).
[16] SHAVIT, N., AND TOUITOU, D. Software 
transactional memory. Distributed Computing 10, 

Session 2: Hardware Transactional Memory

Hardware Acceleration of Software Transactional Memory * 
Arrvindh Shriraman Virendra J. Marathe Sandhya Dwarkadas Michael L. Scott 
David Eisenstat Christopher Heriot William N. Scherer III Michael F. Spear
Department of Computer Science, University of Rochester 
{ashriram,vmarathe,sandhya,scott,eisen,cheriot,scherer,spear}@cs.rochester.edu 
Abstract 
Transactional memory (TM) systems seek to increase scalabil- 
ity, reduce programming complexity, and overcome the various se- 
mantic problems associated with locks. Software TM proposals run 
on stock processors and provide substantial flexibility in policy, but 
incur significant overhead for data versioning and validation in the 
face of conflicting transactions. Hardware TM proposals have the 
advantage of speed, but are typically highly ambitious, embed sig- 
nificant amounts of policy in silicon, and provide no clear migration 
path for software that must also run on legacy machines. 
We advocate an intermediate approach, in which hardware is 
used to accelerate a TM implementation controlled fundamentally 
by software. We present a system, RTM, that embodies this ap- 
proach. It consists of a novel transactional MESI (TMESI) pro- 
tocol and accompanying TM software. TMESI eliminates the key 
software overheads of data copying, garbage collection, and vali- 
dation, without introducing any global consensus algorithm in the 
cache coherence protocol (a commit is allowed to perform using 
only a few cycles of completely local operation). The only change 
to the snooping interface is a "threatened" signal analogous to the 
existing "shared" signal. 
By leaving policy to software, RTM allows us to experiment 
with a wide variety of policies for contention management, dead- 
lock and livelock avoidance, data granularity, nesting, and virtual- 
ization. 
1. Introduction and Background 
Moore's Law has hit the heat wall. Simultaneously, the ability to 
use growing on-chip real estate to extract more instruction-level 
parallelism (ILP) is also reaching its limits. Major microproces- 
sor vendors have largely abandoned the search for more aggres- 
sively superscalar uniprocessors, and are instead designing chips 
with large numbers of simpler, more power-efficient cores. The im- 
plications for software vendors are profound: for 40 years only the 
most talented programmers have been able to write good thread- 
level parallel code; now everyone must do it. 
Parallel programs have traditionally relied on mutual exclusion
locks, but these suffer from both semantic and performance
problems: they are vulnerable to deadlock, priority inversion, and
arbitrary delays due to preemption. In addition, while coarse-grain
lock-based algorithms are easy to understand, they limit
concurrency. Fine-grain locking algorithms are thus often required, but
these are difficult to design, debug, maintain, and understand.
* Presented at TRANSACT: the First ACM SIGPLAN Workshop on
Languages, Compilers, and Hardware Support for Transactional Computing,
held in conjunction with PLDI, Ottawa, Ontario, Canada, June 2006.
This work was supported in part by NSF grants CCR-0204344,
CNS-0411127, and CNS-0509270; an IBM Faculty Partnership Award; financial
and equipment support from Sun Microsystems Laboratories; and financial
support from Intel.
Ad hoc nonblocking algorithms [15, 16, 24, 25] solve the
semantic problems of locks by ensuring that forward progress is never
precluded by the state of any thread or set of threads. They provide 
performance comparable to fine-grain locking, but each such algo- 
rithm tends to be a publishable result. 
Clearly, what we want is something that combines the semantic 
advantages of ad hoc nonblocking algorithms with the conceptual 
simplicity of coarse-grain locks. Transactional memory promises to 
do so. Originally proposed by Herlihy and Moss [8], transactional 
memory (TM) borrows the notions of atomicity, consistency, and 
isolation from database transactions. In a nutshell, the programmer 
or compiler labels sections of code as atomic and relies on the 
underlying system to ensure that their execution is linearizable [7], 
consistent, and as highly concurrent as possible. 
Once regarded as impractical, in part because of limits on the 
size and complexity of 1990s caches, TM has in recent years 
enjoyed renewed attention. Rajwar and Goodman's Transactional 
Lock Removal (TLR) [19, 20] speculatively elides acquire and
release operations in traditional lock-based code, allowing critical 
sections to execute in parallel so long as their write sets fit in cache 
and do not overlap. In the event of conflict, all processors but one 
roll back and acquire the lock conservatively. Timestamping is used 
to guarantee forward progress. Martinez and Torrellas [13] describe 
a related mechanism for multithreaded processors that identifies, in 
advance, a "safe thread" guaranteed to win all conflicts. 
Ananian et al. [1] argue that a TM implementation must
support transactions of arbitrary size and duration. They describe two
implementations, one of which (LTM) is bounded by the size of 
physical memory and the length of the scheduling quantum, the 
other of which (UTM) is bounded only by the size of virtual mem- 
ory. Rajwar et al. [21] describe a related mechanism (VTM) that 
uses hardware to virtualize transactions across both space and time. 
Moore et al. [18] attempt to optimize the common case by making 
transactionally-modified overflow data visible to the coherence pro- 
tocol immediately, while logging old values for roll-back on abort 
(LogTM). Hammond et al. [5] propose a particularly ambitious re- 
thinking of the relationship between the processor and the memory, 
in which everything is a transaction (TCC). However, they require 
heavy-weight global consensus at the time of a commit. 
While we see great merit in all these proposals, it is not yet 
clear to us that full-scale hardware TM will provide the most 
practical, cost-effective, or semantically acceptable implementation 
of transactions. Specifically, hardware TM proposals suffer from 
three key limitations: 
1. They are architecturally ambitious, enough so that commercial
vendors will require very convincing evidence before they are 
willing to make the investment. 
2. They embed important policies in silicon: policies whose im-

evidence suggests that no one static approach may be accept- 
able. 
3. They provide no obvious migration path from current machines 
and systems: programs written for a hardware TM system may 
not run on legacy machines. 
Moir [17] describes a design philosophy for a hybrid transac- 
tional memory system in which hardware makes a "best effort" at- 
tempt to complete transactions, falling back to software when nec- 
essary. The goal of this philosophy is to be able to leverage
almost any reasonable hardware implementation. Kumar et al. [10]
describe a specific hardware-software hybrid that builds on the 
software system of Herlihy et al. [6]. Unfortunately, this system 
still embeds significant policy in silicon. It assumes, for example, 
that conflicts are detected as early as possible (pessimistic concur- 
rency control), disallowing either read-write or write-write sharing. 
Previously published papers [11, 22] reveal performance differences
across applications of 2X - 10X in each direction for different ap- 
proaches to contention management, metadata organization, and 
eagerness of conflict detection (i.e., write-write sharing). It is clear 
that no one knows the "right" way to do these things; it is likely 
that there is no one right way. 
We propose that hardware serve simply to optimize the perfor- 
mance of transactions that are controlled fundamentally by soft- 
ware. This allows us, in almost all cases, to cleanly separate policy 
and mechanism. The former is the province of software, allowing 
flexible policy choice; the latter is supported by hardware in cases 
where we can identify an opportunity for significant performance 
improvement. 
We present a system, RTM, that embodies this software-centric 
hybrid strategy. RTM comprises a Transactional MESI (TMESI) 
coherence protocol and a modified version of our RSTM software 
TM [12]. TMESI extends traditional snooping coherence with a 
"threatened" signal analogous to the existing "shared" signal, and 
with several new instructions and cache states. One new set of states 
allows transactional data to be hidden from the standard coherence 
protocol, until such time as software permits it to be seen. A second 
set allows metadata to be tagged in such a way that invalidation 
forces an immediate abort. 
In contrast to most software TM systems, RTM eliminates, in 
the common case, the key overheads of data copying, garbage col- 
lection, and consistency validation. In contrast to pure hardware 
proposals, it requires no global consensus algorithm in the cache 
coherence protocol, no snapshotting of processor state, and mes- 
sage traffic comparable to that of a regular MESI coherence pro- 
tocol. Nonspeculative loads and stores are permitted in the middle 
of transactions; in fact, they constitute the hook that allows us to
implement policy in software. Among other things, we rely on soft- 
ware to determine the structure of metadata, the granularity of con- 
currency and sharing (e.g., word vs. object-based), and the degree 
to which conflicting transactions are permitted to proceed specu- 
latively in parallel. (We permit, but do not require, read-write and 
write-write sharing, with delayed detection of conflicts.) Finally, 
we employ a software contention manager [22,23] to arbitrate con- 
flicts and determine the order of commits. 
Because conflicts are handled in software, speculatively writ- 
ten data can be made visible at commit time with only a few cy- 
cles of entirely local execution. Moreover, these data (and a small 
amount of nonspeculative metadata) are all that must remain in the 
cache for fast-path execution: data that were speculatively read or 
nonspeculatively written can safely be evicted at any time. Like 
the proposals of Moir and of Kumar et al., RTM falls back to a 
software-only implementation of transactions in the event of over- 
flow (or at the discretion of the contention manager), but in contrast 
not only to the hybrid proposals, but also to TLR, LTM, VTM, and 
LogTM, it can accommodate "fast path" execution of dramatically 
larger transactions with a given size of cache. 
TMESI is intended for implementation either at the L1 level of 
a CMP with a shared L2 cache, or at the L2 level of an SMP with 
write-through L1 caches. We believe that implementations could 
also be devised for directory-based machines (this is one topic 
of our ongoing work). TMESI could also be used with a variety 
of software systems other than RTM. We do not describe such 
extensions here. 
Section 2 provides more detailed background and motivation for 
RTM, including an introduction to software TM in general, a char- 
acterization of its dominant costs, and an overview of how TMESI 
and RTM address them. Section 3 describes TMESI in detail, in- 
cluding its instructions, its states and transitions, and the mecha- 
nism used to detect conflicts and abort remote transactions. Sec- 
tion 4 then describes the RTM software that leverages this hard- 
ware support. Our choice of concrete policies reflects experimen- 
tation with several software TM systems, and incorporates several 
forms of dynamic adaptation to the offered workload. We conclude 
in Section 5 with a summary of contributions, a brief description of 
our simulation infrastructure (currently nearing completion), and a 
list of topics for future research. 
2. RTM Overview 
Software TM systems display a wide variety of policy and imple- 
mentation choices. Our RSTM system [12] draws on experience 
with several of these in an attempt to eliminate as much software 
overhead as possible, and to identify and characterize what re- 
mains. RTM is, in essence, a derivative of RSTM that uses hard- 
ware support to reduce those remaining costs. A transaction that 
makes full use of the hardware support is called a hardware trans- 
action. A transaction that has abandoned that support (due to over- 
flow or policy decisions made by the contention manager) is called 
a software transaction.
2.1 Programming Model 
Like most (though not all) STM systems, RTM is object-based:
updates are made, and conflicts arbitrated, at the granularity of
language-level objects. Only those objects explicitly identified as
Shared are protected by the TM system. Shared objects can- 
not be accessed simultaneously in both transactional and non- 
transactional mode. Other data (local variables, debugging and 
logging information, etc.) can be accessed within transactions, but 
will not be rolled back on abort. 
Before a Shared object can be used within a transaction, it 
must be opened for read-only or read-write access. RTM enforces 
this rule using C++ templates and inheritance, but a functionally 
equivalent interface could be defined through convention in C. The 
open_RO method returns a pointer to the current version of an object,
and performs bookkeeping operations that allow the TM system
to detect conflicts with future writers. The open_RW method,
when executed by a software transaction, creates a new copy, or 
clone of the object, and returns a pointer to that clone, allowing 
other transactions to continue to use the old copy. As in software 
TM systems, a transaction commits with a single compare-and- 
swap (CAS) instruction, after which any clones it has created are 
immediately visible to other transactions. (Like UTM and LogTM, 
software and hybrid TM systems employ what Moore et al. refer 
to as eager version management [18].) If a transaction aborts, its 
clones are discarded. RTM currently supports nested transactions 
only via subsumption in the parent. 
    void intset::insert(int val) {
        BEGIN_TRANSACTION;
        const node* previous = head->open_RO();
        // points to sentinel node
        const node* current = previous->next->open_RO();
        // points to first real node
        while (current != NULL) {
            if (current->val >= val) break;
            previous = current;
            current = current->next->open_RO();
        }
        if (!current || current->val > val) {
            // current may be NULL at the end of the list
            node *n = new node(val, current ? current->shared() : NULL);
            // uses Object<T>::operator new
            previous->open_RW()->next = new Shared<node>(n);
        }
        END_TRANSACTION;
    }

Figure 1. Insertion in a sorted linked list using RTM.

Figure 1 contains an example of C++ RTM code to insert an
element in a singly-linked sorted list of integers. The API is
inherited from our RSTM system [12], which runs on legacy hardware
(space limitations preclude a full presentation here). The
rtm::Shared<T> template class provides an opaque wrapper
around transactional objects. Several crucial methods, including
operator new, are provided by rtm::Object<T>, from which T
must be derived. Within a transaction, bracketed by BEGIN_TRANSACTION
and END_TRANSACTION macros, the open_RO() and
open_RW() methods can be used to obtain const T* and T*
pointers respectively. The shared() method performs the inverse
operation, returning a pointer to the Shared<T> with which this
is associated. Our code traverses the list from the head, opening
objects in read-only mode, until it finds the proper place to insert the
element. It then re-opens the object whose next pointer it needs
to modify in read-write mode. To make such upgrades convenient,
Object<T>::open_RW returns shared()->open_RW().
2.2 Software Implementation 
The two principal metadata structures in RTM are the transaction 
descriptor and the object header. The descriptor contains an indi- 
cation of whether the transaction is active, committed, or aborted. 
The header contains a pointer to the descriptor of the most recent 
transaction to modify the object, together with pointers to old and 
new clones of the data. If the most recent writer committed in soft- 
ware, the new clone is valid; otherwise the old clone is valid. 
Before it can commit, a transaction T must acquire the headers 
of any objects it wishes to modify, by making them point at its 
descriptor. By using a CAS instruction to change the status word in 
the descriptor from active to committed, a transaction can then, in 
effect, make all its updates valid in one atomic step. Prior to doing 
so, it must also verify that all the object clones it has been reading 
are still valid. 
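The relationship between descriptor, header, and clones can be sketched in a few lines of C++. This is a minimal illustration only, not the RSTM or RTM source: the names Descriptor, Header, current_version, acquire, and try_commit are assumptions made for the example, and header updates that a real system would perform with CAS are written as plain assignments.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names throughout; a sketch of the metadata described above.
enum class Status { Active, Committed, Aborted };

struct Descriptor {
    std::atomic<Status> status{Status::Active};
};

template <typename T>
struct Header {
    Descriptor* owner = nullptr;  // most recent transaction to acquire the object
    T* old_clone = nullptr;       // version that was valid before that transaction
    T* new_clone = nullptr;       // version that is valid only if the owner committed
};

// The new clone is valid iff the most recent writer committed.
template <typename T>
const T* current_version(const Header<T>& h) {
    if (h.owner != nullptr && h.owner->status.load() == Status::Committed)
        return h.new_clone;
    return h.old_clone;
}

// Acquire: make the header point at the writer's descriptor and clone.
// (A real system would install this with a CAS on the header.)
template <typename T>
void acquire(Header<T>& h, Descriptor& writer, T* clone) {
    h.old_clone = const_cast<T*>(current_version(h));
    h.new_clone = clone;
    h.owner = &writer;
}

// Commit: one CAS on the status word validates all acquired clones at once.
inline bool try_commit(Descriptor& d) {
    Status expected = Status::Active;
    return d.status.compare_exchange_strong(expected, Status::Committed);
}
```

Until try_commit succeeds, current_version keeps returning the old clone, which is why readers of the old version are undisturbed by an uncommitted writer.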
Acquisition is the hook that allows RTM to detect conflicts 
between transactions. If a writer R discovers that a header it wishes 
to acquire is already "owned" by some other, still active, writer S, 
R consults a software contention manager to determine whether to 
abort S and steal the object, wait a bit in the hope that S will finish, 
or abort R and retry later. Similarly, if any object opened by R 
(for read or write) has subsequently been modified by an already- 
committed transaction, then R must abort. 
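The three outcomes available to the contention manager can be illustrated with a small decision function. The policy below is purely hypothetical and is not one of the managers studied in [22, 23]; the Conflict fields and the max_waits parameter are assumptions chosen to keep the sketch short.

```cpp
#include <cassert>

// The three choices named in the text, for a writer R that finds a header
// owned by a still-active writer S.
enum class Decision { AbortOther, Wait, AbortSelf };

// Hypothetical inputs to the policy.
struct Conflict {
    int my_work;     // objects opened so far by the requesting writer R
    int owner_work;  // objects opened so far by the current owner S
    int waits;       // times R has already waited on this header
};

// Illustrative policy: favor whichever transaction has invested more work,
// and bound the time R spends waiting for S.
inline Decision resolve(const Conflict& c, int max_waits = 3) {
    if (c.my_work > c.owner_work) return Decision::AbortOther;  // steal the object
    if (c.waits < max_waits) return Decision::Wait;             // hope S finishes
    return Decision::AbortSelf;                                 // retry R later
}
```

Because the policy is ordinary code, it can be replaced or tuned per application without touching the hardware.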
RTM can perform acquisition as early as open time or as late 
as just before commit. The former is known as eager acquire, the
latter as lazy acquire. Most hardware TM systems perform the 
equivalent of acquisition by requesting exclusive ownership of a 
cache line. Since this happens as soon as the transaction attempts 
to modify the line, these systems are inherently restricted to eager 
conflict management [18]. They are also restricted to contention
management algorithms simple enough (and static enough) to be
implemented in hardware on a cache miss.

Figure 2. Performance scaling of RSTM, ASTM, and coarse-grain
locking (plot not reproduced; horizontal axis: Threads).

Figure 3. Cost breakdown for RSTM on a single processor, for
five different microbenchmarks (plot not reproduced; benchmarks:
LinkedList, HashTable, RBTree-Small, RBTree-Large, Counter).
Work by Marathe et al. [11] suggests that TM systems should
choose between eager and lazy conflict detection based on the
characteristics of the application, in order to obtain the best
performance (we employ their adaptive heuristics). Likewise, work
by Scherer et al. [22, 23] suggests that the preferred contention
management policy is also application-dependent, and may alter
program run time by as much as an order of magnitude. In both
these dimensions, RTM provides significantly greater flexibility
than pure hardware TM proposals.
2.3 Dominant Costs 
Figure 2 compares the performance of RSTM (the all-software 
system from which RTM is derived) to that of coarse-grain locking 
on a hash-table microbenchmark as we vary the number of threads 
from 1 to 32 on a 16-processor 1.2GHz SunFire 6800. Also shown 
is the performance (in Java) of ASTM, previously reported [11] to
match the faster of Sun's DSTM [6] and the Cambridge OSTM [3] 
across a variety of benchmarks. Each thread in the microbenchmark 
repeatedly inserts, removes, or searches for (one third probability 
of each) a random element in the table. There are 256 buckets, and 
all values are taken from the range 0-255, leading to a steady-state 
average of 0.5 elements per bucket. 
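The steady-state figure can be checked with a short calculation. The sketch assumes that keys are drawn uniformly and that, for any given key, inserts and removes are equally likely, so that in steady state each of the 256 possible keys is present with probability one half.

```latex
\[
  \Pr[\text{key present}] = \tfrac{1}{2},
  \qquad
  \mathbb{E}[\text{elements per bucket}]
    = \frac{256 \times \tfrac{1}{2}}{256}
    = 0.5 .
\]
```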
Unsurprisingly, coarse-grain locking does not scale. Increased 
contention and occasional preemption cause the average time per 
transaction to climb with the number of threads. On a single processor,
however, locking is an order of magnitude faster than ASTM,
and more than 3x faster than RSTM. We need about 4 active
threads in this program before software TM appears attractive from 
a performance point of view. 
Instrumenting code for the single-processor case, we can appor- 
tion costs as shown in Figure 3, for five different microbenchmarks. 
Four (the hash table of Figure 2, the sorted list whose insert operation
appeared in Figure 1, and two red-black trees) are implementations
of the same abstract set. The fifth represents the extreme
case of a trivial critical section: in this case, one that increments a
single integer counter.
In all five microbenchmarks TM overhead dwarfs real execution 
time. Because they have significant potential parallelism, however, 
both HashTable and RBTree outperform coarse-grain locks given 
sufficient numbers of threads. Parallelism is nonexistent in Counter 
and limited in LinkedList: a transaction that updates a node of the 
list aborts any active transactions farther down the list. 
Memory management in Figure 3 includes the cost of allo- 
cating, initializing, and (eventually) garbage collecting clones. 
The total size of objects written by all microbenchmarks other 
than RBTree-Large (which uses 4 KByte nodes instead of the 40 
byte nodes of RBTree-Small) is very small. As demonstrated by 
RBTree-Large, transactions that access a very large object (espe- 
cially if they update only a tiny portion of it) will suffer enormous 
copying overhead. 
In transactions that access many small objects, validation is 
the dominant cost. It reflects a subtlety of conflict detection not 
mentioned in Section 2.2. Suppose transaction R opens objects 
X and Y in read-only mode. In between, suppose transaction S 
acquires both objects, updates them, and commits. Though R is 
doomed to abort (the version of X has changed), it may temporarily 
access the old version of X and the new version of Y. It is not 
difficult to construct scenarios in which this mutual inconsistency 
may lead to arbitrary program errors, induced, for example, by 
stores or branches employing garbage pointers. (Hardware TM 
systems are not vulnerable to this sort of inconsistency, because 
they roll transactions back to the initial processor and memory 
snapshot the moment conflicting data becomes visible to the cache 
coherence protocol.) 
Without a synchronous hardware abort mechanism, RSTM (like 
DSTM and ASTM) requires R to double-check the validity of all 
previously opened objects whenever opening something new. For 
a transaction that accesses a total of n objects, this incremental 
validation imposes O(n^2) total overhead.
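As a minimal illustration of the quadratic bound, assume each open re-checks every previously opened object exactly once. The hypothetical helper below counts the resulting checks; the total is 0 + 1 + ... + (n - 1) = n(n - 1)/2.

```cpp
#include <cassert>
#include <cstdint>

// Counts header checks under incremental validation: opening the k-th
// object re-validates the k - 1 objects opened before it.
inline std::uint64_t incremental_validation_checks(std::uint64_t n) {
    std::uint64_t checks = 0;
    for (std::uint64_t k = 1; k <= n; ++k) {
        checks += k - 1;
    }
    return checks;
}
```

For a transaction that opens 100 objects this comes to 4950 checks, against 100 opens.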
As an alternative to incremental validation, Herlihy's SXM [4] 
and more recent versions of DSTM allow readers to add them- 
selves to a visible reader list in the object header at acquire time. 
Writers must abort all readers on the list before acquiring the ob- 
ject. Readers ensure consistency by checking the status word in 
their transaction descriptor on every open operation. Unfortunately, 
the constant overhead of reader list manipulation is fairly high. In 
practice, incremental validation is cheaper for small transactions 
(as in Counter); visible readers are cheaper for large transactions 
with heavy contention; neither clearly wins in the common middle 
ground [23]. RSTM supports both options; the results in Figures 2 
and 3 were collected using incremental validation. 
2.4 Hardware Support 
RTM uses hardware support (the TMESI protocol) to address the 
memory management and validation overhead of software TM. In 
so doing it eliminates the top two components of the overhead bars 
shown in Figure 3. 
1. The TMESI protocol allows transactional data, buffered in the local
cache, to be hidden from the normal coherence protocol. This
buffering allows RTM, in the common case, to avoid allocating
and initializing a new copy of the object in software. Like most
hardware TM proposals, RTM keeps only the new version of
speculatively modified data in the local cache. The old version
of any given cache line is written through to memory if necessary
at the time of the first transactional store. The new version
becomes visible to the coherence protocol when and if the
transaction commits. Unlike most hardware proposals (but like
TCC), RTM allows data to be speculatively read or even written
when it is also being written by another concurrent transaction.
TCC ensures, in hardware, that only one of the transactions will
commit. RTM relies on software for this purpose.
2. TMESI also allows selected metadata, buffered in the local
cache, to be tagged in such a way that invalidation will cause
an immediate abort of the current transaction. This mechanism
allows the RTM software to guarantee that a transaction never
works with inconsistent data, without incurring the cost of
incremental validation or visible readers (as in software TM),
without requiring global consensus for hardware commit, and
without precluding read-write and write-write speculation.

Table 1. ISA Extensions for RTM.

Instruction              Description
SetHandler (H)           Indicate address of user-level abort handler
TLoad (A, R)             Transactional Load from A into R
TStore (R, A)            Transactional Store from R into A
ALoad (A, R)             Load A into R; tag "abort on invalidate"
ARelease (A)             Untag ALoaded line
CAS-Commit (A, O, N)     End Transaction
Abort                    Invoked by transaction to abort itself
Wide-CAS (A, O, N, K)    Update K (currently up to 4) adjacent words
                         atomically
To facilitate atomic updates to multiword metadata (which 
would otherwise need to be dynamically allocated, and accessed 
through a one-word pointer), RTM also provides a wide compare- 
and-swap, which atomically inspects and updates several adjacent 
locations in memory (all within the same cache line). 
A transaction could, in principle, use hardware support for cer- 
tain objects and not for others. For the sake of simplicity, our ini- 
tial implementation of RTM takes an all-or-nothing approach: a 
transaction initially attempts to leverage TMESI support for write 
buffering and conflict detection of all of its accessed objects. If it 
aborts for any reason, it retries as a software transaction. Aborts 
may be caused by conflict with other transactions (detected through 
invalidation of tagged metadata), by the loss of buffered state to 
overflow or insufficient associativity, or by executing the Abort in- 
struction. (The kernel executes Abort on every context switch.) 
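The all-or-nothing retry policy can be sketched as a small driver loop. The names Mode, Attempt, Outcome, and run_transaction are hypothetical, and the body callback stands in for whatever code the BEGIN_TRANSACTION and END_TRANSACTION macros actually generate.

```cpp
#include <cassert>
#include <functional>

enum class Mode { Hardware, Software };

// Hypothetical signature: runs the transaction body once in the given mode
// and returns true if that attempt committed.
using Attempt = std::function<bool(Mode)>;

struct Outcome {
    Mode committed_in;
    int attempts;
};

// One hardware attempt; after any abort (conflict, overflow, or an explicit
// Abort) every retry runs as a software transaction.
inline Outcome run_transaction(const Attempt& body) {
    int attempts = 1;
    if (body(Mode::Hardware)) return {Mode::Hardware, attempts};
    while (true) {
        ++attempts;
        if (body(Mode::Software)) return {Mode::Software, attempts};
    }
}
```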
3. TMESI Hardware Details 
In this section, we discuss the details of hardware acceleration for 
common-case transactions, which have bounded time and space 
requirements. In order, we consider ISA extensions, the TMESI 
protocol itself, and support for conflict detection and immediate 
aborts. 
3.1 ISA Extensions 
RTM requires eight new hardware instructions, listed in Table 1. 
The SetHandler instruction indicates the address to which con- 
trol should branch in the event of an immediate abort (to be dis- 
cussed at greater length in Section 3.3). This instruction could be 
executed at the beginning of every transaction, or, with OS kernel 
support, on every heavyweight context switch. 
The TLoad and TStore instructions are transactional loads and
stores. All accesses to transactional data are transformed (via compiler
support) to use these instructions. They move the target line
to one of five transactional states in the local cache. Transactional
states are special in two ways: (1) they are not invalidated by
read-exclusive requests from other processors; (2) if the line has been
the subject of a TStore, then they do not supply data in response
to read or read-exclusive requests. More detail on state transitions
appears in Section 3.2.















































The ALoad instruction supports immediate aborts of remote 
transactions. When it acquires a to-be-written object, RTM per- 
forms a nontransactional write to the object's header. Any reader 
transaction whose correctness depends on the consistency of that 
object will previously have performed an ALoad on the header (at 
the time of the open). The read-exclusive message caused by the 
nontransactional write then serves as a broadcast notice that imme- 
diately aborts all such readers. A similar convention for transaction 
descriptors allows hardware transactions to immediately abort soft- 
ware transactions even if those software transactions don't have 
room for all their object headers in the cache (more on this in 
Section 3.3). In contrast to most hardware TM proposals, which 
eagerly abort readers whenever another transaction performs a 
conflicting transactional store, TMESI allows RTM to delay acquires
when speculative read-write or write-write sharing is desirable [11].
The ARelease instruction erases the abort-on-invalidate tag of 
the specified cache line. It can be used for early release, a software 
optimization that dramatically improves the performance of certain 
transactions, notably those that search large portions of a data 
structure prior to making a local update [6, 11]. It is also used by
software transactions to release an object header after copying the 
object's data. 
The CAS-Commit instruction performs the usual function of 
compare-and-swap. In addition, speculatively read lines (the trans- 
actional and abort-on-invalidate lines) are untagged and revert to 
their corresponding MESI states. If the CAS succeeds, specula- 
tively written lines become visible to the coherence protocol and 
begin responding to coherence messages. If the CAS fails, specula- 
tively written lines are invalidated, and control transfers to the loca- 
tion registered by SetHandler. The motivation behind CAS-Commit 
is simple: software TM systems invariably use a CAS to commit the
current transaction; we overload this instruction to make buffered 
transactional state once again visible to the coherence protocol. 
The Abort instruction clears the transactional state in the cache 
in the same manner as a failed CAS-Commit. Its principal use is to 
implement condition synchronization by allowing a transaction to 
abort itself when it discovers that its precondition does not hold. 
Such a transaction will typically then jump to its abort handler. 
Abort is also executed by the scheduler on every context switch. 
The Wide-CAS instruction allows a compare-and-swap across 
multiple contiguous locations (within a single cache line). As in 
Itanium's cmp8xchg16 instruction [9], if the first two words at
location A match their "old" values, all words are swapped with the 
"new" values (loaded into contiguous registers). Success is detected 
by comparing old and new values in the registers. Wide-CAS is 
intended for fast update of object headers. 
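The intended semantics of Wide-CAS can be emulated in software for experimentation. In the sketch below a mutex stands in for the atomicity that the hardware provides within a single cache line; the Line type and the wide_cas function are assumptions for illustration, and only the comparison of the first two words described above is modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kMaxWords = 4;  // "currently up to 4" in Table 1
using Words = std::array<std::uint64_t, kMaxWords>;

// Software stand-in for K adjacent words within one cache line.
struct Line {
    Words words{};
    std::mutex guard;  // plays the role of hardware atomicity
};

// Emulates Wide-CAS(A, O, N, K): if the first two words match their "old"
// values, the first k words are replaced by the "new" values.
inline bool wide_cas(Line& line, const Words& expected,
                     const Words& desired, std::size_t k) {
    if (k < 2 || k > kMaxWords) return false;
    std::lock_guard<std::mutex> lock(line.guard);
    if (line.words[0] != expected[0] || line.words[1] != expected[1]) {
        return false;
    }
    for (std::size_t i = 0; i < k; ++i) {
        line.words[i] = desired[i];
    }
    return true;
}
```

An object header that holds, say, an owner pointer plus old and new clone pointers could then be updated in one step rather than through a separately allocated record.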
3.2 TMESI Protocol 
A central goal of our design has been to maximize software flexi- 
bility while minimizing hardware complexity. Like most hardware 
TM proposals (but unlike TCC or Herlihy & Moss's original pro- 
posal), we use the processor's cache to buffer a single copy of each 
transactional line, and rely on shared lower levels of the memory 
hierarchy to hold the old values of lines that have been modified 
but not yet committed. Like TCC (but unlike most other hardware
systems), we permit mutually inconsistent versions of a line to reside
in different caches. Where TCC requires an expensive global
arbiter to resolve these inconsistencies at commit time, we rely on 
software to resolve them at acquire time. The validation portion 
of a CAS-Commit is a purely local operation (unlike TCC, which 
broadcasts all written lines) that exposes modified lines to subse- 
quent coherence traffic. 
Our protocol requires no bus messages other than those already 
required for MESI. We add two new processor messages, PrTRd 
and PrTWr, to reflect TLoad and TStore instructions, respectively,
but these are visible only to the local cache. We also add a "threat- 
ened" bus signal (T) analogous to the existing "shared" signal (S). 
The T signal serves to warn a reader transaction of the existence of 
a potentially conflicting writer. Because the writer's commit will be 
a local operation, the reader will have no way to know when or if it 
actually occurs. It must therefore make a conservative assumption 
when it reaches the end of its own transaction (until then the line is 
protected by the software TM protocol). 
3.2.1 State transitions 
Figure 4 contains a state transition diagram for the TMESI protocol. 
The four states on the left comprise the traditional MESI protocol. 
The five states on the right, together with the bridging transitions, 
comprise the TMESI additions. Cache lines move from a MESI 
state to a TMESI state on a transactional read or write. Once a 
cache line enters a TMESI state, it stays in the transactional part 
of the state space until the current transaction commits or aborts, 
at which time it reverts to the appropriate MESI state, indicated by 
the second (commit) or third (abort) letters of the transactional state 
name. 
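The naming convention lends itself to a direct encoding. The sketch below, with hypothetical names Mesi, from_letter, and revert, derives the post-transaction MESI state purely from the letters of the transactional state name.

```cpp
#include <cassert>
#include <string>

enum class Mesi { I, S, E, M };

inline Mesi from_letter(char c) {
    switch (c) {
        case 'S': return Mesi::S;
        case 'E': return Mesi::E;
        case 'M': return Mesi::M;
        default:  return Mesi::I;
    }
}

// tstate is one of "TII", "TSS", "TEE", "TMM", "TMI". The second letter is
// the state after commit; the third letter is the state after abort.
inline Mesi revert(const std::string& tstate, bool committed) {
    return from_letter(committed ? tstate.at(1) : tstate.at(2));
}
```

Only TMI has different commit and abort targets, which is what makes it the interesting case for the tag encoding.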
The TSS, TEE, and TMM states behave much like their MESI 
counterparts. In particular, lines in these states continue to supply 
data in response to bus messages. The two key differences are
(1) on a PrTWr we transition to TMI; (2) on a BusRdX (bus 
read exclusive) we transition to TII. These two states have special 
behavior that serves to support speculative read-write and write- 
write sharing. Specifically, TMI indicates that a speculative write 
has occurred on the local processor; TII indicates that a speculative 
write has occurred on a remote processor, but not on the local 
processor. 
A TII line must be dropped on either commit or abort, because 
a remote processor has made speculative changes which, if com- 
mitted, would render the local copy stale. No writeback or flush is 
required since the line is not dirty. Even during a transaction, silent
eviction and re-read is not a problem because software ensures that 
no writer can commit unless it first aborts the reader. A TMI line 
is the complementary side of the scenario. On abort it must be 
dropped, because its value was incorrectly speculated. On commit 
it will be the only valid copy; hence the reversion to M. Software 
must ensure that conflicting writers never both commit, and that if 
a conflicting reader and writer both commit, the reader does so first 
from the point of view of program semantics. Lines in TMI state 
assert the T signal on the bus in response to BusRd messages. The 
reading processor then transitions to TII rather than TSS or TEE. 
Processors executing a TStore instruction (writing processors) con- 
tinue to transition to TMI; only one of the writers will eventually 
commit, resulting in only one of the caches reverting to M state. 
Lines originally in M or TMM state require a writeback on the first 
TStore to ensure that memory has the latest non-speculative value. 
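The transitions described in this subsection can be collected into a small software model. This is a sketch of only those transitions stated in the text, not a transcription of Figure 4; the function names are hypothetical, and the treatment of plain MESI lines on BusRdX follows the conventional MESI protocol.

```cpp
#include <cassert>

enum class State { I, S, E, M, TII, TSS, TEE, TMM, TMI };

struct StoreResult {
    State next;
    bool writeback;  // memory must first receive the non-speculative value
};

// PrTWr (local TStore): the line moves to TMI; lines that were in M or TMM
// are written back first.
inline StoreResult on_tstore(State s) {
    return {State::TMI, s == State::M || s == State::TMM};
}

// BusRdX observed from another processor.
inline State on_bus_rdx(State s) {
    switch (s) {
        case State::TSS:
        case State::TEE:
        case State::TMM:
        case State::TII:
            return State::TII;  // remote speculative write; not invalidated
        case State::TMI:
            return State::TMI;  // local speculative write is retained
        default:
            return State::I;    // plain MESI lines are invalidated as usual
    }
}

// TLoad that misses in the local cache: the "threatened" signal takes
// precedence over the "shared" signal.
inline State on_tload_miss(bool threatened, bool shared) {
    if (threatened) return State::TII;
    return shared ? State::TSS : State::TEE;
}
```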
Among hardware TM systems, only TCC and RTM support
read-write and write-write sharing; all the other schemes mentioned 
in Sections 1 and 2 use eager conflict detection. By allowing a 
reader transaction to commit before a conflicting writer acquires the 
contended object, RTM permits significant concurrency between 
readers and long-running writers. Write-write sharing is more prob- 
lematic, since only one transaction can usually commit, but may be 
desirable in conjunction with early release [11]. Note that nothing
about the TMESI protocol requires read-write or write-write shar- 
ing; if the software protocol detects and resolves conflicts eagerly, 
the TII and TMI states will simply go unused. 
In addition to the states shown in Figure 4, the TMESI protocol 




















































































Figure 4. TMESI Protocol. Dashed boxes enclose the MESI and TMESI subsets of the state space. All TMESI lines revert to MESI states in 
the wake of a CAS-Commit or Abort. Specifically, the 2nd and 3rd letters of a TMESI state name indicate the MESI state to which to revert on
commit or abort, respectively. Notation on transitions is conventional: the part before the slash is the triggering message; after is the ancillary 
action. "Flush" indicates that the cache supplies the requested data; "Flush'" indicates it does so iff the base protocol prefers cache-cache 
transfers over memory-cache. When specified, S and T indicate signals on the "shared" and "threatened" bus lines; an overbar means "not 
signaled". 
ALoad instruction, and cleared in response to an ARelease, CAS- 
Commit, or Abort instruction (each of these requires an additional 
processor-cache message not shown in Figure 4). Invalidation or 
eviction of an Ax line aborts the current transaction. 
ALoads serve three related roles in RTM. First, every transaction
ALoads its own transaction descriptor (the word it will eventually
attempt to CAS-Commit). If any other transaction aborts it
(by CAS-ing its descriptor to aborted), the first transaction is guaranteed
to notice immediately. Second, every hardware transaction
ALoads the headers of objects it reads, so it will abort if a writer
acquires them. Third, a software transaction ALoads the header of
any object it is copying (AReleasing it immediately afterward), to
ensure the integrity of the copy. Note that a software transaction
never requires more than two ALoaded words at once, and we can
guarantee that these are never evicted from the cache.
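The life cycle of the abort-on-invalidate tag can be modeled with a set of tagged line addresses. The class below is a software sketch with hypothetical names; it captures only when a lost line must abort the current transaction, not how the abort is delivered.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

class AbortOnInvalidateTags {
public:
    // ALoad: tag the line holding a descriptor or object header.
    void aload(std::uintptr_t line) { tagged_.insert(line); }

    // ARelease: untag a single line (early release, or end of a clone copy).
    void arelease(std::uintptr_t line) { tagged_.erase(line); }

    // CAS-Commit and Abort both clear every tag.
    void end_transaction() { tagged_.clear(); }

    // Returns true if losing this line (by invalidation or eviction) must
    // abort the current transaction.
    bool on_line_lost(std::uintptr_t line) const {
        return tagged_.count(line) != 0;
    }

private:
    std::unordered_set<std::uintptr_t> tagged_;
};
```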
3.2.3 State tag encoding 
All told, a TMESI cache line can be in any of 12 different states: the
four MESI states (I, S, E, M), the five transactional states (TII, TSS, 
TEE, TMM, TMI), and the three abort-on-invalidate states (AS, AE, 
AM). For the sake of fast commits and aborts, we encode these in 
five bits, as shown in Table 2. 
Bit    Interpretation
T      Line is (1) / is not (0) transactional
A      Line is (1) / is not (0) abort-on-invalidate
MESI   2 bits: I (00), S (01), E (10), or M (11)
C/A    Most recent txn committed (1) or aborted (0)
M/I    Line is/was in TMM (1) or TMI (0)

T   A   MESI   C/A   M/I
0   0   00     -     -
0   0   11     0     0
0   0   01     -     -
0   0   10     -     -
0   0   11
0   0   11
1   0   00     -     -
1   0   01     -     -
1   0   10     -     -
1   0   11     -     0
1   0   11     -     1
0   1   01     -     -
0   1   10     -     -
0   1   11
0   1   11

Table 2. Tag array encoding, with the interpretation of each bit.






























At commit time, if the CAS in CAS-Commit succeeds, we first
broadcast a 1 on the C/A bit line, and use the T bits to conditionally
enable only the tags of transactional lines. Following this we flash-clear
the A and T bits. For TSS, TMM, TII, and TEE the flash clear
alone would suffice, but TMI lines must revert to M on commit and
I on abort. We use the C/A bit to distinguish between these: a line
is interpreted as being in state M if its MESI bits are 11 and either
C/A or M/I is set. On Aborts we broadcast 0 on the C/A bit line.
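The interpretation rule and the flash clear can be exercised in a small software model of the tag bits. The Tag struct and the decode and end_transaction functions are assumptions made for illustration; the model covers transactional lines and plain MESI lines, and does not attempt to model how the C/A and M/I bits are set for abort-on-invalidate lines.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Tag {
    bool t = false;         // transactional
    bool a = false;         // abort-on-invalidate
    std::uint8_t mesi = 0;  // I = 00, S = 01, E = 10, M = 11
    bool ca = false;        // most recent txn committed (1) or aborted (0)
    bool mi = false;        // line is/was in TMM (1) or TMI (0)
};

// Interprets the five tag bits as a state name.
inline std::string decode(const Tag& g) {
    const unsigned m = g.mesi & 3u;
    if (g.t) {
        static const char* const txn[] = {"TII", "TSS", "TEE"};
        if (m == 3) return g.mi ? "TMM" : "TMI";
        return txn[m];
    }
    if (g.a) {
        static const char* const aoi[] = {"invalid", "AS", "AE", "AM"};
        return aoi[m];
    }
    static const char* const plain[] = {"I", "S", "E"};
    if (m == 3) return (g.ca || g.mi) ? "M" : "I";
    return plain[m];
}

// Commit or abort: broadcast C/A to transactional lines only (gated by the
// T bit), then flash-clear the T and A bits of every line.
inline void end_transaction(std::vector<Tag>& tags, bool committed) {
    for (Tag& g : tags) {
        if (g.t) g.ca = committed;
        g.t = false;
        g.a = false;
    }
}
```

With this encoding a TMI line needs no per-line work at commit or abort: the same MESI bits read as M or as I depending only on the broadcast C/A value.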
3.3 Conflict Detection & Immediate Aborts 
Hardware TM systems typically checkpoint processor state at the 
beginning of a transaction. As soon as a conflict is noticed, the hard- 
ware restarts the losing transaction. Most hardware systems make 
conflicts visible as soon as possible; TCC delays detection until 
commit time. Software systems, by contrast, require that transac- 
tions validate their status explicitly, and restart themselves if they 
have lost a conflict. 
The overhead of validation, as we saw in Section 2.3, is one 
of the dominant costs of software TM. RTM avoids this overhead 
by ALoading object headers in hardware transactions. When a
writer modifies the header, all conflicting readers are aborted by 
a single (broadcast) BusRdX. In contrast to most hardware TM 
systems, this broadcast happens only at acquire time, not at the first 
transactional store, allowing flexible policy. 
If the processor is in user mode, delivery of the abort takes the
form of a spontaneous subroutine call, thereby avoiding kernel-user 
crossing overhead. The current program counter is pushed on the 
user stack, and control transfers to the address specified by the most 
recent SetHandler instruction. If either the stack pointer or the han- 
dler address is invalid, an exception occurs. If the processor is in 
kernel mode, delivery takes the form of an interrupt vectored in the 
usual way. If the processor is executing at interrupt level when an 
abort occurs, delivery is deferred until the return from the interrupt. 
Transactions may not be used from within interrupt handlers. Both 
kernel and user programs are allowed to execute hardware transac- 
tions, however, so long as those transactions complete before con- 
trol transfers to the other. The operating system is expected to abort 
any currently running user-level hardware transaction when trans- 
ferring from an interrupt handler into the top half of the kernel. 
Interrupts handled entirely in the bottom half (TLB refill, register 
window overflow) can safely coexist with user-level transactions. 
User transactions that take longer than a quantum to run will in- 
evitably execute in software. With simple statistics gathering, RTM 
can detect when this happens repeatedly, and skip the initial hard- 
ware attempt. 
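One simple form such statistics gathering might take is a per-site counter of consecutive hardware failures. The sketch below is hypothetical: the SiteStats name, the threshold value, and the reset-on-success rule are all assumptions, not details of the RTM runtime.

```cpp
#include <cassert>

// Hypothetical per-transaction-site statistics.
struct SiteStats {
    static constexpr int kSkipThreshold = 3;  // assumed tuning knob
    int consecutive_hw_failures = 0;

    // Skip the hardware attempt once it has failed repeatedly.
    bool should_try_hardware() const {
        return consecutive_hw_failures < kSkipThreshold;
    }

    // Record the result of a hardware attempt at this site.
    void record_hardware(bool committed) {
        consecutive_hw_failures = committed ? 0 : consecutive_hw_failures + 1;
    }
};
```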
Unfortunately, nothing guarantees that a software transaction 
will have all of its object headers in Ahaded  lines. Moreover soft- 
ware validation at the next open operation cannot ensure consis- 
tency: because hardware transactions modify data in place, objects 
are not immutable, and inconsistency can arise among words of the 
same object read at different times. The RTM software therefore 
makes every software transaction a visible reader, and arranges for 
it to A h a d  its own transaction descriptor. Writers (whether hard- 
ware or software) abort such readers at acquire time, one by one, 
by writing to their descriptors. In a similar vein, a software writer 
A h a d s  the header of any object it needs to clone, to make sure it 
will receive an immediate abort if a hardware transaction modifies 
the object in place during the cloning operation.' 
2 An immediate abort is not strictly necessary if the cloning operation is
simply a bit-wise copy; for this it suffices to double-check validity after
finishing the copy. In object-oriented languages, however, the user can
provide a class-specific clone method that will work correctly only if the
object remains internally consistent.
Because RTM detects conflicts based on access to object headers
only, correctness for hardware transactions does not require that
TII, TSS, TEE, or TMM lines remain in the cache. These can be
freely evicted and reloaded on demand. Memory always has an
up-to-date non-speculative copy of data, which it returns; lines in TMI
state do not respond to read or write requests from the bus, thereby
allowing readers from both hardware and software transactions to
work with the stable non-speculative copy. When choosing lines
for eviction, the cache preferentially retains TMI and ALoaded lines. If it
must evict one of these, it aborts the current transaction, which will
then retry in software. Other hardware schemes buffer both
transactional reads and writes, exerting much higher pressure on the cache.
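The footnote on cloning notes that a plain bit-wise copy needs no immediate abort, only a validity check after the copy. A minimal sketch of that copy-then-recheck pattern follows. The names are hypothetical, validity is modeled as an unchanged (writer, serial) pair in the header, and the `interfere` hook exists only to simulate a concurrent in-place writer in a single thread.

```python
class ObjectHeader:
    """Hypothetical stand-in for an object header plus its in-place data."""

    def __init__(self, data):
        self.writer = None      # most recent writer (opaque token here)
        self.serial = 0         # serial number of that writer
        self.data = list(data)  # the words of the object


def snapshot(header):
    return (header.writer, header.serial)


def clone_with_recheck(header, max_retries=10, interfere=None):
    """Copy word by word, then accept the copy only if the header is unchanged."""
    for attempt in range(max_retries):
        before = snapshot(header)
        clone = []
        for index, word in enumerate(header.data):
            clone.append(word)
            if interfere is not None:
                interfere(attempt, index, header)  # test hook, not part of the idea
        if snapshot(header) == before:
            return clone
        # The header changed during the copy: the clone may be torn, so retry.
    raise RuntimeError("object kept changing during clone")
```

A class-specific clone method cannot rely on this pattern, since it may misbehave on a torn object before the recheck runs; that is the case for which the text requires the immediate abort.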
3.4 Example
Figure 5 illustrates the interactions among three simple concurrent
transactions. Only the transactional instructions are shown.
Numbers indicate the order in which instructions occur. At the beginning
of each transaction, RTM software executes a SetHandler
instruction, initializes a transaction descriptor (in software), and ALoads
that descriptor. Though the open calls are not shown explicitly,
RTM software also executes an ALoad on each object header at
the time of the open and before the initial TLoad or TStore.
Let us assume that initially objects A and B are invalid in all
caches. At (1), transaction T1 performs a TLoad of object A. RTM
software will have ALoaded A's header into T1's cache in state
AE (since it is the only cached copy) at the time of the open. The
referenced line of A is then loaded in TEE. When the store happens
in T2 at (3), the line in TEE in T1 sees a BusRdX message and
drops to TII. The line remains valid, however, and T1 can continue
to use it until T2 acquires A (thereby aborting T1) or T1 itself
commits. Regardless of T1's outcome, the TII line must drop to
I to reflect the possibility that a transaction threatening that line
can subsequently commit.
At (2), T1 performs a TStore to object B. RTM loads B's header
in state AE at the time of the open, and B itself is loaded in TMI,
since the write is speculative. If T1 commits, the line will revert to
M, making the TStore's change permanent. If T1 aborts, the line
will revert to I, since the speculative value will at that point be
invalid.
At (4), transaction T3 performs a TLoad on object A. Since T2
holds the line in TMI, it asserts the T signal in response to T3's
BusRd message. This causes T3 to load the line in TII, giving it
access only until it commits or aborts (at which point it loses the
protection of software conflict detection). Prior to the TLoad, RTM
software will have ALoaded A's header into T3's cache during the
open, causing T2 to assert the S signal and to drop its own copy of
the header to AS. If T2 acquires A while T3 is active, its BusRdX
on A's header will cause an invalidation in T3's cache and thus an
immediate abort of T3.
Event (5) is similar to (4), and B is also loaded in TII.
We now consider the ordering of events E1, E2, and E3.
1. E1 happens before E2 and E3: When T1 acquires B's header,
it invalidates the line in T3's cache. This causes T3 to abort. T2,
however, can commit. When it retries, T3 will see the new value
of A from T1's commit.
2. E2 happens before E1 and E3: When T2 acquires A's header,
it aborts both T1 and T3.
3. E3 happens before E1 and E2: Since T3 is only a reader of
objects, and has not been invalidated by writer acquires, it
commits. T2 can similarly commit, if E1 happens before E2, since
T1 is a reader of A. Thus, the ordering E3, E1, E2 will allow all
three transactions to commit. TCC would also admit this
scenario, but none of the other schemes discussed in
Sections 1 or 2 would do so, because of eager conflict
detection. RTM enforces consistency with a single BusRdX per
object header. In contrast, TCC must broadcast all speculatively
modified lines at commit time.
Figure 5. Execution of Transactions. Top: interleaving of accesses in three transactions, with lazy acquire. Bottom: Cache tag arrays at
various event points. (OH(x) is used to indicate the header of object x.)
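The three orderings discussed in this example can be checked with a small simulation. It is a simplified model, not the TMESI protocol: each event Ei is treated as "Ti acquires the headers of everything it wrote, then commits", and an acquire aborts every other transaction that is still active and has opened the same object. The read and write sets are those of Figure 5; the function names are hypothetical.

```python
from itertools import permutations

# Read and write sets of the three transactions in the example.
READS = {"T1": {"A"}, "T2": set(), "T3": {"A", "B"}}
WRITES = {"T1": {"B"}, "T2": {"A"}, "T3": set()}


def outcomes(order):
    """Return the final status of each transaction for one commit order."""
    status = {txn: "active" for txn in READS}
    for txn in order:
        if status[txn] != "active":
            continue  # already aborted by an earlier acquire
        for obj in WRITES[txn]:
            for other in status:
                touched = READS[other] | WRITES[other]
                if other != txn and status[other] == "active" and obj in touched:
                    status[other] = "aborted"  # lazy acquire at commit time
        status[txn] = "committed"
    return status


def all_orderings():
    """Map every ordering of E1, E2, E3 to its outcome."""
    return {order: outcomes(order) for order in permutations(["T1", "T2", "T3"])}
```

Under this model only the order T3, T1, T2 (that is, E3, E1, E2) lets all three transactions commit, which agrees with case 3 above. Retries of aborted transactions are not modeled.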
4. RTM Software
In the previous section we presented the TMESI hardware, which
enables flexible policy making in software. With a few exceptions
related to the interaction of hardware and software transactions,
policy is set entirely in software, with hardware serving simply to
speed the common case.
Transactions that overflow hardware due to the size or
associativity of the cache are executed entirely in software, while
ensuring interoperability with concurrent hardware transactions.
Software transactions are essentially unbounded in space and time. In
the subsections below we first describe the metadata that allows
hardware and software transactions to share a common set of
objects, thereby combining fast execution in the common case with
unbounded space in the general case. We then describe mechanisms
used to ensure consistency when handling immediate aborts.
Finally, we present context-switching support for transactions with
unbounded time.
4.1 Transactions Unbounded in Space
The principal metadata employed by RTM are illustrated in
Figure 6. The object header has five main fields: a pointer to the most
recent writer transaction, a serial number, pointers to one or two
clones of the object, and a head pointer for a list of software
transactions currently reading the object. (The need for explicitly visible
software readers, explained in Section 3.3, is the principal policy
restriction imposed by RTM. Without such visibility [and immediate
aborts] we see no way to allow software transactions to interoperate
with hardware transactions that may modify objects in place.)
The least significant bit of the transaction pointer in the
object header is used to indicate whether the most recent writer was
a hardware or software transaction. If the writer was a software
transaction and it has committed, then the "new" object is current;
otherwise the "old" object is current (recall that hardware
transactions make updates in place). Writers acquire a header by updating
it atomically with a Wide-CAS instruction. To first approximation,
RTM object headers combine DSTM-style TMObject and Locator
fields [6].3
3 RSTM avoids the need for WCAS by moving much of an object's
metadata into the data object instance, rather than the header. In particular, it
arranges for the newer data object to point to the older [12]. We keep all
metadata in the header in RTM to minimize the need for ALoaded cache
lines.
Serial numbers allow RTM to avoid dynamic memory
management for transaction descriptors by reusing them. When starting
a new transaction, a thread increments the number in the
descriptor. When acquiring an object, it sets the number in the header to
match. If, at open time, a transaction finds mismatched numbers in
the object header and the descriptor to which it points, it interprets
it as if the header had pointed to a matching committed descriptor.
On abort, a thread must erase the pointers in any headers it has
acquired. As an adaptive performance optimization for read-intensive




[Figure 6 diagram. Labels: Txn-1 Descriptor, Txn-2 Descriptor, Object Header, Serial Number, Old Object, New Object, Software Txn Reader List.]
Figure 6. RTM metadata structure. On the left a hardware transaction is in the process of acquiring the object, overwriting the transaction
pointer and serial number fields. On the right a software transaction will also overwrite the New Object field. If a software transaction
acquires an object previously owned by a committed software transaction, it overwrites (Old Object, New Object) with (New Object, Clone).
Several software transactions can work concurrently on their own object clones prior to acquire time, just as hardware transactions can work
concurrently on copies buffered in their caches.
applications, a reader that finds a pointer to a committed descriptor
replaces it with a sentinel value that saves subsequent readers the
need to dereference the pointer.
For hardware transactions, the in-place update of objects and
reuse of transaction descriptors eliminate the need for dynamic
memory management within the TM runtime. Software
transactions, however, must still allocate and deallocate clones and
entries for explicit reader lists. For these purposes RTM employs a
lightweight, custom storage manager. In a software transaction,
acquisition installs a new data object in the "New Object" field, erases
the pointer to any data object O that was formerly in that field, and
reclaims the space for O. Immediate aborts preclude the use of
dangling references.
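The rules in this subsection for deciding which copy of an object is current can be sketched as follows. The field and function names are hypothetical, a Boolean stands in for the low bit of the transaction pointer, the atomic Wide-CAS is replaced by plain assignments, and reader lists and storage reclamation are omitted.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"


@dataclass
class Descriptor:
    status: str = ACTIVE
    serial: int = 0


@dataclass
class Header:
    writer: Optional[Descriptor] = None
    writer_is_software: bool = False  # models the least significant pointer bit
    serial: int = 0
    old: object = None
    new: object = None


def begin(desc):
    """Reuse a descriptor for a new transaction by bumping its serial number."""
    desc.serial += 1
    desc.status = ACTIVE


def writer_committed(header):
    if header.writer is None:
        return True
    if header.serial != header.writer.serial:
        return True  # descriptor was reused: treat as a matching committed one
    return header.writer.status == COMMITTED


def current_version(header):
    """The 'new' copy is current only after a committed software writer."""
    if header.writer_is_software and writer_committed(header):
        return header.new
    return header.old


def acquire_software(header, desc, clone):
    """Install a clone; a real implementation would do this with one Wide-CAS."""
    if header.writer_is_software and writer_committed(header):
        header.old, header.new = header.new, clone
    else:
        header.new = clone
    header.writer, header.writer_is_software = desc, True
    header.serial = desc.serial
```

The serial-number comparison is why an aborting thread must erase its pointers from headers it acquired: otherwise a later reuse of its descriptor would make the aborted clone look committed.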
4.2 Deferred Aborts
While aborts must be synchronous to avoid any possible data
inconsistency, there are times when they should not occur. Most
obviously, they need to be postponed whenever a transaction is
currently executing RTM system code (e.g., memory management)
that needs to run to completion. Within the RTM library,
code that should not be interrupted is bracketed with
BEGIN_NO_ABORT ... END_NO_ABORT macros. These function in a manner
reminiscent of the preemption avoidance mechanism of SymUnix [2]:
BEGIN_NO_ABORT increments a counter, inspected by the
standard abort handler installed by RTM. If an abort occurs when the
counter is positive, the handler sets a flag and returns.
END_NO_ABORT decrements the counter. If it reaches zero and the flag is set,
it clears the flag and reinvokes the handler.
Transactions may perform nontransactional operations for
logging, profiling, debugging, or similar purposes. Occasionally these
must be executed to completion (e.g., because they acquire and
release an I/O library lock). For this purpose, RTM makes
BEGIN_NO_ABORT and END_NO_ABORT available to user code.
4.3 Transactions Unbounded in Time
To permit transactions of unbounded duration, RTM must ensure
that software transactions survive a context switch, and that they be
aware on wakeup of any significant events that transpired while
they were asleep. Toward these ends, RTM requires that the
scheduler be aware of the location of each thread's transaction
descriptor, and that this descriptor contain, in addition to the information
shown in Figure 6, (1) an indication of whether the transaction is
running in hardware or in software, and (2) for software
transactions, the transaction pointer and serial number of any object
currently being cloned.
The scheduler performs the following actions.
1. To avoid confusing the state of multiple transactions, the
scheduler executes an Abort instruction on every context switch,
thereby clearing both T and A states out of the cache. A
software transaction can resume execution when rescheduled. A
hardware transaction, on the other hand, is aborted. The
scheduler modifies its state so that it will wake up in its abort handler
when rescheduled.
2. As previously noted, interoperability between hardware and
software transactions requires that a software transaction ALoad
its transaction descriptor, so it will notice immediately if
aborted by another transaction. When resuming a software
transaction, the scheduler re-ALoads the descriptor.
3. A software transaction may be aborted while it is asleep. At
preemption time the scheduler notes whether the transaction's
status is currently active. On wakeup it checks to see if this has
been changed to aborted. If so, it modifies the thread's state so
that it will wake up in its abort handler.
4. A software transaction must ALoad the header of any object it
is cloning. On wakeup the scheduler checks to see whether that
object (if any) is still valid (by comparing the current and saved
serial numbers and transaction pointers). If not, it arranges for
the thread to wake up in its handler. If so, it re-ALoads the
header.
These rules suffice to implement unbounded software transactions
that interoperate correctly with (bounded) hardware transactions.
5. Conclusions and Future Work
We have described a transactional memory system, RTM, that uses
hardware to accelerate transactions managed by a software
protocol. RTM is 100% source-compatible with the RSTM software TM
system, providing users with a gentle migration path from legacy
machines. We believe this style of hardware/software hybrid
constitutes the most promising path forward for transactional
programming models.
In contrast to previous transactional hardware protocols, RTM
1. requires only one new bus signal and no hardware consensus
protocol or extra traffic at commit time.
2. requires, for fast path operation, that only speculatively written
lines be buffered in the cache.
3. falls back to software on overflow, or at the direction of the
contention manager, thereby accommodating transactions of
effectively unlimited size and duration.
4. allows software transactions to interoperate with ongoing
hardware transactions.
5. supports immediate aborts of remote transactions, even if their
transactional state has overflowed the cache.
6. permits read-write and write-write sharing, when desired by the
software protocol.
7. permits "leaking" of information from inside aborted
transactions, for logging, profiling, debugging, and similar purposes.
8. performs contention management entirely in software, enabling
the use of adaptive and application-specific protocols.
We are currently nearing completion of an RTM implementation
using the GEMS SIMICS/SPARC-based simulation
infrastructure [14]. In future work, we plan to explore a variety of topics,
including other styles of RTM software (e.g., word-based);
hardware (e.g., directory-based protocols); nested transactions; gradual
fall-back to software, with ongoing use of whatever fits in cache;
context tags for simultaneous transactions in separate hardware
threads; and realistic real-world applications.
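As an illustration of the deferred-abort mechanism of Section 4.2, the counter-and-flag logic behind the BEGIN/END no-abort macros can be sketched as follows. The class and method names are hypothetical, and the sketch is single-threaded: in RTM the handler is entered by a hardware alert rather than by an ordinary call.

```python
class AbortDeferral:
    """Hypothetical per-thread state for postponing aborts."""

    def __init__(self, do_abort):
        self.do_abort = do_abort  # the real abort action (e.g. unwind and retry)
        self.depth = 0            # counter incremented by begin_no_abort
        self.pending = False      # flag set when an abort arrives while deferred

    def begin_no_abort(self):
        self.depth += 1

    def end_no_abort(self):
        self.depth -= 1
        if self.depth == 0 and self.pending:
            self.pending = False
            self.on_abort()       # reinvoke the standard handler

    def on_abort(self):
        """Standard abort handler: defer while the counter is positive."""
        if self.depth > 0:
            self.pending = True
            return
        self.do_abort()
```

Using a counter rather than a single Boolean lets protected regions nest, so library code that brackets itself can be called from user code that is already inside a protected region.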
References
[1] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and
S. Lie. Unbounded Transactional Memory. In Proc. of the 11th Intl.
Symp. on High Performance Computer Architecture, San Francisco,
CA, Feb. 2005.
[2] J. Edler, J. Lipkis, and E. Schonberg. Process Management for Highly
Parallel UNIX Systems. In Proc. of the USENIX Workshop on Unix
and Supercomputers, Pittsburgh, PA, Sept. 1988.
[3] K. Fraser and T. Harris. Concurrent Programming Without Locks.
Submitted for publication, 2004. Available as research.microsoft.com/
~tharris/drafts/cpwl-submission.pdf.
[4] R. Guerraoui, M. Herlihy, and B. Pochon. Polymorphic Contention
Management in SXM. In Proc. of the 19th Intl. Symp. on Distributed
Computing, Cracow, Poland, Sept. 2005.
[5] L. Hammond, V. Wong, M. Chen, B. Hertzberg, B. Carlstrom, M.
Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional
Memory Coherence and Consistency. In Proc. of the 31st Intl. Symp.
on Computer Architecture, München, Germany, June 2004.
[6] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software
Transactional Memory for Dynamic-sized Data Structures. In Proc. of
the 22nd ACM Symp. on Principles of Distributed Computing, Boston,
MA, July 2003.
[7] M. P. Herlihy and J. M. Wing. Linearizability: A Correctness
Condition for Concurrent Objects. ACM Trans. on Programming
Languages and Systems, 12(3):463-492, July 1990.
[8] M. Herlihy and J. E. Moss. Transactional Memory: Architectural
Support for Lock-Free Data Structures. In Proc. of the 20th
Intl. Symp. on Computer Architecture, San Diego, CA, May 1993.
Expanded version available as CRL 92/07, DEC Cambridge Research
Laboratory, Dec. 1992.
[9] Intel Corporation. Intel Itanium Architecture Software Developer's
Manual. Revision 2.2, Jan. 2006.
[10] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid
Transactional Memory. In Proc. of the 11th ACM Symp. on Principles
and Practice of Parallel Programming, New York, NY, Mar. 2006.
[11] V. J. Marathe, W. N. Scherer III, and M. L. Scott. Adaptive Software
Transactional Memory. In Proc. of the 19th Intl. Symp. on Distributed
Computing, Cracow, Poland, Sept. 2005.
[12] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. N.
Scherer III, and M. L. Scott. Lowering the Overhead of Software
Transactional Memory. In ACM SIGPLAN Workshop on Languages,
Compilers, and Hardware Support for Transactional Computing,
Ottawa, ON, Canada, July 2006. Held in conjunction with PLDI
2006. Expanded version available as TR 893, Dept. of Computer
Science, Univ. of Rochester, Mar. 2006.
[13] J. F. Martinez and J. Torrellas. Speculative Synchronization: Applying
Thread-Level Speculation to Explicitly Parallel Applications. In Proc.
of the 10th Intl. Conf. on Architectural Support for Programming
Languages and Operating Systems, San Jose, CA, Oct. 2002.
[14] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M.
Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood.
Multifacet's General Execution-driven Multiprocessor Simulator
(GEMS) Toolset. In ACM SIGARCH Computer Architecture News,
Sept. 2005.
[15] M. M. Michael. Scalable Lock-Free Dynamic Memory Allocation. In
Proc. of the SIGPLAN 2004 Conf. on Programming Language Design
and Implementation, Washington, DC, June 2004.
[16] M. M. Michael and M. L. Scott. Simple, Fast, and Practical
Non-Blocking and Blocking Concurrent Queue Algorithms. In Proc.
of the 15th ACM Symp. on Principles of Distributed Computing,
Philadelphia, PA, May 1996.
[17] M. Moir. Hybrid Transactional Memory. Unpublished manuscript,
Sun Microsystems Laboratories, Burlington, MA, July 2005.
[18] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood.
LogTM: Log-based Transactional Memory. In Proc. of the 12th Intl.
Symp. on High Performance Computer Architecture, Austin, TX, Feb.
2006.
[19] R. Rajwar and J. R. Goodman. Speculative Lock Elision: Enabling
Highly Concurrent Multithreaded Execution. In Proc. of the 34th Intl.
Symp. on Microarchitecture, Austin, TX, Dec. 2001.
[20] R. Rajwar and J. R. Goodman. Transactional Lock-Free Execution of
Lock-Based Programs. In Proc. of the 10th Intl. Conf. on Architectural
Support for Programming Languages and Operating Systems, San
Jose, CA, Oct. 2002.
[21] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing Transactional
Memory. In Proc. of the 32nd Intl. Symp. on Computer Architecture,
Madison, WI, June 2005.
[22] W. N. Scherer III and M. L. Scott. Contention Management in
Dynamic Software Transactional Memory. In Proc. of the ACM PODC
Workshop on Concurrency and Synchronization in Java Programs, St.
John's, NL, Canada, July 2004.
[23] W. N. Scherer III and M. L. Scott. Advanced Contention Management
for Dynamic Software Transactional Memory. In Proc. of the 24th
ACM Symp. on Principles of Distributed Computing, Las Vegas, NV,
July 2005.
[24] H. Sundell and P. Tsigas. NOBLE: A Non-Blocking Inter-Process
Communication Library. In Proc. of the 6th Workshop on Languages,
Compilers, and Run-time Systems for Scalable Computers,
Washington, DC, Mar. 2002. Also TR 2002-02, Chalmers Univ. of Technology
and Göteborg Univ., Göteborg, Sweden.
[25] R. K. Treiber. Systems Programming: Coping with Parallelism. RJ
5118, IBM Almaden Research Center, Apr. 1986.
Extending Hardware Transactional Memory to Support 
Non-busy Waiting and Non-transactional Actions 
Craig Zilles Lee Baugh 
Computer Science Department 
University of Illinois at Urbana-Champaign 
[zilles,leebaugh]@cs.uiuc.edu 
ABSTRACT 
Transactional Memory (TM) is a compelling alternative to
locks as a general-purpose concurrency control mechanism,
but it is yet unclear whether TM should be implemented as a
software or hardware construct. While hardware approaches
offer higher performance and can be used in conjunction with
legacy languages/code, software approaches are more flexible
and currently offer more functionality. In this paper, we try
to bridge, in part, the functionality gap between software and
hardware TMs by demonstrating how two software TM ideas
can be adapted to work in a hardware TM system.
Specifically, we demonstrate: 1) a process to efficiently support
transaction waiting - both intentional waiting and waiting
for a conflicting transaction to complete - by de-scheduling
the transacting thread, and 2) the concept of pausing and
an implementation of compensation to allow non-idempotent
system calls, I/O, and access to high contention data within
a long-running transaction. Both mechanisms can be
implemented with minimal extensions to an existing hardware TM
proposal.
1. INTRODUCTION 
While the industry-wide shift to multi-core processors
provides an effective way to exploit increasing transistor
density, it introduces a serious programming challenge into the
mainstream; even expert programmers find it difficult to
write reliable, high-performance parallel programs, with much
of this difficulty resulting from the available primitives for
managing concurrency. The problems with locks, presently
the dominant primitive for managing concurrency, are well
documented (e.g., [24]): they don't compose, they have a
possibility for deadlock, they rely on programmer
convention, and they represent a trade-off between simplicity and
concurrency.
Transactional Memory (TM) [1, 8, 9, 10, 11, ?, 18, 22]
has been identified as a promising alternative approach for
managing concurrency. TM addresses a number of the
problems with locks by providing an efficient implementation of
atomic blocks [15], code regions that must (appear to) not be
interleaved with other execution. Atomic blocks, or
transactions as the recent literature calls them, simplify
concurrent programming because, while the programmer must still
identify critical sections (where shared state is not
consistent), they need not be associated with any synchronization
variable. By using an optimistic approach to concurrency
(i.e., speculate independence and roll back on a conflict),
concurrency need only be limited by data dependences,
leading to even better performance than fine-grain locking in
some cases.
Since the introduction of Transactional Memory,
development of TM systems has gone in two distinct directions.
First, researchers have explored to what degree transactional
memory can be implemented efficiently without hardware
support. In this process, these software transactional
memory (STM) systems have been extended to support
additional software primitives, further increasing the power of
the programming model. Concurrently, research in
hardware transactional memory (HTM) has yielded approaches
that avoid exposing hardware implementation details (e.g.,
cache size, associativity) to the programmer, but generally
without extending the programming model.
In this paper, we show that a number of the extensions
developed in the context of STMs can be incorporated into
HTMs, and that doing so can be inexpensive, in that it does
not require significant extensions to existing HTM
proposals. We focus on the Virtual Transactional
Memory (VTM) proposal from Rajwar et al. [22]. We
provide background about VTM in Section 2, discussing its
salient features and how our implementation differs from the
original proposal.
We focus on incorporating two STM features. First, in
Section 3, we show how an HTM can cooperate with a
software thread scheduler to avoid having transactions
busy-wait for long periods of time. This has two applications: 1)
stalling one transaction while it waits for a conflicting
transaction to commit, and 2) using transactions to intentionally
wait on multiple variables, much in the manner of the Unix
system call select(). We find that the additional required
hardware support is limited to raising exceptions to transfer
control to software under certain transaction conflicts.
Second, we demonstrate how support for non-transactional
actions can be included within transactions (Section 4). This
too has two main applications: 1) avoiding contention
resulting from accessing frequently modified variables within
a long transaction, and 2) performing I/O or system calls in
the middle of transactions. The only required hardware
extension is the ability to pause a transaction without pausing
the thread's execution, which requires an additional mode
for transactions and two new primitives for pausing and
unpausing. With transactional pause in place, we
demonstrate how a non-idempotent system call, mmap(), can be
supported in a hardware transaction using a software-only
framework for compensating actions.
In Section 5, we discuss concurrent work to extend HTM's 
























Figure 1: Virtual Transactional Memory. a) transaction reaq 
transition diagram. 
2. VIRTUAL TRANSACTIONAL MEMORY 
While small transactions can be supported by the cache
and coherence protocol, large transactions require spilling
transaction state to memory. In particular, if we want
transactions to survive a context switch, we cannot rely on any
structures associated with a particular processor, including the
cache, coherence state, or per-processor in-memory data
structures. Rather, the bulk of the transaction state (the
read and write sets) must be held in (virtual) memory where
it can be observed by any potentially conflicting thread.
In VTM, transaction read and write sets are maintained
in a centralized data structure called the transactional
address data table (XADT), shown in Figure 1a. This data
structure is shared by all of the threads within an address
space; for the sake of performance isolation - the degree
to which the system can prevent the behavior of one
application from impacting the performance of others [27, 28]
- each virtual address space is allocated its own XADT.
Each entry in the XADT stores the address, control state
(valid, read/write), data, and a pointer to a transactional
status word (XSW). Each transacting thread has its own
XSW, which holds the transaction's current state. Because
the same XSW is pointed to by all of a transaction's XADT
entries, a transaction can be logically committed or aborted
with a single update to an XSW.
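The single-update commit described above can be illustrated with a minimal C sketch. The struct layout, field names, and the three-value status enum are assumptions made for illustration; they are not the XADT or XSW formats of VTM or of our simulated implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified status word: only three of the states. */
typedef enum { XSW_RUNNING, XSW_COMMITTED, XSW_ABORTED } xsw_sketch_t;

/* Hypothetical, simplified XADT entry (field names are assumptions). */
typedef struct {
    int           valid;     /* entry in use? */
    int           is_write;  /* read or write access */
    uintptr_t     addr;      /* address covered by the entry */
    uint64_t      data;      /* buffered data */
    xsw_sketch_t *xsw;       /* status word shared by all entries of one transaction */
} xadt_entry_sketch_t;

/* Logical commit: one store changes the status seen through every entry. */
void logical_commit(xsw_sketch_t *xsw) { *xsw = XSW_COMMITTED; }

/* Count the valid entries whose owning transaction has committed. */
size_t count_committed(const xadt_entry_sketch_t *xadt, size_t n) {
    size_t committed = 0;
    for (size_t i = 0; i < n; i++)
        if (xadt[i].valid && *xadt[i].xsw == XSW_COMMITTED)
            committed++;
    return committed;
}
```

Because every entry holds a pointer rather than a copy of the state, no entry has to be touched at the moment of the logical commit; the entries can be cleaned up afterwards.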
In VTM, a transaction can be in any of seven states, as
shown in Figure 1b. When a transaction begins, a transition
is made from non-transactional (NonT) to running,
active, local (RAL), where the transaction is held in cache
and abort/commit can be handled in hardware with a
transition back to NonT. When the transaction's footprint gets
too large, a transition is made to running, active, overflowed
(RAO). Upon this transition, the transaction must increment
the XADT's associated overflow count, which signals
to other potentially conflicting threads that they must probe
the XADT. In order to prevent unnecessary searches of the
XADT, VTM provides the transaction filter (XF), a counting
Bloom filter that can be checked prior to accessing the
XADT and that conservatively indicates when an XADT access
is unnecessary.
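As an illustration of how a counting Bloom filter can conservatively rule out XADT probes, consider the following C sketch. The filter size, the two hash functions, and the 64-byte line granularity are assumptions chosen for brevity, not parameters of VTM's XF.

```c
#include <assert.h>
#include <stdint.h>

#define XF_SLOTS 1024u   /* assumed filter size */

typedef struct { uint8_t count[XF_SLOTS]; } xf_sketch_t;

/* Two simple hashes over the line address (assuming 64-byte lines). */
unsigned xf_h1(uintptr_t a) { return (unsigned)((a >> 6) % XF_SLOTS); }
unsigned xf_h2(uintptr_t a) { return (unsigned)(((a >> 6) * 2654435761u) % XF_SLOTS); }

/* Called when an address is added to the XADT. */
void xf_insert(xf_sketch_t *f, uintptr_t a) {
    f->count[xf_h1(a)]++;
    f->count[xf_h2(a)]++;
}

/* Called when the entry is removed; counters (unlike bits) allow deletion. */
void xf_remove(xf_sketch_t *f, uintptr_t a) {
    f->count[xf_h1(a)]--;
    f->count[xf_h2(a)]--;
}

/* 0 means "definitely not in the XADT"; 1 means "must probe the XADT". */
int xf_may_contain(const xf_sketch_t *f, uintptr_t a) {
    return f->count[xf_h1(a)] != 0 && f->count[xf_h2(a)] != 0;
}
```

The filter may report false positives (an unnecessary probe) but never false negatives, which is the conservative behavior the text requires. Counter saturation is ignored in this sketch.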
From the RAO state, a transaction's XADT entries may
be marked as committed or aborted via transitions to
committed, active, overflowed (CAO) and aborted, active,
overflowed (BAO), respectively. When the physical commit/abort
has completed, by removing the related entries from the
XADT, the XSW can be transitioned back to NonT and the
overflow counter decremented. The physical commit/abort
can potentially be performed lazily - handling committed
and aborted XADT entries as they are encountered - and
in parallel with the thread's further execution (by allocating
the thread a new XSW).
If an interrupt, exception, or trap is encountered, a
running transaction (RAL, RAO) is transitioned to the running,
swapped, overflowed (RSO) state, where it no longer adds to
the transaction's read/write sets. If a transaction is aborted
while it is swapped out, it moves to the aborted, swapped,
overflowed (BSO) state, and the abort is handled when it is
swapped back in (the BAO state).
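The transitions described in the preceding paragraphs can be summarized as a small state machine. The sketch below is written in C; the event names are hypothetical, and the transition taken when a still-running swapped transaction is swapped back in (RSO to RAO) is an assumption, since the text above does not state it.

```c
#include <assert.h>

typedef enum { NONT, RAL, RAO, CAO, BAO, RSO, BSO } xstate_t;

/* Hypothetical event names for the triggers described in the text. */
typedef enum {
    EV_BEGIN, EV_COMMIT, EV_ABORT, EV_OVERFLOW,
    EV_CLEANUP_DONE,   /* physical commit/abort finished */
    EV_SWAP_OUT,       /* interrupt, exception, or trap */
    EV_SWAP_IN
} xevent_t;

/* Returns the next state; events with no listed transition leave it unchanged. */
xstate_t vtm_next(xstate_t s, xevent_t e) {
    switch (s) {
    case NONT:
        if (e == EV_BEGIN) return RAL;
        break;
    case RAL:
        if (e == EV_COMMIT || e == EV_ABORT) return NONT;
        if (e == EV_OVERFLOW) return RAO;
        if (e == EV_SWAP_OUT) return RSO;
        break;
    case RAO:
        if (e == EV_COMMIT) return CAO;
        if (e == EV_ABORT) return BAO;
        if (e == EV_SWAP_OUT) return RSO;
        break;
    case CAO:
    case BAO:
        if (e == EV_CLEANUP_DONE) return NONT;
        break;
    case RSO:
        if (e == EV_ABORT) return BSO;
        if (e == EV_SWAP_IN) return RAO;   /* assumption: not stated in the text */
        break;
    case BSO:
        if (e == EV_SWAP_IN) return BAO;
        break;
    }
    return s;
}
```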
2.1 Simulated Implementation 
Our variant of VTM was implemented through extensions 
to the x86 version of the Simics full-system simulator [16] 
and the Linux kernel, version 2.4.18. The primary differ- 
ence in our implementation from Rajwar et al.'s descrip- 
tion [22] is that, like LogTM [18], we use eager versioning: 
we allow transaction writes to speculatively update memory 
after logging the architected values. The VTM hardware 
was emulated by a Simics module that monitored memory 
traffic and could be controlled by software through new in- 
structions implemented using Simics' magic instruction, a 
nop (xchg %bx,%bx) recognized by the simulator. Although 
no performance results are included in this paper, we have 
subjected our implementation to torture tests meant to ex- 
pose unhandled race conditions, giving us some confidence 
that our implementation (and hence this text) addresses the 
salient issues. 
While VTM could be implemented as an almost entirely 
user-mode construct, doing so would rely on the existence of 
user-mode exception handling. Because x86 currently does 
not have a user-mode exception handling mechanism, our 
implementation uses the existing kernel-mode exceptions, 
and much of the software stack associated with VTM is im- 
plemented as part of the Linux kernel. Also, our VTM im- 
plementation uses locks in its implementation (so that it 
doesn't depend on itself), but its critical sections could ex- 
ploit a technique like speculative lock elision [21]. 
In keeping with the spirit of VTM, we wanted to mini- 
mally impact the execution of processes that are not using 
transaction support. To this end we add only two new reg- 
isters that must be set on a context switch, add less than 
100 bytes of process state, and add two instructions to the 
system call path. All other kernel modifications are only 
encountered by transacting processes. 
The VTM hardware/software interface is embodied by 
two main data structures, shown in Figure 2. The global 
typedef struct global_xact_state_s {
    int overflow_count;
    xadt_entry_t *xadt;
    /************* the following fields are software only ************/
    int next_transaction_num;      // for uniquely numbering LTSSs
    spinlock_t gtss_lock;          // guards the allocation of GTSS fields
    spinlock_t xact_waiter_lock;   // guards modification of waiter fields
} global_xact_state_t;

typedef struct local_xact_state_t {
    xsw_type_t xsw;
    int transaction_num;           // for resolving conflicts
    x86_reg_chkpt_t *reg_chkpt;
    comp_lists_t *comp_lists;      // discussed in Section 4
    /**** the following are software only fields, described in Section 3 ****/
    struct transaction_state_s *waiters;
    struct transaction_state_s *waiter_chain_prev;
    struct transaction_state_s *waiter_chain_next;
    struct task_struct *task_struct;
} local_xact_state_t;
Figure 2: Data structures for the global and local transactional state segments (GTSS and LTSS, respectively). 
transaction state segment (GTSS) holds the overflow count
and a pointer to the XADT. In addition, our kernel
allocates additional state for its own use (also discussed below).
The local transaction state segment (LTSS) holds the XSW,
a transaction priority for resolving conflicts, a pointer to
storage for a register checkpoint, and additional fields
discussed in Sections 3 and 4. The kernel allocates one GTSS
per address space (as part of mm_struct) and LTSSs on a
per thread (or, in Linux terminology, task) basis. Pointers
to these data structures are written into the two registers
(the GTSR and LTSR, respectively) on a context switch.
To meet our goal of minimally impacting non-transacting
processes, we delay allocation of data structures until they
are required. Specifically, large structures (e.g., the XADT)
and per thread structures (e.g., the LTSS) are allocated on
demand; if a thread tries to execute a transaction-begin and
its LTSR holds a NULL, the processor throws an exception
whose handler allocates the LTSS, as well as an XADT if
necessary. The gtss_lock is used to prevent a race condition
where multiple threads try to allocate XADTs. The only
structure not allocated on demand is the GTSS, because (in
our implementation) even threads that are not transacting
need to monitor the overflow_count field. By allocating the
GTSS at process creation time, we avoid having to notify
other threads (via interprocessor interrupt) that they need
to update their GTSR. Since the GTSS contains only a few
scalars and pointers, it results in a small per-process space
overhead.
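The on-demand allocation guarded by gtss_lock can be sketched in user-level C as follows. This is a simplified analogue under stated assumptions: a C11 atomic_flag stands in for the kernel spinlock, calloc stands in for the kernel allocator, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical user-level stand-in for the GTSS. */
typedef struct {
    int         overflow_count;
    void       *xadt;        /* NULL until a transaction first needs it */
    atomic_flag gtss_lock;   /* stand-in for the kernel spinlock */
} gtss_sketch_t;

void gtss_lock_acquire(gtss_sketch_t *g) {
    while (atomic_flag_test_and_set_explicit(&g->gtss_lock, memory_order_acquire)) {
        /* spin until the lock is released */
    }
}

void gtss_lock_release(gtss_sketch_t *g) {
    atomic_flag_clear_explicit(&g->gtss_lock, memory_order_release);
}

/* Allocate the XADT on first use; later callers get the same table. */
void *xadt_get_or_alloc(gtss_sketch_t *g, size_t bytes) {
    gtss_lock_acquire(g);
    if (g->xadt == NULL)
        g->xadt = calloc(1, bytes);
    void *table = g->xadt;
    gtss_lock_release(g);
    return table;
}
```

Checking for NULL only while holding the lock is what prevents two threads from each allocating an XADT for the same address space.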
For simplicity, all of the small structures (e.g., GTSS,
LTSS) are allocated to pinned memory (i.e., not swapped)
to avoid unnecessary page faults. For performance isolation
reasons, large structures (e.g., the XADT) are allocated in
the process's virtual memory address space. If executing
an instruction requires access to XADT data not present in
physical memory, the VTM hardware causes the processor
to raise a page fault. After servicing the page fault - we
made no modifications to the page fault handling code -
the operation can be retried.
3. DE-SCHEDULING TRANSACTIONS 
While VTM provides support for swapping out threads 
without aborting their running transactions (and continu- 
ing their execution on another processor), this support was 
intended to handle swapping that results from conventional 
system activity (e.g., timer interrupts). In this section, we 
discuss how the VTM system can coordinate with a software 
scheduler to support de-scheduling/re-scheduling processes 
based on VTM actions. We present two cases: first, we
demonstrate how a transaction conflict can be resolved by
de-scheduling one thread until the other thread's transaction
either commits or aborts. Second, we show how Harris
et al.'s intentional wait primitive retry can be implemented
in an HTM like VTM.
3.1 De-scheduling Threads on a Conflict 
A conflict does not necessitate aborting a transaction,
an observation made in previous transactional memory
systems [18, 20] and earlier in database research [23]. In
particular, the conflict is asymmetric: when two transactions
conflict, one of them (which we call T1) already owns the data
(i.e., it belongs to the transaction's memory footprint) and
the other transaction (T2) is requesting the data for a
conflicting access, as shown in Figure 3. By detecting conflicts
eagerly (i.e., when they occur rather than at transaction
commit time) we can prevent the conflict from taking place
by stalling transaction T2. For short-lived transactions,
stalling T2 briefly can allow T1 to commit (or abort), at
which point T2 can continue. If T1 does not commit/abort
quickly, we need to resolve the conflict. This conflict can be
resolved in many ways (e.g., [12]). If T2 is selected as the
"winner," then T1 must be aborted to allow T2 to proceed.
In contrast, if T1 "wins," T2 can either be aborted or further
stalled, provided the conflict resolution is repeatable so
as to avoid deadlock.
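One way to make conflict resolution repeatable is to compare a fixed per-transaction number. The following C sketch assumes an "older transaction wins" policy and a stall budget; both the policy and the names are illustrative assumptions, not the policy of VTM or of the work cited above.

```c
#include <assert.h>

typedef enum { ACT_STALL_REQUESTER, ACT_ABORT_OWNER } conflict_action_t;

/*
 * owner_num / requester_num: transaction numbers of T1 (owner) and T2
 * (requester); a smaller number is assumed to mean an older transaction.
 * stalled / limit: how long T2 has stalled and the budget before resolving.
 */
conflict_action_t resolve_conflict(int owner_num, int requester_num,
                                   unsigned stalled, unsigned limit) {
    if (stalled < limit)
        return ACT_STALL_REQUESTER;   /* give T1 a chance to finish first */
    if (requester_num < owner_num)
        return ACT_ABORT_OWNER;       /* T2 is older: T2 wins, T1 aborts */
    return ACT_STALL_REQUESTER;       /* T1 is older: T2 keeps waiting */
}
```

Because the outcome depends only on the two transaction numbers, repeating the decision always gives the same winner, and a younger transaction only ever waits on an older one, so no cycle of waiting transactions can form under this assumed policy.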
If T1 is a long running transaction, T2 may be stalled for
a significant time, unnecessarily occupying a processor core.
This situation corresponds to the case in a conventionally
synchronized critical section where a thread spins on a lock
for a long time. In this section, we demonstrate how our system
can be extended to allow such stalled transactions to be
de-scheduled until T1 commits/aborts, in much the same
way that a down on an unavailable semaphore de-schedules a
Figure 3: The asymmetric nature of transaction conflicts. Transaction T1 added the data item D to its memory footprint, then 
transaction T2 tried to access that data in a conflicting way. 
Figure 4: The responsibility for waking up de-scheduled processes is maintained by linking the LTSSs. Shaded fields
represent NULL pointers. Each LTSS includes a pointer to the task_struct for waking the thread.
thread. In the description that follows, we describe an oper- 
ating system-based implementation that uses the traditional 
x86 exception model. The same approach could be imple- 
mented completely in user-mode, with a user-mode thread 
scheduler and user-mode exceptions [25]. 
In order to de-schedule a thread on a transaction conflict,
we need to communicate a microarchitectural event up to
the operating system. We implement this communication
by having T2 raise an xact_wait exception, whose handler
marks T2 as not available for scheduling and calls the
scheduler. The only challenging aspect of the implementation is
ensuring that T2 is woken up when T1 commits or aborts.
For T1 to perform such a wakeup, it needs to know two
things: 1) that such a wakeup is required, and 2) who to
wake up. The first requirement is achieved by setting a bit
(XSW_EXCEPT) in T1's XSW to indicate that an xact_completion
exception should be raised when the transaction commits or
aborts. The second requirement is achieved by building a
(doubly-) linked list of waiters; we use the LTSSs (recall
Figure 2) as nodes to avoid having to allocate/deallocate
memory, as shown in Figure 4. We also include in the
LTSS a pointer to the thread's task_struct, which holds
the thread's scheduling state.
Code for the xact_wait exception handler is shown in
Figure 5; we used conventionally synchronized code, but this
would be an ideal use for a (bounded) kernel transaction.
As part of raising the exception, T2's processor writes the
address of T1's LTSS to a control register (cr2). A key
feature is our transferral of the responsibility of waking up T2
from itself to T1. In particular, we don't want to transfer
responsibility if T1 has already committed or aborted. By
doing a compare-and-swap on T1's XSW, we can know that
T1 was still running when we set the XSW_EXCEPT flag, and,
therefore, that responsibility has been transferred. Now,
T1 will raise an exception on commit/abort. The xact_completion
exception handler (not shown) acquires the same lock,
ensuring that it will find node T2 inserted in its waiter list.
The only remaining race condition is one that can result
from T1 committing and recycling its XSW for another
transaction between the conflict and the xact_wait exception
executing. This is not a problem in our implementation,
which only slowly recycles XSWs. If this were a problem, it
could be handled by either having the VTM unit monitor
T1's XSW (via the cache coherence protocol) or by using
sequence numbers, but space limitations preclude a detailed
discussion.
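The compare-and-swap handoff at the heart of this handler can be isolated in a few lines of portable C11. This is a user-level sketch, not kernel code: the bit values are assumptions, and C11 atomics stand in for the kernel's compare_and_swap primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bit assignments; the real XSW encoding is not given in the text. */
#define XSW_ABORTING   0x1u
#define XSW_COMMITTING 0x2u
#define XSW_EXCEPT     0x4u

/*
 * Try to make T1 responsible for waking the caller (T2).
 * Returns true if XSW_EXCEPT was set while T1 was still running,
 * false if T1 had already started to commit or abort.
 */
bool transfer_wakeup(_Atomic unsigned *t1_xsw) {
    unsigned old = atomic_load(t1_xsw);
    do {
        if (old & (XSW_ABORTING | XSW_COMMITTING))
            return false;   /* too late: T2 must not go to sleep */
        /* on failure, atomic_compare_exchange_weak reloads 'old' */
    } while (!atomic_compare_exchange_weak(t1_xsw, &old, old | XSW_EXCEPT));
    return true;
}
```

The flag is set and the "still running" check is made in one atomic step, so T1 cannot finish between the check and the update without the caller noticing.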
3.2 Implementing an Intentional Wait 
In their software TM for Haskell, Harris et al. propose a
particularly elegant primitive for waiting for events, called
retry [9]. The retry primitive enables waiting on multiple
conditions, much like the POSIX system call select or
Win32's WaitForMultipleObjects, but in a manner that
supports composition. Its use is demonstrated by the code
example in Figure 6, which selects a data item from the first
of a collection of work lists that has an available data item.
If all of the lists are empty, then the code reaches the retry
statement, which conceptually aborts the transaction and
restarts it at the beginning.
However, as Harris et al. rightly point out, "there is no
point to actually re-executing the transaction until at least
one of the variables read during the attempted transaction
is written by another thread." Because the locations read
have already been recorded in the transaction's read set, we
can put the transacting thread to sleep until a conflict is
detected with another executing thread.
Doing so in the context of our VTM implementation
requires a modest modification to the described system.
Specifically, two pieces of additional functionality are required:
1) a software primitive is required that allows a transaction
to communicate its desire to wait for a conflict, and 2)
when another thread aborts a transaction that is waiting,




















































asmlinkage void xact_wait_except(struct pt_regs *regs, long error_code) {
    // puts this thread to sleep waiting for T1 to abort or commit
    struct task_struct *tsk = current;   // get pointer to current task_struct
    xact_local_state_t *T1, *T2, *T3;
    xsw_state_t T1_xsw;
    __asm__("movl %%cr2,%0" : "=r" (T1));            // get ptr to winner's (T1) xact state
    T2 = tsk->thread.ltsr;                           // get ptr to our (T2) xact state
    tsk->state = TASK_UNINTERRUPTIBLE;               // deschedule this thread
    spin_lock(&tsk->mm->context.xact_waiter_lock);   // get per address-space lock
    do {
        if ((T1_xsw = T1->xsw) & (XSW_ABORTING|XSW_COMMITTING)) {   // already done
            spin_unlock(&tsk->mm->context.xact_waiter_lock);
            tsk->state = TASK_RUNNING;
            return;
        }
    } while (!compare_and_swap(&T1->xsw, T1_xsw, T1_xsw | XSW_EXCEPT));
    T3 = T1->waiters;
    T1->waiters = T2;                                // insert into doubly-linked list
    T2->waiter_chain_prev = T1;
    if (T3 != NULL) {
        T3->waiter_chain_prev = T2;
    }
    ...
}





Figure 5: Code for de-scheduling a thread on a transaction conflict. In this implementation, a per-address space spin lock is 
used to ensure the atomicity of transferring to T1 the responsibility for waking up T2. 
element *get_element_to_process() {
    TRANSACTION_BEGIN;
    for (int i = 0; i < NUM_LISTS; ++i) {
        if (list[i].has_element()) {
            element *e = list[i].get_element();
            TRANSACTION_END;
            return e;
        }
    }
    retry;
}
Figure 6: An illustrative example demonstrating the use
of retry. Retry enables simultaneously waiting on multiple
conditions (multiple lists in this case); conceptually, the transaction
is aborted and re-executed when the retry primitive is
encountered.
Our implementation provides the first primitive with an
instruction that raises a retry exception. In the exception
handler (not shown), the process is blocked, the transaction's
priority is set to a minimum value (so that it will
always be aborted when a conflict occurs), and its XSW is
marked with an XSW_RETRY bit indicating that a conflicting
thread is responsible for waking up this sleeping thread.
As above, a compare-and-swap is used to set this bit, so
the software knows that the XSW was not already marked
as aborted. If the transaction has already been aborted,
the thread is set back to state TASK_RUNNING and the
process returns from the exception. Otherwise the handler calls
schedule() to find an alternate thread to schedule on this
processor.
When a thread aborts a transaction with the XSW_RETRY
bit set, it completes the current instruction, copies the XSW
address of the aborted thread to a control register (cr2),
and raises a retry_wakeup exception. This exception handler
reads the task_struct field from the aborted transaction's
LTSS and wakes up the thread using try_to_wakeup.
Also, a potential race condition exists that requires adding
a check to the code in Figure 5 to verify that the transaction
is not waiting on a retrying transaction, before it calls
schedule().
4. PAUSING TRANSACTIONS TO MITIGATE 
CONSTRAINTS 
In the previous section, we discussed dealing with conflicts
efficiently. In this section, we consider how pausing a
transaction (without pausing the thread's execution) can be used
to avoid conflicts for data elements with high contention,
as well as allow actions with non-memory-like semantics to
be performed within transactions. While a transaction is
paused, its thread is allowed to perform any action, including
system calls and I/O, and its memory operations are
not added to the transaction's footprint. We begin this
section with an illustrative example and conclude with a
collection of dynamic memory allocator-based examples to
demonstrate the benefit and use of pausing transactions.
4.1 A Simple Example: Keeping Statistics 
In Figure 7a, we show a transaction that increments a 
global counter to maintain statistics. Such code can be 
problematic, because transactions that are otherwise inde- 
pendent may conflict on updates to this statistic. While 
[Figure 7 panels. a) transaction { ... ++statistic; ... }
b) xact_begin (try transaction); increment statistic atomically (using CAS); register compensation action;
on abort (perform compensation): decrement statistic atomically (using CAS); deallocate compensation data; xact_begin (retry transaction).
Legend: transactional, non-transactional.]
Figure 7: Incrementing statistics using pausing and compensation when precise intermediate value is not required. a) A 
"hot" statistic is incremented within a transaction, b) conflicts can be avoided by pausing before incrementing (using a compare-and-swap) 
the statistic and performing compensation if the transaction aborts. 
seemingly trivial, such statistics impact the scalability of 
existing hardware TMs [5]. The problem derives from the 
fact that the TM is providing a stronger degree of atomic- 
ity than the application requires: while the statistic's final 
value should be precise, an approximate value is generally 
sufficient while execution is in progress. 
We can exploit the reduced requirements for atomicity by
non-transactionally performing the increment from within
the transaction. Note that this is not an action automatically
performed by a compiler, but, rather, one performed
by a programmer to tune the performance of their code.
In Figure 7b, we sketch an implementation that pauses the
transaction before performing the counter update, so that
the counter is not added to the transaction's read or write
sets. To preserve the statistic's integrity, we also register a
compensation action - to be performed if the transaction
aborts - that decrements the counter. Such an implementation
achieves the application's desired behavior without
unnecessary conflicts between transactions. An alternative
implementation could just register an action to be performed
after the transaction commits that increments the counter.
In the next subsection, we describe the necessary
implementation mechanisms.
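The increment-then-compensate pattern can be mimicked in plain C11, with the transaction itself reduced to a boolean outcome. This is only a sketch of the counter handling: the function names are hypothetical, and no transactional hardware or pause instruction is modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add delta to the statistic with a compare-and-swap loop. */
void stat_add(atomic_long *stat, long delta) {
    long old = atomic_load(stat);
    while (!atomic_compare_exchange_weak(stat, &old, old + delta)) {
        /* 'old' was refreshed by the failed exchange; try again */
    }
}

/*
 * Stand-in for one transaction attempt. The increment happens in what
 * would be the paused region; if the attempt aborts, the registered
 * compensation (a decrement) is applied.
 */
void run_transaction_sketch(atomic_long *stat, bool commits) {
    stat_add(stat, +1);        /* paused region: non-transactional update */
    if (!commits)
        stat_add(stat, -1);    /* compensation performed on abort */
}
```

Between the increment and a possible compensation, other threads can observe a value that is temporarily one too high, which matches the observation that only the final value needs to be precise.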
4.2 Transaction Pause Implementation 
Hardware-wise, implementing the transaction pause is quite
straightforward; it is simply another bit that modifies the
XSW state. We add two new instructions, xact_pause and
xact_unpause, which set and clear this bit, respectively.
As previously noted, when a transaction is paused,
addresses loaded from or stored to are not added to the
transaction's read and write sets (i.e., no entries are added to the
XADT). Instead, concurrency must be managed using other
means (e.g., the use of compare-and-swap instructions to
update the statistic). Nevertheless, we check for conflicts with
transactions, just as if we were executing non-transactional
code. The one exception is that we should ignore conflicts
with the thread's own paused transaction. It is not
uncommon to want to pass arguments/return values between the
transaction and the paused region, and some of these may
be stored in memory. When the paused region writes a
memory location covered by the transaction's write set, the
write should not be undone if the transaction is aborted.
We would like just to remove
the written region from the transaction's write set, but the
granularity at which the write set is tracked may prevent
this. We have implemented this case by causing such stores
to write both to memory and the associated XADT entry,
so that the write is preserved on an abort. In many
respects, the semantics of performing writes in paused regions
resemble the previously proposed open commit [19]; while
pausing is, in some ways, a weaker primitive than open
commit (transaction semantics are not provided in the paused
region), in other ways it is more powerful (non-memory-like
actions can be performed). Furthermore, pause is simpler to
implement, because support for true nesting, which in turn
requires supporting multiple speculative versions for a given
data item, is not required.
Because the actions within a paused region will not be
rolled back if the transaction aborts, it may be necessary to
perform some form of compensation [6, 7, 13, 26] to
functionally undo the effects of a paused region. As such, we allow a
thread to register a data structure that includes pointers for
two linked lists (shown in Figure 8), one for actions to
perform upon an abort and another for actions to perform upon
a commit. Each list node includes a pointer to the next list
element, a function pointer to call in order to perform the
compensation, and an arbitrary amount of data¹ (for use by
and interpreted by the compensation function). If a
transaction aborts, it performs the actions in the abort_actions
list and discards the actions in the commit_actions list. On
a commit, it does the inverse. To ensure that it leaves all
data structures in a consistent state, as well as has a chance
to register any necessary compensation actions, we don't
handle an abort (i.e., restore the register checkpoint) while
a transaction is paused. Instead, the abort is handled when
the transaction is unpaused.
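The perform-one-list, discard-the-other behavior can be sketched in self-contained C. The type names loosely follow Figure 8 but are suffixed to mark them as illustrative; the counting action and the helper functions are hypothetical additions used only to make the sketch testable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct comp_action_sk comp_action_sk_t;
typedef void (*comp_function_sk_t)(comp_action_sk_t *ca, bool do_action);

struct comp_action_sk {
    comp_action_sk_t  *next;
    comp_function_sk_t comp_func;   /* performs and/or frees the action */
};

typedef struct {
    comp_action_sk_t *abort_actions;
    comp_action_sk_t *commit_actions;
} comp_lists_sk_t;

/* Walk a list; each callback either performs its action or just frees it. */
void run_list(comp_action_sk_t **head, bool do_action) {
    while (*head != NULL) {
        comp_action_sk_t *ca = *head;
        *head = ca->next;             /* unlink before the callback frees ca */
        ca->comp_func(ca, do_action);
    }
}

void on_commit(comp_lists_sk_t *l) {
    run_list(&l->commit_actions, true);
    run_list(&l->abort_actions, false);
}

void on_abort(comp_lists_sk_t *l) {
    run_list(&l->abort_actions, true);
    run_list(&l->commit_actions, false);
}

/* Hypothetical example action: bump a counter when performed. */
typedef struct { comp_action_sk_t base; int *counter; } count_action_t;

void count_func(comp_action_sk_t *ca, bool do_action) {
    count_action_t *a = (count_action_t *)ca;   /* base is the first member */
    if (do_action)
        (*a->counter)++;
    free(a);
}

void register_count_action(comp_action_sk_t **head, int *counter) {
    count_action_t *a = malloc(sizeof *a);
    if (a == NULL)
        return;                       /* sketch: silently skip on failure */
    a->base.comp_func = count_func;
    a->base.next = *head;
    a->counter = counter;
    *head = &a->base;
}
```

Passing do_action to the callback, rather than freeing nodes in the list walker, lets each action own the layout of its private data, which is the role the do_action argument plays in Figure 8.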
In the proposed implementation, compensating actions are
not performed atomically with the transaction. While we
have yet to identify a circumstance where this is
problematic, an alternative approach would enable the appearance
of atomicity by serializing commit. Logically, if we prevent
any other threads from executing during the execution of the
¹ To avoid any dependences on the context in which the compensation
action is performed, we require the programmer to encapsulate
any necessary context information into the compensation data.
































































typedef struct comp_action_s comp_action_t;
typedef void (*comp_function_t)(struct comp_action_s *ca, bool do_action);

typedef struct comp_lists_s {
    comp_action_t *abort_actions;
    comp_action_t *commit_actions;
} comp_lists_t;

struct comp_action_s {
    struct comp_action_s *next;
    comp_function_t comp_func;
    // data for compensation
};
Figure 8: An architecture for registering compensation actions. Each transaction maintains lists of actions to perform on a
commit and on an abort. The do_action argument of comp_function_t indicates whether the compensation should be performed or the
comp_action_t should just be deallocated.
compensation code, we provide atomicity while enabling
arbitrary non-memory operations in the compensation code.
The implementation need not be quite this strict, as other
transactions can be allowed to execute (but not commit)
until they attempt to access data touched by the committing
transaction; if the compensation code touches data from
another transaction, the other transaction must be aborted. If
strong atomicity [3] is desired, non-transactional execution
cannot proceed (as each instruction is logically a committing
transaction). Because such support for atomic
compensation constrains concurrency, it could be designed to
be invoked only when it was required.
From a software engineering perspective, it is desirable to
be able to write a single piece of code that can be called
both from within a transaction (where it registers
compensation actions) and from non-transactional code (where no
compensation is required). To this end, the xact_pause
instruction returns a value that encodes both: 1) whether a
transaction is running, and 2) whether the transaction was
already paused. By testing this value, the software can
determine whether compensating actions should be performed.
Furthermore, by passing this value to the corresponding
xact_unpause instruction, we can handle nested pause
regions (without the VTM hardware having to track the
nesting depth) by clearing the pause XSW bit only if it was set
by the corresponding xact_pause.²
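The way a returned token supports nested pause regions without a depth counter can be modeled in a few lines of C. The bit encoding and the function names below are assumptions for illustration; they model the behavior described above in software and are not the actual instruction semantics.

```c
#include <assert.h>

/* Assumed encoding of the two facts that the pause value carries. */
#define XSW_IN_XACT 0x1u   /* a transaction is running */
#define XSW_PAUSED  0x2u   /* the transaction is paused */

/* Model of xact_pause: returns the prior state, then sets the pause bit. */
unsigned xact_pause_sim(unsigned *xsw) {
    unsigned prev = *xsw & (XSW_IN_XACT | XSW_PAUSED);
    if (prev & XSW_IN_XACT)
        *xsw |= XSW_PAUSED;
    return prev;
}

/* Model of xact_unpause: clears the bit only if the matching pause set it. */
void xact_unpause_sim(unsigned *xsw, unsigned prev) {
    if ((prev & XSW_IN_XACT) && !(prev & XSW_PAUSED))
        *xsw &= ~XSW_PAUSED;
}

/* Test used by library code to decide whether to register compensation. */
int inside_a_transaction(unsigned prev) {
    return (prev & XSW_IN_XACT) != 0;
}
```

An inner pause sees the bit already set, so its matching unpause leaves the transaction paused; only the outermost unpause resumes it.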
Clearly, correctly writing paused regions with compensation
can be challenging, but they should not have to be
written by most programmers. Instead, functions of this
sort should generally be written by expert programmers
and provided as libraries, much like conventional locking
primitives and dynamic memory allocators. In the next
section, we demonstrate how a dynamic memory allocator can
be readily implemented using pause and compensation,
because programs generally do not rely on which memory is
allocated.
4.3 Pausing in Dynamic Memory Allocators 
Dynamic memory allocation is a staple of most modern
programs and, due to the modular nature of modern software,
likely to take place within large transactions. For this
discussion, we will concentrate on C/C++-style memory
allocation, but, as we will see, the motivation for pause goes
beyond these particular languages. While we demonstrate
the fundamental issues in a relatively simple malloc
implementation (Doug Lea's malloc, dlmalloc [14]), the same
issues are present even in advanced parallel memory
allocators (e.g., Hoard [2]).
² A similar idea could be used for xact_begin to support transaction
nesting without keeping a nesting depth count.
void *X, *Y, *Z = malloc(...);
transaction {
    X = malloc(...);
    free(Z);
    Y = malloc(...);
    free(X);
}
Figure 9: Example transaction that includes memory
allocation and deallocation.
Figure 9 shows a short code segment that illustrates
the three cases that we have to handle correctly: 1)
an allocation deallocated within the same transaction (X),
2) an allocation within a transaction that lives past commit
(Y), and 3) an existing allocation that is deallocated within a
transaction (Z). In executing this code (and code like it), we
want to ensure two things: 1) we don't want to leak memory
allocated within a transaction (even if an abort occurs), and
2) we want to free memory exactly once, and not irrevocably
so until the transaction commits. As will be seen, by
correctly handling cases 2 and 3, case 1 is handled as well.
Here, we consider two implementations of malloc: the
first, which is quite straightforward (and merely for illustration),
executes the whole malloc library non-transactionally; the
second uses pausing and compensation only to
deal with the non-idempotent system calls mmap and munmap.
In the first implementation, we construct new wrappers
for the functions malloc and free. The wrappers, which
comprise nearly the entire change to the library, are shown
in Figure 10. The malloc wrapper first pauses the
transaction, then (non-transactionally) performs the memory
allocation. Then, if the code was called from within the
transaction, it registers an abort action that will free the memory,
preventing a memory leak if the transaction gets aborted.
If the transaction succeeds, the abort_actions list will be
discarded.
The case of deallocation is complementary. When free
is called from within a transaction, we do not want to
irrevocably free the memory until the transaction commits.
As such, when executed inside a transaction, our wrapper
does nothing but register the requested deallocation in the
commit_actions list. If the transaction aborts, this list will
be discarded. Only when the transaction commits will the
deallocation actually be performed. Concurrent accesses to
the memory allocator are handled using the library's exist- 
void *malloc(size_t bytes) {
    void *ret_val;
    int pause_state = 0;
    XACT_PAUSE(pause_state);
    ret_val = malloc_internal(bytes);
    if (INSIDE_A_TRANSACTION(pause_state)) {   // if in a transaction, register compensating action
        comp_lists_t *comp_lists = NULL;
        XACT_COMP_DATA(comp_lists);   // get a pointer to the compensation lists
        free_comp_action_t *fca = (free_comp_action_t *)malloc_internal(sizeof(free_comp_action_t));
        fca->comp_function = free_comp_function;
        fca->ptr = ret_val;
        fca->next = comp_lists->abort_actions;
        ...
    }
    ...
}

void free(void *mem) {
    int pause_state = 0;
    XACT_PAUSE(pause_state);
    if (INSIDE_A_TRANSACTION(pause_state)) {   // if in a transaction, defer free until commit
        comp_lists_t *comp_lists = NULL;
        XACT_COMP_DATA(comp_lists);   // get a pointer to the compensation lists
        free_comp_action_t *fca = (free_comp_action_t *)malloc_internal(sizeof(free_comp_action_t));
        fca->comp_function = free_comp_function;
        fca->ptr = mem;
        fca->next = comp_lists->commit_actions;
        comp_lists->commit_actions = (comp_action_t *)fca;
    } else
        free_internal(mem);
    ...
}

typedef struct free_comp_action_s {
    ...
} free_comp_action_t;

void free_comp_function(comp_action_t *ca, int do_action) {
    if (do_action) {
        free_comp_action_t *fca = (free_comp_action_t *)ca;
        free_internal(fca->ptr);
    }
    free_internal(ca);
}
Figure 10: Wrappers for malloc and free that perform them non-transactionally. The original versions of malloc and
free have been renamed as malloc_internal and free_internal, respectively. When executed within a transaction, malloc registers a
compensation action that frees the allocated block in case of an abort, and free does nothing but register a commit action that actually
frees the memory. To register compensation actions, the transaction must dynamically allocate memory (note the use of malloc_internal)





et_v all _i t t ;
INSID _ _TRANSACTION(pau e_state)) II ra io i pensat t
p_list _ o p_lis ;
CT_COMP_DATA(comp_lists); II t i t pensat
ree_com _ t on_t fre _co _act _ all _ ernal(sizeof( ree_comp_action_t));
ca->co _f t on ree_com _f on;
c -> r t_v ;
ca-> p_list abort_acti s;








INSID _ _ RANSACTION(pause_s ate)) II ra io f e til it
p_list _ o p_lis ;
CT_COMP_DATA(comp_lists); II t i t pensat
ree_com _ on_t fre _co _act _ all _i rnal(sizeof(free_comp_action_t));
ca->co _f t on ree_com _f on;
c >
ca->n p_list > mit_acti ns;






yp ru ree_com _ ion_s
ru p_act _ ;
p_functi _ p_f ct on;
t ;
} ree_com _ t on_t;
re _comp_function(comp_action_t _act
d _ t o {
ree_com _ t on_t fre _co _act _ ) a;
re _ l ptr);
}
ree_ ern (c ;
}
rapper e e orm hem n-t t l y ri i al ersi s f e
ree enam e ~ n e n , t he t it i a io e i t
pensat on o re h lo f r , t i t ist r it t t t all
frees h or . i pensat on ion , ra io ust a icall t e or s f e_internal)
in n h t f pensat on o l i r ).
ing mutual exclusion primitives. 
An alternative implementation executes the bulk of the
memory allocator's code as part of the transaction. In the
common case, the transactional memory system ensures that
memory is not leaked: memory allocated/deallocated by an
aborting transaction is restored by undoing the transaction's
stores. Only when the allocator interacts with the kernel is
there potential for a problem, as kernel activity is not
included in the transaction for reasons of performance
isolation [28]. Instead, the VTM hardware sets the transaction
into a SWAPPED state during kernel execution, so system call
activity is not rolled back on an abort. While this is perhaps
not problematic for idempotent system calls like brk()
and getpid(), it is problematic for mmap(), which is not
idempotent.
dlmalloc uses mmap() to allocate very large chunks (>
256kB) and when sbrk() cannot allocate contiguous chunks.
When mmap() is called, the Linux kernel records the
allocation (in a vm_area_struct), in part to guarantee that it
doesn't allocate the memory again. If a transaction calling
mmap() aborts, the application will have no recollection of
the allocation, but the kernel will, resulting in a leak
of virtual address space³.
To prevent such a leak, we wrap the call to mmap() in a
paused region and register a compensation action to munmap()
the region if the transaction is aborted, much in the same
spirit as the malloc wrapper in Figure 10. Correspondingly,
calls to munmap() that are performed within transactions are
deferred until the transaction commits.
In general, this second approach is likely preferable,
because less effort has to be spent registering and disposing
of compensation actions. The primary drawback of this
approach is that conflicts will result if multiple transactions
try to allocate memory from the same pool, but this problem
can be largely mitigated by using a parallel memory
allocator (e.g., Hoard [2]) that provides per-thread pools of
free memory.
5. RELATED WORK 
Concurrently with this work, Carlstrom et al. proposed an
implementation of open nesting to handle high contention
and actions with non-memory-like semantics [17]. In many
respects, their implementation of abort/commit actions is
similar to ours, with one noteworthy exception: they
guarantee that the abort/commit handlers execute atomically
with the transaction by performing them during the commit
process and preventing other transactions from committing
simultaneously. While this programming abstraction is cleaner,
it can also serialize commit unnecessarily; for example,
atomicity is not required in our malloc example. The best of
both worlds may be to support both approaches and allow
programmers to make the simplicity/performance trade-off
themselves.
Also noteworthy, that work derides the notion of
a transactional pause primitive as "redundant and dangerous."
In contrast, we don't view the two primitives as mutually
exclusive, but rather as representing slightly different
trade-offs in software complexity and capability. While
open nesting provides a cleaner programming interface by
eliminating the lock-based concerns of paused regions, the
fact that both will require compensation code ensures that
neither will be written except by expert programmers.
Pausing, however, unlike open nesting, enables transactions to
contain code not written in transactions. We believe that it
is unlikely that transactions will completely replace locks for
reasons of performance isolation (especially with respect to
kernel execution [28]) as well as legacy code. In addition,
because composition of paused regions is handled in software,
we do not have to handle the complexity of supporting
arbitrary nesting in hardware, a topic not yet handled by the
literature for hardware support of open nested transactions.

³ To avoid errors of this sort in general, we've modified the Linux
kernel to kill unpaused transactions in the system_call()
interrupt vector.
The ATOMOS extensions to Java [4], work done
concurrently with our implementation, also provide an
implementation of retry. The major differences between the
implementations are two-fold: 1) the ATOMOS implementation
requires the programmer to explicitly identify the set
of values on which to wait using the "watch" primitive;
requiring explicit identification of the watch set presents
the possibility that a programmer will omit necessary items,
as well as a software maintenance headache, without a
clear need for the enabled selectivity; 2) the ATOMOS
implementation requires a processor to be dedicated to serve
as a thread scheduler, a requirement that seems to derive
from the fact that transactions cannot live across context
switches. In a machine with a conventional virtual memory
system, it seems likely that one scheduler processor would
be required for each virtual address space, and it is unclear
what happens if the composite watch set of many threads
exceeds the size of what can be supported directly by the
transaction hardware. In contrast, our implementation
supports waiting on the whole existing read set and requires no
dedicated processors due to VTM's existing support of
"unbounded" transactions that can survive context switches.
6. CONCLUSION 
With highly-concurrent machines prominently on the
mainstream roadmaps of every computer vendor, it is clear that
a program's degree of concurrency will be the primary
factor affecting its performance. This paper reflects our belief
that the power of transactional memory will not be in how
it performs on applications that have already been
parallelized, but in how it enables new applications to be
parallelized. In particular, many applications that have yet to be
parallelized have inherent parallelism, but not of a regular
sort that can be expressed with DOALL-type constructs.
Instead, the parallelism is unstructured - requiring significant
effort on the programmer's part to manage the concurrency
using traditional means - and exists in varying
granularities. The key goal of a transactional memory system should
be to allow the programmer to trivially express the existence
of this potential concurrency at its natural granularity.
A key component of this strategy is providing the
programmer with those primitives that facilitate the
expression of parallelism. While previous work on hardware
transactional memory has been shown to support the atomic
execution of arbitrarily sized regions of normal code, it has yet
to provide the richness of the interface provided by
software transactional memory systems. This paper attempts to
shrink the functionality gap between software transactional
memory systems and hardware ones, by demonstrating
how a hardware TM can interface with a software thread




















































cesses within a transaction memory system. Furthermore,
we show that functionally, these techniques represent small
extensions to existing proposals for hardware transactional
memory.
7. ACKNOWLEDGMENTS 
This research was supported in part by NSF CCR-0311340,
NSF CAREER award CCR-03047260, and a gift from the
Intel corporation. We thank Brian Greskamp, Pierre Salverda,
Naveen Neelakantam, Ravi Rajwar, and the anonymous
reviewers for feedback on this work.
8. REFERENCES 
[1] C. S. Ananian, K. Asanović, B. C. Kuszmaul, C. E.
Leiserson, and S. Lie. Unbounded Transactional Memory.
In Proceedings of the Eleventh IEEE Symposium on
High-Performance Computer Architecture, Feb. 2005.
[2] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R.
Wilson. Hoard: A Scalable Memory Allocator for
Multithreaded Applications. In Proceedings of the Ninth
International Conference on Architectural Support for
Programming Languages and Operating Systems, Nov.
2000.
[3] C. Blundell, E. C. Lewis, and M. M. Martin.
Deconstructing Transactional Semantics: The Subtleties of
Atomicity. In Proceedings of the Fourth Workshop on
Duplicating, Deconstructing, and Debunking, June 2005.
[4] B. D. Carlstrom, A. McDonald, H. Chafi, J. Chung, C. C.
Minh, C. Kozyrakis, and K. Olukotun. The ATOMOS
Transactional Programming Language. In Proceedings of
the SIGPLAN 2006 Conference on Programming Language
Design and Implementation, June 2006.
[5] C. Click. A Tour inside the Azul 384-way Java Appliance:
Tutorial held in conjunction with the Fourteenth
International Conference on Parallel Architectures and
Compilation Techniques (PACT), Sept. 2005.
[6] A. A. Farrag and M. T. Ozsu. Using semantic knowledge of
transactions to increase concurrency. ACM Transactions
on Database Systems, 14(4):503-525, 1989.
[7] H. Garcia-Molina. Using Semantic Knowledge for
Transaction Processing in Distributed Database. ACM
Transactions on Database Systems, 8(2):186-213, 1983.
[8] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D.
Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya,
C. Kozyrakis, and K. Olukotun. Transactional Memory
Coherence and Consistency. In Proceedings of the 31st
Annual International Symposium on Computer
Architecture, pages 102-113, June 2004.
[9] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy.
Composable Memory Transactions. In Principles and
Practice of Parallel Programming (PPOPP), 2005.
[10] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III.
Software Transactional Memory for Dynamic-Sized Data
Structures. In Proceedings of the Twenty-Second
Symposium on Principles of Distributed Computing
(PODC), 2003.
[11] M. Herlihy and J. E. B. Moss. Transactional Memory:
Architectural Support for Lock-Free Data Structures. In
Proceedings of the 20th Annual International Symposium
on Computer Architecture, pages 289-300, May 1993.
[12] W. N. Scherer III and M. L. Scott. Advanced Contention
Management for Dynamic Software Transactional Memory.
In Proceedings of the Twenty-Fourth Symposium on
Principles of Distributed Computing (PODC), 2005.
[13] H. F. Korth, E. Levy, and A. Silberschatz. A Formal
Approach to Recovery by Compensating Transactions. In
Proceedings of the 16th International Conference on Very
Large Data Bases, pages 95-106, 1990.
[14] D. Lea. A memory allocator,
http://gee.cs.oswego.edu/dl/html/malloc.html.
[15] D. Lomet. Process structuring, synchronization, and
recovery using atomic actions. In Proceedings of the ACM
Conference on Language Design for Reliable Software,
pages 128-137, Mar. 1977.
[16] P. S. Magnusson et al. Simics: A Full System Simulation
Platform. IEEE Computer, 35(2):50-58, Feb. 2002.
[17] A. McDonald, J. Chung, B. D. Carlstrom, C. C. Minh,
H. Chafi, C. Kozyrakis, and K. Olukotun. Architectural
Semantics for Practical Transactional Memory. In
Proceedings of the 33rd Annual International Symposium
on Computer Architecture, June 2006.
[18] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and
D. A. Wood. LogTM: Log-based Transactional Memory. In
Proceedings of the Twelfth IEEE Symposium on
High-Performance Computer Architecture, Feb. 2006.
[19] E. Moss and T. Hosking. Nested Transactional Memory:
Model and Preliminary Architecture Sketches. In
Proceedings of the workshop on Synchronization and
Concurrency in Object-Oriented Languages (SCOOL),
2005.
[20] R. Rajwar and J. R. Goodman. Transactional Lock-Free
Execution of Lock-Based Programs. In Proceedings of the
Tenth International Conference on Architectural Support
for Programming Languages and Operating Systems, Oct.
2000.
[21] R. Rajwar and J. R. Goodman. Speculative Lock Elision:
Enabling Highly Concurrent Multithreaded Execution. In
Proceedings of the 28th Annual International Symposium
on Computer Architecture, July 2001.
[22] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing
Transactional Memory. In Proceedings of the 32nd Annual
International Symposium on Computer Architecture, June
2005.
[23] D. J. Rosenkrantz, R. Stearns, and P. Lewis. System level
concurrency control for distributed database systems. ACM
Transactions on Database Systems, 3(2):178-198, June
1978.
[24] H. Sutter and J. Larus. Software and the Concurrency
Revolution. ACM Queue, 3(7):54-62, Sept. 2005.
[25] C. A. Thekkath and H. M. Levy. Hardware and Software
Support for Efficient Exception Handling. In Proceedings of
the Sixth International Conference on Architectural
Support for Programming Languages and Operating
Systems, Oct. 1994.
[26] S. Vaucouleur and P. Eugster. Atomic features. In
Proceedings of the workshop on Synchronization and
Concurrency in Object-Oriented Languages (SCOOL),
2005.
[27] B. Verghese, A. Gupta, and M. Rosenblum. Performance
Isolation: Sharing and Isolation in Shared-Memory
Multiprocessors. In Proceedings of the Eighth International
Conference on Architectural Support for Programming
Languages and Operating Systems, pages 181-192, Oct.
1998.
[28] C. Zilles and D. Flint. Challenges to Providing Performance
Isolation in Transactional Memories. In Proceedings of the
Fourth Workshop on Duplicating, Deconstructing, and
Debunking, pages 48-55, June 2005.
Session 3: Language Design, Specification, and Analysis

Transactional memory with data invariants 
Tim Harris Simon Peyton Jones 
Microsoft Research, Cambridge 
{tharris,simonpj}@microsoft.com
Abstract 
This paper introduces a mechanism for asserting invariants that are 
maintained by a program that uses atomic memory transactions. 
The idea is simple: a programmer writes check E where E is an 
expression that should be preserved by every atomic update for 
the remainder of the program's execution. We have extended STM 
Haskell to dynamically evaluate check statements atomically with 
the user's updates: the result is that we can identify precisely which 
update is the first one to break an invariant. 
1. Introduction 
Atomic blocks provide a promising simplification to the problem of
writing concurrent programs [9]. A code block is marked atomic
and the compiler and runtime system ensure that operations within
the block, including function calls, appear atomic. The programmer
no longer needs to worry about manual locking, low-level race
conditions or deadlocks. Atomic blocks are typically built using
software transactional memory (STM) which allows a series of
memory accesses made via the STM library to be performed atomically.
This approach is sometimes described as being "like A and I"
from ACID database transactions; that is, atomic blocks provide
atomicity and isolation, but do not deal explicitly with consistency
or durability. This paper attempts to include "C" as well, by
showing how to define dynamically-checked data invariants that must
hold when the system is in a consistent state. Specifically, we make
the following contributions:
  • We propose a simple but powerful new operation, check E,
    where E is an expression that must run without raising an
    exception after every transaction (Section 3). For example, given
    a predicate isSorted to test whether the data in a mutable list is
    sorted, an invariant check (assert (isSorted l1)) would
    cause an error to be issued if any atomic block attempts to
    commit with the list l1 unsorted. Furthermore, we can pinpoint
    exactly which atomic block attempted to violate the invariant.
  • Using atomic blocks provides us with a key benefit over
    existing work on dynamically-checked invariants: the boundaries
    of atomic blocks indicate precisely where invariants must hold.
    They may, and often must, be broken within transactions,
    something that causes trouble in other systems (Section 7).
  • Furthermore, the programmer has fine control over the
    granularity of invariant checking. She may specify coarse-grain
    invariants on large, global data structures, or fine-grain invariants
    on individual parts of those structures (e.g. Section 3.2).
  • A distinctive feature of our work is that we give a complete,
    precise (but still compact) operational semantics of check in
    Section 4, by extending our earlier semantics for STM Haskell.
    This semantics gives a precise answer to questions such as:
    what happens if the invariant updates the heap, loops, or blocks?
  • One might worry that, since invariants can be dynamically
    added but never deleted, the system will run slower and slower
    as more invariants are added. In Section 5 we show how to take
    advantage of the existing STM transaction logging mechanism
    to ensure that (i) invariants are only checked when a variable
    read by the invariant is written by a transaction, and (ii)
    invariants are garbage-collected entirely when the data structures they
    watch are dead. These properties are the key to scalability.
  • In Section 6 we show how the operations supported by our
    invariants can be extended to express conditions relating pairs
    of program states ("XYZ is never decreased"), rather than just
    inspecting the current state ("XYZ is never zero").
The idea of combining data invariants with transactions is not new
- indeed, the POSTQUEL query language from 1986 included a
similar command that could be used to describe kinds of transaction
that could not be committed against a database [24]. Section 7
discusses related work in that field, along with other work on
incorporating invariants into programming languages.
We present our design in the context of STM Haskell [10]
because this setting allows us to bring out the key issues in
particularly crisp form. Everything we describe is fully implemented in
the Glasgow Haskell Compiler, GHC, and will shortly be publicly
available at the GHC home page. However, we believe that the
ideas of the paper could readily be applied in other languages, as
we discuss in Section 8.
2. Background: STM Haskell 
Our prototype is based on STM Haskell [10], summarized in
Figure 1. In this section we briefly review the language for the benefit
of readers not already familiar with it.
STM Haskell is itself built on Concurrent Haskell [20], which
extends Haskell 98, a pure, lazy, functional programming language.
It provides explicitly-forked threads, and abstractions for
communicating between them. These constructs naturally involve side
effects, which are accommodated in the otherwise-pure language by a
mechanism called monads [25]. The key idea is this: a value of
type IO a is an "I/O action" that, when performed, may do some
input/output before yielding a value of type a. For example, the
functions putChar and getChar have types:
    putChar :: Char -> IO ()
    getChar :: IO Char
That is, putChar takes a Char and delivers an I/O action that,
when performed, prints the character on the standard output; while
getChar is an action that, when performed, reads a character from
the console and delivers it as the result of the action. A complete
program must define an I/O action called main; executing the
program means performing that action.

























































For example:
    main :: IO ()
    main = putChar 'x'
I/O actions can be glued together by a monadic bind combinator.
This is normally used through some syntactic sugar, allowing a
C-like syntax. Here, for example, is a complete program that reads a
character and then prints it twice:
    main = do { c <- getChar; putChar c; putChar c }
Threads in STM Haskell communicate by reading and writing
transactional variables, or TVars. The operations on TVars are as
follows:
    data TVar a
    newTVar   :: a -> STM (TVar a)
    readTVar  :: TVar a -> STM a
    writeTVar :: TVar a -> a -> STM ()
All these operations make use of the STM monad, which supports
a carefully-designed set of transactional operations, including
allocating, reading and writing transactional variables. The readTVar
and writeTVar operations both return STM actions, but Haskell
allows us to use the same do {...} syntax to compose STM
actions as we did for I/O actions. These STM actions remain tentative
during their execution: in order to expose an STM action to the rest
of the system, it can be passed to a function atomic, with type:
    atomic :: STM a -> IO a
It takes a memory transaction, of type STM a, and delivers an I/O
action that, when performed, runs the transaction atomically with
respect to all other memory transactions. One might say:
    main = do { ...; atomic (writeTVar r 3); ... }
Operationally, atomic takes the tentative updates and actually
applies them to the TVars involved, thereby making these effects
visible to other transactions. The atomic function and all of the
STM-typed operations are built over the software transactional
memory, which maintains a per-thread transaction log that
records the tentative accesses made to TVars. When atomic is
invoked the STM checks that the logged accesses are valid - i.e. no
concurrent transaction has committed conflicting updates. If the log
is valid then the STM commits it atomically to the heap. Otherwise
the memory transaction is re-executed with a fresh log.
Splitting the world into STM actions and I/O actions provides
two valuable guarantees: (i) only STM actions and pure
computation can be performed inside a memory transaction; in particular
I/O actions cannot; (ii) no STM actions can be performed outside a
transaction, so the programmer cannot accidentally read or write a
TVar without the protection of atomic. Of course, one can always
write atomic (readTVar v) to read a TVar in a trivial
transaction, but the call to atomic cannot be omitted.
As an example, this procedure atomically increments a TVar:
    incT :: TVar Int -> IO ()
    incT v = atomic (do x <- readTVar v
                        writeTVar v (x+1))
The implementation guarantees that the body of a call to atomic
runs atomically with respect to every other thread; for example,
there is no possibility that another thread can appear to read v
between the readTVar and writeTVar of incT.

    -- The STM monad itself
    data STM a
    instance Monad STM

    -- Exceptions
    throw :: Exception -> STM a
    catch :: STM a -> (Exception -> STM a) -> STM a

    -- Running STM computations
    atomic :: STM a -> IO a
    retry  :: STM a
    orElse :: STM a -> STM a -> STM a

    -- Transactional variables
    data TVar a
    newTVar   :: a -> STM (TVar a)
    readTVar  :: TVar a -> STM a
    writeTVar :: TVar a -> a -> STM ()
Figure 1. The language level interface to transactional memory in
STM Haskell

Although less relevant to our current paper, STM Haskell also
provides facilities for composable blocking. The first construct is a
retry operation:
    retry :: STM a
The semantics of retry is to abort the current atomic transaction,
and re-run it after one of the transactional variables it read from
has been updated. For example, here is a procedure decT that
decrements a TVar, but blocks if the variable is already zero:
    dec :: TVar Int -> STM ()
    dec v = do x <- readTVar v
               if x == 0
                 then retry
                 else writeTVar v (x-1)
    decT :: TVar Int -> IO ()
    decT v = atomic (dec v)
Finally, the infix orElse function allows two transactions to be
tried in sequence: (s1 `orElse` s2) first attempts s1; if that
calls retry, then s2 is tried instead; if that retries as well, then
the entire call to orElse retries. For example, this procedure will
decrement v1 unless v1 is already zero, in which case it will
decrement v2 instead. If both are zero, the thread will block:
    decPair :: TVar Int -> TVar Int -> IO ()
    decPair v1 v2 = atomic (dec v1 `orElse` dec v2)
In addition, the STM code needs no modifications at all to be
robust to exceptions. The semantics of atomic is that if the
transaction fails with an exception, then no globally visible state change
whatsoever is made.
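The exception behaviour just described can be illustrated with a small sketch. It assumes the `stm` library shipped with GHC today, where the paper's atomic is spelled `atomically` and exceptions are raised inside a transaction with `throwSTM`; the names `Overdrawn`, `transfer` and `demoException` are hypothetical and exist only for this illustration.

```haskell
import Control.Concurrent.STM
  (TVar, atomically, modifyTVar', newTVarIO, readTVar, readTVarIO, throwSTM)
import Control.Exception (Exception, try)
import Control.Monad (when)

-- Hypothetical exception type used only in this sketch.
data Overdrawn = Overdrawn deriving Show
instance Exception Overdrawn

-- Moves n units between two accounts, then rejects the transaction
-- if the source balance has become negative.
transfer :: Int -> TVar Int -> TVar Int -> IO (Either Overdrawn ())
transfer n from to = try (atomically body)
  where
    body = do
      modifyTVar' from (subtract n)   -- tentative update
      modifyTVar' to (+ n)            -- tentative update
      balance <- readTVar from
      when (balance < 0) (throwSTM Overdrawn)

-- Attempts an oversized transfer and reports whether it failed,
-- together with the balances observed afterwards.
demoException :: IO (Bool, Int, Int)
demoException = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  r <- transfer 150 a b
  x <- readTVarIO a
  y <- readTVarIO b
  return (either (const True) (const False) r, x, y)
```

Both tentative writes are discarded when the exception escapes the atomic block, so the balances read afterwards are the original ones.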
Note that since our original paper on STM Haskell [10], we
realized that the type 'STM a' might, more clearly, be called
'Atomic a' and that the function atomic could be renamed
'perform'. The new names would make it clearer that operations
such as readTVar and writeTVar are individual atomic actions
that are combined monadically to form larger compound atomic
actions, and also that perform is used only when actually
making such a compound action visible to concurrent threads (rather
than being necessary at every level when calling one transactional
function from another). For consistency we are sticking with the
published names, but mention the alternatives in case they help
readers unfamiliar with the language.
3. The main idea 





























































    check :: STM a -> STM ()
Informally, check takes an STM computation that tests an invariant
and, in addition, adds it to a global set of such invariants. At the end
of every user transaction, every invariant in the global set must be
satisfied if the user transaction is to be allowed to commit. If any
invariant fails, indicated by throwing an exception, then the user
transaction is rolled back and the exception propagates.
Since invariant checks are run repeatedly, and in an unspecified
order, it is clearly desirable that they do not perform side effects or
input/output. Our design partly offers this guarantee by
construction: since the argument to check is an STM computation, the type
system guarantees that it performs no input/output. Of course, as
an STM computation, it can call writeTVar to attempt to update
transactional memory - or, indeed, it can attempt any of the other
actions in the STM monad. To avoid this kind of side-effect we use a
fresh nested transaction to check each invariant and then roll back
this transaction whether or not the invariant succeeds. We give a
fully-precise specification in Section 4, but first we discuss our
design informally in the rest of this section.
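The "run the invariant, then roll it back regardless" idea can be approximated in today's `stm` library, as a sketch only and not the implementation described in this paper. It relies on the documented behaviour of `catchSTM`, which discards the writes of the action it guards when that action throws. The names `Discard`, `runDiscarding`, `nosyInvariant` and `demoDiscard` are hypothetical.

```haskell
import Control.Concurrent.STM
  (STM, TVar, atomically, catchSTM, newTVarIO, readTVar, readTVarIO, throwSTM, writeTVar)
import Control.Exception (Exception, SomeException, fromException)

-- Hypothetical marker exception, thrown only to force a rollback.
data Discard = Discard deriving Show
instance Exception Discard

-- Runs an invariant and then discards whatever it wrote. Any other
-- exception (a genuine invariant failure) is re-thrown to the caller.
runDiscarding :: STM a -> STM ()
runDiscarding inv = (inv >> throwSTM Discard) `catchSTM` handler
  where
    handler :: SomeException -> STM ()
    handler e = case fromException e of
      Just Discard -> return ()   -- invariant held; its writes are gone
      Nothing      -> throwSTM e  -- invariant failed; propagate

-- An ill-behaved invariant that writes to the variable it inspects.
nosyInvariant :: TVar Int -> STM ()
nosyInvariant tv = do
  v <- readTVar tv
  writeTVar tv (v + 1000)

-- The user's write (2) survives; the invariant's write does not.
demoDiscard :: IO Int
demoDiscard = do
  tv <- newTVarIO 1
  atomically (writeTVar tv 2 >> runDiscarding (nosyInvariant tv))
  readTVarIO tv
```

Unlike the paper's check, this helper has to be called explicitly inside each transaction; it only demonstrates that an invariant's own updates need not leak into the committed state.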
In this section we introduce a number of examples showing
how invariants can be defined. In many of our examples we use
simple data structures built from TVars holding integer values. In
Haskell, as in other languages, these examples could be written
more generally to act across multiple types; we stick to integers
for simplicity rather than due to limitations in the design or the
implementation. For simplicity we also stick with straightforward
imperative data structures.
3.1 Example 1: range-limited TVars
Consider the following example in which the type LimitedTVar
holds a range-limited integer value. The function newLimitedTVar
constructs a LimitedTVar with a specified limit. incLimitedTVar
attempts to increment the value:
    type LimitedTVar = TVar Int

    newLimitedTVar :: Int -> STM LimitedTVar
    newLimitedTVar lim =
      do { tv <- newTVar 0
         ; check (do { val <- readTVar tv
                     ; assert (val <= lim) })
         ; return tv }

    incLimitedTVar :: Int -> LimitedTVar -> STM ()
    incLimitedTVar delta tv
      = do { val <- readTVar tv
           ; writeTVar tv (val+delta) }
A key point is that the invariant is associated with the creation of 
the LimitedTVar, and not with its (perhaps diverse) Lues. A pro- 
grammer therefore can be confident that every LimitedTVar will 
always obey its invariant, rather than wondering whether perhaps 
one errant use has fallen though the net. The second key point is 
that the invariant is checked only at the end of (every) transaction; 
the invariant may temporarily be broken during a transaction. For 
example, a particular transaction may increase the variable beyond 
its limit provided that the same transaction decreases it again before 
the transaction ends. It is not useful, for example, to test the invari- 
ant every time the variable is written. Finally, it is worth noting that 
the invariant is a first-class closure; for instance it has a free variable 
lim that is not recorded in the LimitedTVar data structure at 
all. 
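The end-of-transaction rule can be tried out with the stock stm library, which has no invariant primitive of the kind described here. The following is only a sketch under that assumption: atomicWith is a hypothetical helper that takes the invariants explicitly (rather than registering them as check does), runs the body, and tests each invariant once before commit.

```haskell
import Control.Concurrent.STM
import Control.Exception (ErrorCall (..), try)

-- Hypothetical helper, not part of any library: run a transaction body and
-- then test each invariant once, just before the transaction commits.
atomicWith :: [STM Bool] -> STM a -> IO a
atomicWith invariants body = atomically $ do
  result <- body
  oks <- sequence invariants
  if and oks
    then return result
    else throwSTM (ErrorCall "invariant violated") -- aborts and rolls back body

demo :: IO (Either ErrorCall (), Either ErrorCall (), Int)
demo = do
  tv <- newTVarIO 0
  let limited = (<= 10) <$> readTVar tv
  -- Exceeds the limit mid-transaction, but is back in range at the end.
  r1 <- try (atomicWith [limited]
               (modifyTVar' tv (+ 20) >> modifyTVar' tv (subtract 15)))
  -- Still over the limit at the end, so the whole transaction is rejected.
  r2 <- try (atomicWith [limited] (modifyTVar' tv (+ 20)))
  v <- readTVarIO tv
  return (r1, r2, v)

main :: IO ()
main = demo >>= print
```

The first transaction commits even though the value passes through 20; the second is rolled back, so the TVar keeps the value left by the first.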
An invariant may of course describe a relationship between 
mutable variables. For example, a limited TVar with a mutable limit 
might be described thus: 
    data LimitedTVarM
      = LTV { val :: TVar Int, limit :: TVar Int }
Now the invariant-check would read both the val and limit 
TVars, and compare them, failing if they do not stand in the desired 
relationship. 
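A sketch of such an invariant body, written against the stock stm library: withinLimit is a hypothetical name, and the test returns a Bool rather than calling assert so that it can be run without the check primitive.

```haskell
import Control.Concurrent.STM

data LimitedTVarM = LTV { val :: TVar Int, limit :: TVar Int }

-- Invariant body: reads both TVars and compares them.
withinLimit :: LimitedTVarM -> STM Bool
withinLimit ltv = do
  v <- readTVar (val ltv)
  l <- readTVar (limit ltv)
  return (v <= l)

demo :: IO (Bool, Bool)
demo = do
  ltv <- LTV <$> newTVarIO 3 <*> newTVarIO 5
  before <- atomically (withinLimit ltv)
  -- Lowering the limit below the current value breaks the relationship,
  -- even though `val` itself was not touched.
  after <- atomically (writeTVar (limit ltv) 2 >> withinLimit ltv)
  return (before, after)

main :: IO ()
main = demo >>= print
```

Because the invariant reads both TVars, an update to either one is enough to make it worth re-checking.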
3.2 Example 2: a sorted list 
Our second example illustrates the trade-offs involved in express- 
ing the same invariant in different ways. Consider the following 
definition of a singly linked list of integers: 
    data ListNode
      = ListNode { val :: TVar Int,
                   next :: TVar (Maybe ListNode) }
Each ListNode holds a TVar Int which we will call the node's 
value, and a reference to a Maybe ListNode which we will call 
the next node. In Haskell, the type Maybe ListNode is essentially 
a nullable reference to a ListNode - its value is either Nothing 
(null), or Just n (a reference to n). A Nothing next node 
indicates the end of the list. 
If a list is to be held in sorted order then, informally, an in- 
variant for all nodes could be "the next node is either null, or the 
next node's value is larger than this node's value". This could be 
expressed as: 
    validNode :: ListNode -> STM ()
    -- Throws exception for invalid node
    validNode (ListNode { val = v_val, next = v_next })
      = do { next_node <- readTVar v_next
           ; case next_node of -- C1
               Nothing -> return ()
               Just (ListNode { val = v_next_val }) ->
                 do { this_val <- readTVar v_val
                    ; next_val <- readTVar v_next_val
                    ; assert (this_val <= next_val) }
           }
Case statement C1 examines the contents of v_next: if it holds 
Nothing, then the invariant holds and we simply return; otherwise, 
the value fields of the two nodes are read and compared. 
As with the first example, we could integrate this invariant with 
a function that constructs list nodes: 
    newListNode :: Int -> STM ListNode
    newListNode val
      = do { v_val <- newTVar val
           ; v_next <- newTVar Nothing
           ; let result = ListNode { val = v_val,
                                     next = v_next }
           ; check (validNode result)
           ; return result }
This approach is effective if all ListNodes should occur in sorted 
lists. But perhaps some lists are sorted, and some are not - what 
then? In such cases the invariant could perhaps be expressed better 
as a property of a larger data structure: 
    validList :: ListNode -> STM ()
    validList ln@(ListNode { next = v_next })
      = do { r <- validNode ln -- Check first node
           ; next_val <- readTVar v_next
           ; case next_val of
               Nothing -> return ()
               Just ln' -> validList ln' }
The code instantiating the list can now assert that validList is 
always true, rather than expressing a per-node invariant. 
The choice between these two approaches is largely a matter of 
taste and engineering. This example lets us raise two more issues 
beyond those already highlighted: (i) using per-node invariants 
enables more precise error reports ("node XYZ is out of order", 
versus "something in list ABC is out of order"), and (ii) in our 
implementation, per-node invariants may perform better: if the list 
is updated then only invariants in the vicinity of the update are re- 
checked, rather than the whole list being scanned. 
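The two formulations can be compared in a small runnable sketch using the stock stm library. The names nodeOk, listOk and fromList are hypothetical, and the tests return Bool instead of throwing so that they can be run without the check primitive.

```haskell
import Control.Concurrent.STM

data ListNode = ListNode { val :: TVar Int, next :: TVar (Maybe ListNode) }

-- Per-node test: the next node is absent, or holds a value no smaller than ours.
nodeOk :: ListNode -> STM Bool
nodeOk node = do
  nxt <- readTVar (next node)
  case nxt of
    Nothing -> return True
    Just n  -> (<=) <$> readTVar (val node) <*> readTVar (val n)

-- Whole-list test: walk from the head, applying the per-node test.
listOk :: ListNode -> STM Bool
listOk node = do
  ok <- nodeOk node
  nxt <- readTVar (next node)
  case nxt of
    Just n | ok -> listOk n
    _           -> return ok

-- Build a linked list from a non-empty list of values.
fromList :: [Int] -> STM ListNode
fromList []       = error "fromList: empty list"
fromList (x : xs) = do
  v <- newTVar x
  rest <- if null xs then return Nothing else Just <$> fromList xs
  ListNode v <$> newTVar rest

demo :: IO (Bool, Bool)
demo = do
  sorted   <- atomically (fromList [1, 2, 2, 5] >>= listOk)
  unsorted <- atomically (fromList [1, 3, 2] >>= listOk)
  return (sorted, unsorted)

main :: IO ()
main = demo >>= print
```

Note that nodeOk reads at most three TVars, whereas listOk reads every TVar in the list; this is the source of the performance difference mentioned in point (ii).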
3.3 Example 3: invariants over state pairs 
Our third example illustrates a kind of invariant which cannot be 
expressed in STM Haskell. Suppose that we wish to create a non-decreasing 
TVar, holding an integer value that is never allowed to 
be decreased by a transaction. We might attempt such a definition 
as follows: 
    newNonDecreasingTVar :: Int -> STM (TVar Int)
    newNonDecreasingTVar val
      = do { r <- newTVar val
           ; p <- newTVar val
           ; check (do { c_val <- readTVar r
                       ; p_val <- readTVar p
                       ; assert (p_val <= c_val)
                       ; writeTVar p c_val -- W1
                       })
           ; return r
           }
The intention here is that r refers to the TVar holding the non-decreasing 
value, that p refers to r's previous value, and that the 
check ensures that the previous value is no greater than the current value. 
Unfortunately this will not work - the write at W1 that is responsible 
for recording the previous value will be rolled back each time the 
invariant is checked. 
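The effect of the roll-back can be reproduced with the stock stm library. This is a sketch, not the implementation described in this paper: checkRolledBack is a hypothetical helper that emulates "check in a nested transaction, then roll back" by relying on the documented behaviour of catchSTM, which discards the TVar updates of an action that throws.

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception)

-- Carries an invariant's verdict out of a nested action that is then undone.
newtype Verdict = Verdict Bool deriving Show
instance Exception Verdict

-- Hypothetical emulation of a rolled-back invariant check: the invariant's
-- result escapes via an exception, and its TVar updates are discarded.
checkRolledBack :: STM Bool -> STM Bool
checkRolledBack inv =
  (inv >>= throwSTM . Verdict) `catchSTM` \(Verdict ok) -> return ok

demo :: IO (Bool, Bool, Int)
demo = do
  r <- newTVarIO 5
  p <- newTVarIO 5
  let inv = do
        c <- readTVar r
        q <- readTVar p
        writeTVar p c          -- corresponds to W1
        return (q <= c)
  ok1 <- atomically (writeTVar r 7 >> checkRolledBack inv)
  ok2 <- atomically (writeTVar r 6 >> checkRolledBack inv) -- a decrease
  pFinal <- readTVarIO p
  return (ok1, ok2, pFinal)

main :: IO ()
main = demo >>= print
```

Both checks succeed, including the one after r drops from 7 to 6, because p still holds its initial value: the recorded "previous value" never survives the check.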
This example might make it appear tempting to allow some 
limited kind of updates to be made within invariant checks; there 
are many ways that the state modified by these updates could be 
kept distinct from the state visible to the application through its 
own TVars. 
Leaving aside the question of exactly how updates are carried 
from one invariant check to another, retaining any kind of update 
is problematic semantically. This is because running an invariant 
check is no longer an idempotent operation. For instance, consider 
the following example in which the invariant check maintains a 
counter, failing when the counter reaches 10: 
    timebomb :: STM ()
    timebomb
      = do { c <- newTVar 0
           ; check (do { c_val <- readTVar c
                       ; writeTVar c (c_val + 1)
                       ; assert (c_val < 10)
                       })
           }
What should this mean? Must the check be performed on every 
transaction (failing when exactly 10 have been committed)? May 
the invariant be checked multiple times on every transaction - 
after all, the invariant updates a TVar (c) that it itself depends 
on? Conversely, is it permitted to elide checking this invariant at 
all - after all, it is not associated with any data reachable by the 
application? 
If such a definition is to be allowed then the only reasonable 
approach semantically would seem to be to execute it until it either 
fails or reaches a fixed point. This is not an attractive proposition 
in terms of performance and so we do not provide any support for 
maintaining updates from one invariant check to another. 
Having said that, as we discuss in Section 6, we can extend 
our system to support invariants such as newNonDecreasingTVar 
without allowing problems of the kind raised by timebomb. 
3.4 Example 4: invariants as guards 
Our final example illustrates a facet of our design on which we 
would particularly welcome feedback: what happens when an in- 
variant blocks? Recall that in STM Haskell, blocking is expressed 
by a retry statement being executed inside an atomic block. Semantically, 
this aborts the block and re-executes it from the start, 
although the implementation delays this re-execution until one of 
the TVars read by the block has been updated (without such an 
update the block would simply retry again, spinning uselessly). 
Suppose that we define a variant of the LimitedTVar type from 
Section 3.1 which blocks instead of failing (aside from naming, 
differences are highlighted in black): 
    newBlockingTVar :: Int -> STM LimitedTVar
    newBlockingTVar lim =
      do { v_n <- newTVar 0
         ; check (do { val <- readTVar v_n
                     ; if (val <= lim)
                         then return ()
                         else retry })
         ; return v_n }
The following atomic blocks create a TVar limited to 10, and then 
attempt to exceed that limit by incrementing it from 0 to 20: 
    xs <- atomic ( newBlockingTVar 10 ) -- A1
    -- intervening code elided
    atomic ( incBlockingTVar 20 xs ) -- A2
What should this mean? One option is that it should simply be 
forbidden. An alternative option is that executing retry when 
checking an invariant is exactly the same as executing retry within 
the block being checked: A2 will block until the increment can 
succeed without breaching the limit (perhaps because of work done 
by a concurrent thread forked elsewhere). 
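The blocking interpretation can be imitated with the stock stm library by placing the guard at the end of the transaction body. This is a sketch under that assumption: incBlocking is a hypothetical function that bundles the increment with the guard, whereas in the design above the guard would live in the invariant attached at construction.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (when)

-- Increment, then apply the guard; `retry` abandons the increment and waits
-- until a TVar read by the transaction changes.
incBlocking :: Int -> Int -> TVar Int -> STM ()
incBlocking lim delta tv = do
  modifyTVar' tv (+ delta)
  v <- readTVar tv
  when (v > lim) retry

demo :: IO Int
demo = do
  tv <- newTVarIO 0
  done <- newEmptyMVar
  -- Corresponds to A2: cannot commit while the result would exceed 10.
  _ <- forkIO (atomically (incBlocking 10 20 tv) >> putMVar done ())
  -- Work done by another thread that makes room for the increment.
  atomically (modifyTVar' tv (subtract 15))
  takeMVar done
  readTVarIO tv

main :: IO ()
main = demo >>= print
```

Whichever thread runs first, the increment commits only once the decrement is visible, so the final value is the same in both schedules.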
Our current semantics and implementation follow the latter 
alternative. As we discuss in the next section it is debatable whether 
this is the best choice here; however, it is reminiscent of how the 
SCOOP concurrency extensions for Eiffel interpret method preconditions 
as blocking guard conditions [21]. 
3.5 Design choices 
The preceding examples illustrated a number of decisions taken 
in the design of check. The first four of these are genuine design 
decisions on which we have selected one particular option based on 
the intuition gained from our examples: 
[D1] The granularity at which invariants are checked coincides 
with transaction boundaries. This follows many designs for database 
invariants and, of course, it is necessary to allow invariants such as "all entries 
in list L1 must also be in list L2" to be broken inside transactions 
that must update one list and then the other. 
[D2] An invariant must succeed both when it is passed to check, 
and also when the transaction proposing it is committed. Our design 
follows that of many of the database systems in Section 7.2. 
Although the decision that invariants must succeed when passed to 
check is debatable, it is essential that any new invariants succeed 
at the end of the transaction proposing them. This allows future 
invariant failures to correctly identify the offending transaction. 
[D3] The check function is an STM action, and so it can be composed 
with other STM actions in an atomic block. An early design [...] 
atomic blocks. Our examples illustrate the benefit of having check 
be an STM action: it can be encapsulated in STM-typed constructor 
functions. 
[D4] The closure passed to check is itself an STM action: it proceeds 
by reading directly from the TVars that the invariant depends 
on. This allows an invariant to re-use existing STM functions that 
may form part of the program logic. 
However, beyond these basic decisions, there are a number of cases 
where clear guidance does not follow from simple examples. To a 
large extent these are cases that a 'well behaved' invariant should 
not exercise: what if it updates TVars rather than just reading them, 
what if it loops, or what if it calls retry, orElse, or even check? 
We have explored two points in this design space. The first, 
in Section 3.6, is the one followed by our implementation and by 
Section 4's operational semantics. In this design we do not restrict 
the kinds of STM action that can be composed to form an invariant; 
instead we use nested transactions and roll-back to limit the kinds 
of side effect that can leak out from a badly behaved invariant. The 
second design, in Section 3.7, shows how we can use the Haskell 
type system to statically restrict invariants to only reading from 
TVars and performing pure computation. 
3.6 Unrestricted invariants 
Our first approach is to perform each invariant check in a nested 
transaction, and to roll back this nested transaction whether or 
not the invariant succeeds. This means that the invariant can use 
TVars internally without being able to affect the application's data 
structures. 
This approach leads to the following behavior for 'badly be- 
haved' invariants: 
[D5] If an invariant does not terminate at the end of a transaction 
then the transaction does not terminate. 
[D6] An invariant may update TVars within its own execution. 
[D7] If an invariant evaluates to retry then the user transaction is 
aborted and re-executed (potentially after blocking until it is worth 
re-executing it). 
[D8] If an invariant executes a check statement, then the new 
invariant is checked at that point, but is not retained by the system. 
Some of these design choices are open for debate. Two particular 
examples are the use of retry within invariants and the use of ordinary 
(i.e. catchable) exceptions to indicate failures. Our example 
from Section 3.4 illustrates how an invariant incorporating retry 
can remove the need to repeat a guard condition across multiple 
atomic blocks. 
We are somewhat uneasy with this kind of use. This is because 
it requires invariants to be checked at run-time: this is at odds with 
the intuition that testing could be disabled once a program appears 
to run without violations. 
3.7 Restricted invariants 
An alternative to the unrestricted invariants of Section 3.6 is to 
limit invariants to only reading from TVars. Doing so means that invariants 
cannot have side effects on TVars, or call retry, orElse, 
or check. 
This kind of restriction can be elegantly integrated with the interface 
to transactional memory in STM Haskell. Figure 2 shows 
how. The STM type constructor gets an extra type argument, e, that 
characterises the effects in the computation. Specifically, a computation 
of type STM ReadOnly t performs only read effects, while 
one of type STM Full t has arbitrary STM effects. The types 
    -- Phantom types for different kinds of STM action
    data ReadOnly
    data Full
    -- The STM monad distinguishing between kinds
    -- of STM action
    data STM e a
    instance Monad (STM e)
    -- Exceptions
    throw :: Exception -> STM e a
    catch :: STM e a -> (Exception -> STM e a) -> STM e a
    -- Running STM computations
    atomic :: STM Full a -> IO a
    retry :: STM Full a
    orElse :: STM Full a -> STM Full a -> STM Full a
    -- Transactional variables
    data TVar a
    newTVar :: a -> STM Full (TVar a)
    readTVar :: TVar a -> STM e a
    writeTVar :: TVar a -> a -> STM Full ()
    -- Invariants
    check :: STM ReadOnly a -> STM Full ()
Figure 2. The language level interface to transactional memory in 
STM Haskell, distinguishing between actions that can perform any 
STM action ("STM Full") and those that can only read from TVars 
("STM ReadOnly"). 
ReadOnly and Full are so-called phantom types; they have no data 
constructors and no values. 
The functions writeTVar, retry, and orElse in Figure 2 
all return Full computations. In contrast, readTVar is polymorphic 
in e, and hence can be used in both ReadOnly and Full 
contexts. The operations return, (>>=), catch, and throw are 
all similarly polymorphic, and hence are usable in both contexts. 
The key function in Figure 2 is check: it takes a ReadOnly 
computation and returns a Full computation. So, for example, 
check (readTVar x) is well-typed, while check (retry) or 
check (writeTVar x v) is not. 
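The phantom-type discipline can be tried out as a thin wrapper over the stock stm library. This is a sketch only: Tx is a hypothetical newtype standing in for the indexed STM type of Figure 2, and check here is simplified to test a read-only condition once, at the point where it is proposed, rather than registering it for later transactions.

```haskell
import qualified Control.Concurrent.STM as S

data ReadOnly
data Full

-- Hypothetical wrapper: `e` is a phantom index recording the permitted effects.
newtype Tx e a = Tx { unTx :: S.STM a }

instance Functor (Tx e) where fmap f (Tx m) = Tx (fmap f m)
instance Applicative (Tx e) where
  pure = Tx . pure
  Tx f <*> Tx m = Tx (f <*> m)
instance Monad (Tx e) where
  Tx m >>= k = Tx (m >>= unTx . k)

readTVar :: S.TVar a -> Tx e a            -- usable in either context
readTVar = Tx . S.readTVar

writeTVar :: S.TVar a -> a -> Tx Full ()  -- Full only
writeTVar tv = Tx . S.writeTVar tv

newTVar :: a -> Tx Full (S.TVar a)
newTVar = Tx . S.newTVar

atomic :: Tx Full a -> IO a
atomic = S.atomically . unTx

-- Simplified check: accepts only read-only computations.
check :: Tx ReadOnly Bool -> Tx Full ()
check (Tx inv) = Tx $ do
  ok <- inv
  if ok then return () else S.throwSTM (userError "invariant violated")

demo :: IO Int
demo = atomic $ do
  x <- newTVar 1
  writeTVar x 2
  check ((<= 10) <$> readTVar x)
  -- check (writeTVar x 99 >> return True)  -- rejected: Full is not ReadOnly
  readTVar x

main :: IO ()
main = demo >>= print
```

Uncommenting the marked line makes the program fail to type-check, which is the static guarantee the restricted design is after; no run-time tracking of the invariant's writes is needed.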
This design has its attractions: read-only invariants may be more 
amenable to static verification, and the implementation does not 
need to track and roll back their side effects. Conversely, restrictions 
limit the kinds of existing function that can be used in invariants 
- any algorithms that internally use TVars are prohibited, 
even if they do not clash with those used by the application. Fur- 
thermore, since executable invariants can still loop endlessly, it is 
not the case that check statements can be safely removed from an 
application once it runs without invariant failures. 
4. Operational semantics 
So far our discussion in Section 3 has been informal. It is hard to 
be sure that such descriptions cover all the combinations of these 
functions that might arise¹, so in this section we extend the formal, 
operational semantics of STM Haskell [10] to include the check 
primitive. We follow the design for unrestricted invariants from 
Section 3.6. 
Figure 3 gives the syntax of a fragment of STM Haskell. Terms 
and values are entirely conventional, except that we treat the ap- 
plication of monadic combinators, such as return and catch, as 
¹ As an example, even though we had completed a prototype implementation, 
the case of executing one invariant that proposes a second invariant is 
something we did not anticipate until writing these semantics. 
    Term                  $M, N$
    Thread soup           $P, Q ::= M_t \mid (P \mid Q)$
    Heap                  $\Theta ::= r \hookrightarrow M$
    Allocations           $\Delta ::= r \hookrightarrow M$
    Invariants            $\Omega ::= \{M\}$
    Evaluation contexts   [...]
    Action                [...]
    Value                 $V ::= \ldots \mid \backslash x \rightarrow M$
                              | return M | M >>= N
                              | putChar c | getChar
                              | throw M | catch M N
                              | retry | M `orElse` N
                              | forkIO M | check M
Figure 3. The syntax of values and terms. Definitions in gray come 
directly from those used with STM Haskell. Definitions in black 
indicate modifications. 
values. The do-notation we have been using so far is syntactic sugar 
for uses of return and >>=: 
    do {x<-e; Q} = e >>= \x -> do {Q}
    do {e; Q}    = e >>= \_ -> do {Q}
    do {e}       = e
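The translation can be checked on a concrete transaction using the stock stm library. The functions sugared and desugared are hypothetical names for the same three-step action written both ways.

```haskell
import Control.Concurrent.STM

-- A transaction written with do-notation.
sugared :: TVar Int -> STM Int
sugared tv = do { x <- readTVar tv; writeTVar tv (x + 1); readTVar tv }

-- The same transaction after applying the three translation rules by hand.
desugared :: TVar Int -> STM Int
desugared tv =
  readTVar tv >>= \x -> writeTVar tv (x + 1) >>= \_ -> readTVar tv

demo :: IO (Int, Int)
demo = do
  a <- newTVarIO 41 >>= atomically . sugared
  b <- newTVarIO 41 >>= atomically . desugared
  return (a, b)

main :: IO ()
main = demo >>= print
```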
Figure 4 gives a small-step operational semantics for the language. 
Definitions typeset in gray are identical to the original definitions 
for STM Haskell. Definitions typeset in black show modifications 
or additions needed for check. We will first of all outline 
the structure of the definitions in this figure (Section 4.1) and then 
show how they are extended to support check (Section 4.2). 
4.1 Original semantics 
We begin by describing the operational semantics of STM Haskell 
without invariants. The material of this section is largely taken from 
[10], but it is essential to understanding the changes for invariants. 
The semantics is given in Figure 4, which groups the existing 
transitions into three sets: 
The IO transitions are steps taken by threads. A transition 
$P; \Theta, \Omega \xrightarrow{a} Q; \Theta', \Omega'$ indicates a single step from a system with 
threads in state P to one with threads in state Q. Theta 
($\Theta$) is the state of the heap before the transition; $\Theta'$ is the state of 
the heap after the transition. $a$ is the IO action (if any) performed 
by the step. Omega ($\Omega$) is the current set of invariants; we return to 
its role in Section 4.2. 
The first two rules deal with input and output. If the active 
term is a putChar or getChar the appropriate labelled transition 
takes place, and the operation is replaced by a return carrying the 
result. Rule FORK allows a new thread to be created, by adding 
a new term M to the thread soup, allocating a fresh name t as its 
ThreadId. 
Rule ADMIN concerns administrative transitions, which are 
given in the second section of Figure 4. Rule EVAL allows a pure 
function M that is not a value to be evaluated by an auxiliary 
function, $\mathcal{V}[\![M]\!]$, which gives the value of M. This function is 
entirely standard, and we omit it here. Rule BIND implements 
sequential composition in the monad. The rules THROW, CATCH1 
and CATCH2 implement exceptions in the standard way. All of 
these rules are, as we shall see, used both for IO transitions and 
STM transitions, which is why we keep them in a separate group. 
Ignoring the additions for check, rules ARET and ATHROW 
define the semantics of atomic blocks that return a value (ARET), 
or that throw an exception (ATHROW). In each case the main idea is 
that the only way of performing "$\Rightarrow$" STM transitions is to package 
up the transitions for an entire atomic block and encapsulate them 
in a single "$\rightarrow$" IO transition; this is how atomicity is reflected in 
the rules. 
An STM transition has the form $M; \Theta, \Delta, \Omega \Rightarrow N; \Theta', \Delta', \Omega'$. 
It defines a transition within a single thread from state M to N. 
Once again, $\Theta$ is the state of the heap and $\Omega$ holds the invariants 
that we return to in Section 4.2. 
The role of delta ($\Delta$) is more subtle: it records the allocation 
effects of the transition. For instance, rules READ, WRITE and 
NEW are concerned with primitive accesses to TVars and their 
main effect is to return a value from the heap ($\Theta(r)$ in READ), 
or to update the heap ($\Theta[r \mapsto M]$ in WRITE). However, notice 
that as well as adding a new mapping to $\Theta$, NEW also adds it to $\Delta$. 
The reason for tracking allocation effects is the design choice 
that ATHROW rolls back the heap updates that a transaction makes 
when it terminates by an exception, but that it continues propagat- 
ing the exception that caused the roll back. This exception may 
contain references to TVars that were allocated within the transaction 
and so we must retain these allocations if we are not to introduce 
dangling pointers. $\Delta$ collects up these allocation effects and 
the ATHROW rule constructs a new heap state by combining them 
with the previous heap state ($\Theta \cup \Delta'$). 
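The same distinction between heap updates and allocation effects can be observed in GHC's stock stm library. This is a sketch; Leak is a hypothetical exception type that carries a TVar allocated inside the aborted transaction.

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception, try)

-- An exception that carries a TVar allocated inside the aborted transaction.
newtype Leak = Leak (TVar Int)
instance Show Leak where show _ = "Leak <tvar>"
instance Exception Leak

demo :: IO (Int, Int)
demo = do
  outer <- newTVarIO 0
  r <- try (atomically (do
         writeTVar outer 1          -- heap update: rolled back
         t <- newTVar 10            -- allocation: retained
         throwSTM (Leak t)))
  case r of
    Left (Leak t) -> (,) <$> readTVarIO outer <*> readTVarIO t
    Right ()      -> error "unreachable: the transaction always throws"

main :: IO ()
main = demo >>= print
```

The write to outer is undone, yet the TVar that escaped through the exception is still a valid reference holding its initial value, so no dangling pointer arises.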
The STM transition AADMIN incorporates pure computation, 
monadic bind and exception handling within transactions. 
Finally, the three rules OR1, OR2 and OR3 define the orElse 
combinator. OR1 says that M1 `orElse` M2 behaves like M1 
if that returns a value. OR2 says that if M1 raises an 
exception then that forms the result of the orElse operation. OR3 
says that if M1 completes by calling retry then we try M2 instead. 
The alert reader may be wondering why there is no rule 
ARETRY to go along with ARET and ATHROW, to account for 
the fact that an STM computation may evaluate to retry. There is 
no rule for this case. What that means is that an atomic block in 
which all orElse choices end in retry cannot make a series of 
STM transitions that will allow the ARET or ATHROW rules to be 
applied. To make progress, another thread must be chosen. 
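The OR3 case can be exercised directly with the stock stm library. In this small sketch the left alternative performs a write and then calls retry, so the right alternative runs against the heap as it was before the left alternative started.

```haskell
import Control.Concurrent.STM

demo :: IO (Int, Int)
demo = do
  tv <- newTVarIO 0
  r <- atomically $
         -- Left alternative: its write is discarded when it ends in retry.
         (writeTVar tv 99 >> retry)
           -- Right alternative: sees the original value of tv.
           `orElse` (readTVar tv >>= \v -> return (v + 1))
  v <- readTVarIO tv
  return (r, v)

main :: IO ()
main = demo >>= print
```

If the right alternative also ended in retry, the atomic block as a whole would have no applicable rule and the thread would wait, which is the situation described above.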
4.2 Semantics of invariants 
We are now ready to extend the semantics to incorporate check. 
There are three changes: 
Firstly, the state associated with IO transitions and STM transitions 
now includes a set of invariants $\Omega$. As Figure 4 shows, the 
majority of rules treat this set in the same way as the heap $\Theta$. 
Secondly, the STM transitions now include two rules for check. 
The first, CHECK1, is taken when the invariant holds at the point 
it is proposed. Above the line, the proposed invariant M evaluates 
to a return term in the current heap state. Below the line, the 
proposed invariant is added to $\Omega$ and the side effects of evaluating it 
are discarded. Note that the heap remains $\Theta$ and the allocation effects 
remain $\Delta$ - even if M's execution allocates new TVars there is no way that 
they can leak out because the result N is discarded. 
The second new STM transition, CHECK2, is taken when the 
invariant does not hold at the point it is proposed. Above the line, 
M evaluates to a throw term. Below the line, the exception is propagated, 
putChar ( . (-1, R ' . I r e t u r n  0 (4, R , 1'1 I  ( ( 1  
getchar . (-),a L r e t u r n  (3 ,R  ( ( : I ; ' /  ( " I  
f o r k T C : /  - . I , r e t u t n /  I ,  - / - 1 ,  ~ l O l : / i ~  
, {) =? return Ni; Q:, A:, 0:) 
-- ( :I I? /[.'(/ ' 1 1 
M ;  Q ,  {), R =? return N ;  Q', A ' ,  R' 3Mi E R' : (Mi;  Q', {), R' =? throw Ni; Ql,  A:, R!,) 
( A R E T 2 )  
P[atomic MI; Q , R  + P[throw Nil; (Q U A' U A:) ,  R 
r e t u r n  .I' >>= . \ /  . \' ( / ~ / . V l ~ j  catch ( r e t u r n  ;\.I:! Y - r e t u r n  .i./ ((:'.:\'/'(,'//I 1 
throw \' >>= :I,/ - ~. t h r o ~ : \ . ~  ('///h!O\J,') ca tch  (throw :I / )  , \  - - -  iv $1 ( (:.'!I '/'[>[/;? j 
re t ry >>= ;I( . . .  r e  ~ r y  (1; E'Y'l?. k' ;i ca tch  ( r e t r y j  . \  -, retiry ( (~ , ' ;~ ' / ' (~ / . / : ]  j 
e t u r n  (-)(I.'I': (-): A , R  i1'r,,,,,,ioi!l[(4'! (h!/+,'!l!)) 
e t u r n  0:: ( + : I . ,  : \ , f , .A ,R  i I : :  I ( l.l..llil /'I.) 
s t u r n  ;.;. (-)[!a . ; ; \ . j ' .  A.1. . . , \ / j ,R ir.r c: 1[071;,i'(3) (:V/,,~l!\") . . 
M ;  Q,  {), R =? return N ;  O', A', R' 
( CHECK1 ) 
IE[check MI; Q ,  A , R  a E[return 01; @,A,  (52 U { M ) )  
M ;  @ ,  { ) , R  =? throw N ;  Q', A1,R' 
( CHECK2) 
[E[checkM]; Q , A , R  + [E[throwN]; ( O U A ' ) ,  ( A U A 1 ) , R  
:\ 1 -- .'\- I 3 A ,  '. r e t u r n  . \ ;  (4'. A' ,  R' 
( ,.,I I  );\,I 1;Y ) v i'Ull'1) 
$-! , .L[! :  b ) ! A , f i  ~~:;. 1 I;\?: ( - ) ! A , n  ' !L :i\.fL corElsel  ,\,/:I: (9, 1, R . ..>. \:;; . .  ( - ) I% . A' ,  a' " 
('j! 1 , R  r e t r y ;  6)'. A1,R' 
- 0 ;  - -  .- -~ ~ .. 
r 11 .  ('OR 1 )  
,: .v/j 'orE1se1 i1,/:2j; ( 3 . A , R  . ~ > .  l.!jl.f:!j; ( - ) . l , R  
Figure 4. Operational semantics of STM Haskell. Definitions in gray form the original semantics. Definitions in black show modifications. 
keeping any allocation effects ($\Delta'$) that may be leaked by the 
exception. 
Finally, in the IO transitions, there are substantial changes to 
ARET1 (for successful atomic blocks) and a new rule ARET2 (for 
atomic blocks that break an invariant). Aside from the updates to 
$\Omega$, ARET1 adds an additional premise to the original rule: all of 
the invariants in place at the end of the atomic block must evaluate 
to return terms. Note that we consider all $M_i$ in $\Omega'$ - this will pick 
up any new invariants added during the atomic block. Also, when 
evaluating each invariant, we discard the actual value returned and 
the updates that the invariant may make to the heap and to the set 
of invariants. This mirrors our informal notion that invariants are 
checked in nested transactions that are then rolled back. 
The new rule ARET2 applies when any of the invariants evaluates 
to a throw term. As with ATHROW, the exception is propagated, 
retaining allocation effects but rolling back the remainder of 
the heap. Note that by using allocation effects $\Delta'$ and $\Delta'_i$ we retain 
any allocations in the original atomic block and any allocations [...] 
We have implemented check as an extension to our existing prototype 
of STM Haskell [10, 11]. The main point of this section 
is to demonstrate that invariants can be implemented in a practi- 
cal and scalable manner. At first sight one might have thought the 
opposite, because the specification requires that every invariant is 
checked after every atomic block, and that does not scale at all 
as the number of invariants grows. The main technical insight is 
that the very same mechanism that is already needed to support the 
STM (atomic, retry, orElse etc) can be re-used to trigger the 
checking of invariants: that is, an invariant INV is only run after a 
transaction T if a variable read by INV is written by T. 
Is this technique actually consistent with the semantics of Figure 4? 
Note that rule ARET1 requires all invariants to complete 
successfully, whereas our implementation may skip the evaluation 
of an invariant that does not depend on a given atomic block. The 
worry is that the implementation may skip an invariant that does not 
terminate, allowing an atomic block to commit when rule ARET1 
would not apply. 
This is not a problem. In outline, suppose that an invariant I1 
would loop after an atomic block A1. If the set of TVars read by 
I1 intersects the set updated by A1 then our implementation will 
execute I1 and the program will loop. Conversely, if the sets are 
disjoint then I1's execution will not have been affected by the atomic 
block and the looping would have occurred earlier (either after a 
block that did affect I1's read set, or at the point I1 was proposed). 
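The triggering rule amounts to an intersection test between an invariant's recorded read set and the committing transaction's write set. A minimal pure sketch follows; the Invariant record, the toCheck function and the use of strings to name TVars are all assumptions made for illustration, and the implementation described later uses per-TVar dependency lists rather than explicit sets.

```haskell
import Data.List (intersect)

type TVarId = String

-- Hypothetical record: an invariant and the TVars it read when last evaluated.
data Invariant = Invariant { invName :: String, readSet :: [TVarId] }

-- Names of the invariants whose recorded read set overlaps the write set.
toCheck :: [TVarId] -> [Invariant] -> [String]
toCheck written invs =
  [ invName i | i <- invs, not (null (readSet i `intersect` written)) ]

demo :: [String]
demo = toCheck ["T1-Next"]
  [ Invariant "Invariant-1" ["T1-Next"]
  , Invariant "Invariant-2" ["T2-Next"] ]

main :: IO ()
main = print demo
```

An invariant that is skipped here read nothing the transaction wrote, so re-running it would reproduce its previous outcome, including non-termination.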
In Section 5.1 we provide an overview of the original STM 
interface that we build on. We then discuss three steps in the 
implementation of check. The first step (Section 5.2) is how to 
identify the invariants that need to be checked at the end of an 
atomic block. The second (Section 5.3) is how to perform those 
checks. The third (Section 5.4) is how we extend STMCommit to 
ensure atomicity between the user's transaction and the checking 
of the invariants. 
5.1 Original STM interface 
The underlying STM is based on optimistic concurrency control: 
until it attempts to commit, a transaction builds up a private log 
recording the TVars that it has read from, the values that it has seen 
in the TVars, and the values that it proposes storing in them. 
The commit operation itself is disjoint-access parallel [14] 
(meaning that transactions accessing non-overlapping sets of TVars 
can commit in parallel) and read-parallel [7] (meaning that a set 
of transactions that have read from, but not updated, a TVar can 
commit in parallel). The commit operation is built over per-TVar 
locks implemented as part of the Haskell runtime system. Locks are 
only held during commit operations. We considered using a non-blocking 
STM derived from Herlihy et al.'s design [12], Fraser's 
design [6] or Marathe et al.'s hybrid design [19]: the indirection 
provided by TVars provides a natural counterpart to the object 
handles that these STMs use. We chose the lock-based design for 
two reasons: (i) the implementation is simpler, and (ii) the Haskell 
runtime schedules Haskell threads between a pool of OS threads 
tuned to the number of available CPUs; this removes some of the 
importance of a non-blocking progress guarantee. 
Within the multi-processor Haskell runtime system, the STM 
implementation provides an interface for managing transactions 
and performing reads and writes to TVars. The interface is shown in 
Figure 5. As usual, gray lines indicate existing parts of the interface 
and black lines indicate changes and additions². 
² For clarity we omit the further operations that support blocking and unblocking 
Haskell threads that execute retry statements; these are unchanged and the 
details are orthogonal to this paper. 
    // Basic transaction execution
    TLog *STMStart()
    TVar *STMNewTVar(void *v)
    void *STMReadTVar(TLog *tlog, TVar *t)
    void STMWriteTVar(TLog *tlog, TVar *t, void *v)
    // Transaction commit operations
    boolean STMIsValid(TLog *tlog)
    boolean STMCommit(TLog *tlog)
    // Invariant management
    List<Closure*> *STMGetInvariantsToCheck(TLog *tlog)
    void STMDefineInvariant(TLog *tlog,
                            Closure *c, TLog *inner)
    void STMRecordCheckedInvariant(TLog *outer,
                                   Closure *c, TLog *inner)
Figure 5. The STM runtime interface 
STMStart starts a new top-level transaction, returning a ref- 
erence to its transaction log. STMNewTVar, STMReadTVar and 
STMWriteTVar provide the basic operations to create, read, and 
update transactional variables. 
STMIsValid returns True if the specified transaction log is 
consistent with memory (transactions are periodically validated so 
that conflicts with concurrent transactions are guaranteed to be detected 
[10]). STMCommit attempts to commit the current transaction, 
returning True if it succeeds and False otherwise. 
STMStartNested creates a new transaction nested within the 
specified outer transaction. STMMergeNested attempts to commit 
a nested transaction by merging its transaction log into its parent's 
(the parent becomes invalid if the child was). Transaction logs are 
allocated in the garbage collected heap and remain private to a 
transaction until passed to STMCommit: a transaction is aborted by 
simply discarding all references to its log. 
5.2 Identifying invariants to check 
The key implementation idea is to dynamically track dependencies 
between invariants and TVars. We will illustrate this using the 
example in Figure 6(a). The figure shows two ListNode structures 
created by the newListNode function from Section 3.2. Each node
comprises two TVars: one for its val field and one for its next
field. The newly allocated nodes are not linked together, so the
next fields both hold Nothing. Each TVar contains two fields:
the first holds the TVar's value and the second forms the head of a
list of dynamic dependencies on the TVar. Link structures such as
L1-1 represent the dependencies between invariants and TVars³.
For instance, TVar T1-Val has the value 10 and no dependents,
whereas T1-Next has the value Nothing and is depended on by
Invariant-1.
At runtime the invariants attached in newListNode are rep-
resented by structures holding the closure to be checked, and a
list of the TVars that the invariant depended on when last eval-
uated. For instance, Invariant-1 is evaluated by computing
validNode(Node-1) whose result initially depends on T1-Next
(because the current value of that TVar is Nothing and so the
implementation of validNode does not examine the other TVars).
³ As described in our earlier paper [10] the same list is used to represent
[Figure 6, panel (a): Two newly allocated ListNodes with separate invariants.
Node-1 holds T1-Val = 10 and T1-Next = Nothing; Node-2 holds T2-Val = 20
and T2-Next = Nothing.
Panel (b): Node-1 is updated to make Node-2 its successor. This triggers
re-evaluation of Invariant-1, which checks that the two nodes
are in order. Node-1's invariant now depends on three TVars
(T1-Val, T1-Next and T2-Val).]
Figure 6. Runtime structures used to associate invariants with data
that they depend on.
There are two sets of invariants to check at the end of an
atomic block. Firstly, we must check any new invariants that
the block itself has proposed. Invariants are proposed by check-
ing the invariant in a nested transaction and, if it succeeds, call-
ing STMDefineInvariant, which updates a new-invariant list
attached to the current transaction log to include the supplied
invariant and the dependencies established in its initial execu-
tion. Secondly, we must check any existing invariants that de-
pend on TVars that the block intends to update. The function
STMGetInvariantsToCheck in Figure 5 returns a single list con-
taining both sources of invariants for the current transaction. Con-
sider what happens when a transaction attempts to update T1-Next
to link the two list nodes together: the update to T1-Next means
that STMGetInvariantsToCheck just returns Invariant-1.
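The selection rule just described can be sketched as a pure function. This is an illustrative model only: the string identifiers, the association-list representation of the dependency links, and the name invariantsToCheck are assumptions made for this sketch, and the entry for Node-2 (T2-Next watched by Invariant-2) is assumed by symmetry with Node-1.

```haskell
import Data.List (nub)

type TVarId = String
type InvId  = String

-- Dependency links: each TVar is paired with the invariants that read it
-- when they were last evaluated.
type Deps = [(TVarId, [InvId])]

-- Invariants to check at the end of an atomic block: those newly proposed
-- by the block, plus existing ones that depend on a TVar the block updates.
invariantsToCheck :: Deps -> [InvId] -> [TVarId] -> [InvId]
invariantsToCheck deps proposed updated =
  nub (proposed ++ concat [ invs | (tv, invs) <- deps, tv `elem` updated ])

-- The situation of Figure 6(a).
figure6a :: Deps
figure6a =
  [ ("T1-Val",  [])
  , ("T1-Next", ["Invariant-1"])
  , ("T2-Val",  [])
  , ("T2-Next", ["Invariant-2"])
  ]

-- Updating T1-Next selects only Invariant-1.
linkNodes :: [InvId]
linkNodes = invariantsToCheck figure6a [] ["T1-Next"]

-- Updating T1-Val alone selects nothing, because no invariant read it.
bumpValue :: [InvId]
bumpValue = invariantsToCheck figure6a [] ["T1-Val"]
```

The point of the design is visible in bumpValue: an update to a TVar that no invariant has read triggers no re-evaluation at all.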
5.3 Checking invariants 
Following the semantics of check, each invariant in the list re-
turned by STMGetInvariantsToCheck must be confirmed to exe-
cute without raising an exception. This is done by iterating through
the list and running each invariant in its own new transaction nested
within the user's transaction.
If a check fails then the user's transaction is aborted and the
exception indicating the failure is propagated⁴. If a check suc-
ceeds, then the invariant's closure and the nested transaction's log
are passed to the STM through STMRecordCheckedInvariant. As
described in the next section, the purpose of this call is to allow
STMCommit to update the invariant's dependencies and to ensure
that the whole set of invariant checks appears to take place atomi-
cally with the user's transaction.
⁴ Unlike the operational semantics, our runtime system does not need to
track the allocations that are made. This is because STMNewTVar places
new TVars directly in the garbage collected heap.
10. Lock user-tlog tvars
for each user-tlog log entry:
  if the entry is an update:
    try to lock the tvar
    if successful and current value matches entry:
      continue
    else:
      unlock tvars and abort
  if the entry is a read:
    record tvar's version number
15. Lock tvars related to invariants
for each invariant touched
  for each tvar in current dependence set: // I1
    try to lock the tvar
    if unsuccessful:
      unlock tvars and abort
  for each tvar in proposed dependence set: // I2
    try to lock the tvar
    if successful and current value matches that
       read when checking the invariant:
      continue
    else:
      unlock tvars and abort
20. Check reads
for each user-tlog entry:
  if the entry is a read then
    re-read the tvar's version number
    if this matches the one we recorded:
      continue
    else:
      unlock tvars and abort
25. Update invariant dependencies
for each invariant touched // I3
  for each tvar in current dependence set:
    unlink tvar from invariant
  for each tvar in proposed dependence set:
    link tvar to invariant
  retain current dependence set as old set
  install proposed dependence set as current set
30. Make updates
for each user-tlog entry:
  if the entry is an update:
    store new value to tvar, unlocking the tvar
35. Unlock tvars related to invariants
for each invariant touched
  for each tvar in old dependence set: // I4
    unlock the tvar if still locked
  discard old dependence set
  for each tvar in current dependence set: // I5
    unlock the tvar if still locked
Figure 7. Committing a transaction with invariant checking. 
5.4 Ensuring atomicity 
We now consider the changes made to STMCommit. The underly- 
ing commit operation follows a pattern typical of many STM de-
signs [7]: it acquires temporary ownership of the TVars that have
been updated, it checks that TVars that have been read have not 
been modified by concurrent transactions, it applies the transac- 
tion's updates to the heap, and it finally releases ownership of the 
TVars that it acquired. This is shown in the gray portions of Fig- 
ure 7. 
We extend this design with three additional steps shown in 
black in the figure. The inputs to these are the values passed to 
STMRecordCheckedInvariant, comprising the invariants' clo- 
sures and the new dependence information from the transaction 
logs from the invariants' execution. 
Step 15 ensures that STMCommit locks the TVars on which the
invariant previously depended (loop I1), and the TVars it accessed
when checked (loop I2). Note that some of these TVars may have
already been locked in step 10, and that loop I2 must check the
TVars' current values to ensure that the check is still up-to-date.
While holding these locks, step 25 updates the dependence 
information between the TVars and the invariants. 
Finally, step 35 releases any locks that have not already been 
released in the existing step 30. 
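The dependence update of step 25 can be sketched as a pure transformation of the dependency links. This is a model for illustration only: the locking of steps 15 and 35 is omitted, and the association-list representation and the names link, unlink and updateDependencies are assumptions of this sketch rather than part of the runtime.

```haskell
import Data.List (delete, nub)

type TVarId = String
type InvId  = String

-- Each TVar is paired with the invariants that currently depend on it.
type Deps = [(TVarId, [InvId])]

-- Remove the link from one TVar to the invariant.
unlink :: InvId -> TVarId -> Deps -> Deps
unlink inv tv deps =
  [ (t, if t == tv then delete inv is else is) | (t, is) <- deps ]

-- Add a link from one TVar to the invariant, creating the entry if needed.
link :: InvId -> TVarId -> Deps -> Deps
link inv tv deps
  | tv `elem` map fst deps =
      [ (t, if t == tv then nub (inv : is) else is) | (t, is) <- deps ]
  | otherwise = deps ++ [(tv, [inv])]

-- Step 25 for a single invariant: unlink every TVar in the current
-- dependence set, then link every TVar in the proposed set.
updateDependencies :: InvId -> [TVarId] -> [TVarId] -> Deps -> Deps
updateDependencies inv current proposed deps =
  foldr (link inv) (foldr (unlink inv) deps current) proposed

-- Figure 6(a) to Figure 6(b): after Node-1 is linked to Node-2,
-- Invariant-1 depends on T1-Val, T1-Next and T2-Val.
before :: Deps
before =
  [ ("T1-Val", []), ("T1-Next", ["Invariant-1"])
  , ("T2-Val", []), ("T2-Next", ["Invariant-2"]) ]

after :: Deps
after = updateDependencies "Invariant-1"
          ["T1-Next"] ["T1-Val", "T1-Next", "T2-Val"] before
```

In the runtime both dependence sets are locked while this update happens, which is what makes the relinking appear atomic with the user's transaction; the pure model has no need for that.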
There are a number of design choices here. In particular, we
chose to acquire all of the TVars in the dependence sets in loops
I1 and I2. This serves two purposes: (i) the locks acquired in
both loops protect the updates made in step 25, and (ii) the locks
acquired in loop I1 also act as an implicit lock on the invariant. This
is necessary to serialize concurrent user transactions attempting 
updates to distinct TVars on which the same invariant depends. 
An alternative design would explicitly lock invariants and use non- 
blocking lists to record the dependence between invariants and 
TVars. A non-blocking STMCommit algorithm could be developed 
by using helping in the usual way: all of the information needed by 
STMCommit is present at the start of the operation and can be made 
available through a descriptor in shared memory. 
5.5 Garbage collection 
The runtime structures in Figure 6 allow the memory occupied by 
invariants to be reclaimed automatically by the garbage collector: 
since there is no global list of invariants, each invariant becomes 
unreachable when all of the TVars it depends on become unreach- 
able. 
However, note that the links from invariants to TVars can extend 
the lifetimes of individual TVars that are not ordinarily reachable 
by the application. For instance, if T1-Val is reachable by the
application then the dependency links through Invariant-1 will
cause T1-Next and T2-Val (and everything reachable from them)
to be retained even if the list nodes themselves are no longer 
reachable by the application. 
6. Predicates over state pairs 
Having seen this implementation, recall our problematic example 
from Section 3.3: what if we want to express a property over pairs 
of states ("XYZ never decreases") rather than a property of a single 
state ("XYZ is never zero")? 
One could express such properties succinctly by allowing the in-
variant to read the "old" value of XYZ directly. Providing this ability
is rather simple, because the STM mechanism already retains XYZ's
old value in case the transaction is rolled back, and so we can read-
ily expose this value to the invariant check.
We can see two main approaches. The first is to provide a 
function to explicitly read the previous value from a TVar: 
readTVarOld :: TVar a -> STM a
However, while this is suitable for simple cases it requires separate
functions to be used for access to the pre-transactional state. An
alternative is to provide a mechanism for running an existing STM
computation against the pre-transactional state:
old :: STM a -> STM a
Using old we can express our example non-decreasing TVar as:
newNonDecreasingTVar :: Int -> STM (TVar Int)
newNonDecreasingTVar val
  = do { r <- newTVar val
       ; check (do { c_val <- readTVar r
                   ; p_val <- old (readTVar r)
                   ; assert (p_val <= c_val)
                   })
       ; return r;
       }
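The effect of such a two-state check can be approximated with the stock stm library, which is not assumed here to provide check or old. The sketch below is an emulation under stated assumptions, not the paper's mechanism: the hypothetical helper withNonDecreasing reads the TVar before and after running a transaction body, so it sees the pre-transactional value only when it wraps the whole atomic block, and it must be applied by hand to every transaction that might update the TVar.

```haskell
import Control.Concurrent.STM
import Control.Exception (ErrorCall (..), try)

-- Hypothetical helper: run a transaction body, then compare the value of
-- r before the body (standing in for the "old" state) with the value
-- after it. Throwing inside STM discards the body's updates.
withNonDecreasing :: TVar Int -> STM a -> STM a
withNonDecreasing r body = do
  pVal <- readTVar r            -- value before the body runs
  x    <- body
  cVal <- readTVar r            -- value the transaction wants to commit
  if pVal <= cVal
    then pure x
    else throwSTM (ErrorCall "invariant violated: value decreased")

-- One permitted update (5 to 7) and one rejected update (7 to 3).
demo :: IO (Either ErrorCall (), Int)
demo = do
  r   <- newTVarIO 5
  atomically (withNonDecreasing r (writeTVar r 7))
  bad <- try (atomically (withNonDecreasing r (writeTVar r 3)))
  v   <- readTVarIO r
  pure (bad, v)
```

The rejected transaction leaves the TVar at 7 because an exception raised inside atomically rolls the transaction back. The contrast with the approach in the text is that an invariant attached with check is re-evaluated automatically, without any cooperation from the code that performs the update.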
As with invariant checks in general, there are design choices to be
made over what kinds of operations can be performed in an old
computation. In fact, the same problems from Section 3.5 occur
and, unsurprisingly, the two broad solutions from Section 3.6 and
Section 3.7 are possible: the old computation can either
be run in its own transaction against the pre-transactional state, or
the old computation can be statically restricted to just performing
a series of readTVar operations. In the restricted setting we can
give old the following type:
old :: STM ReadOnly a -> STM e a
As with check, this means that old can only be supplied with a
ReadOnly STM action formed from readTVar operations and pure
computation.
However, there are two additional problematic cases. Firstly, an
old computation may try to read from a TVar that was allocated
during the current transaction. This is straightforward to handle in
our implementation because these allocation effects are kept dis-
tinct from the transaction's subsequent updates: the old computa-
tion will see the value with which the TVar was initialized.
The second problematic case is whether old should be usable
outside an invariant check. Doing so could harm modularity be-
cause it allows an STM-typed function to depend on the starting
state of the atomic block it occurs in, not just the state that it is
called from. This is ultimately a matter of taste since there is no
implementation reason to prevent such usage. However, if desired,
we could restrict old to just being used in invariant checks by refin-
ing its type to:
old :: STM ReadOnly a -> STM ReadOnly a
The use of ReadOnly on the right hand side means that the action
can only be performed in a context expecting a ReadOnly STM
action, i.e. ultimately within an invariant check.
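The typing discipline behind these signatures can be sketched with a phantom type parameter layered over the stock stm library. Everything named here (TSTM, Full, readT, writeT, liftReadOnly) is hypothetical and chosen for this illustration. The sketch shows only how the types confine updates; it does not implement old itself, since it has no access to the pre-transactional state.

```haskell
import Control.Concurrent.STM
  (STM, TVar, atomically, newTVarIO, readTVar, readTVarIO, writeTVar)

-- Phantom tags: they have no values and exist only at the type level.
data ReadOnly
data Full

-- An STM action tagged with the effects it is allowed to perform.
newtype TSTM e a = TSTM { runTSTM :: STM a }

instance Functor (TSTM e) where
  fmap f (TSTM m) = TSTM (fmap f m)

instance Applicative (TSTM e) where
  pure = TSTM . pure
  TSTM f <*> TSTM x = TSTM (f <*> x)

instance Monad (TSTM e) where
  TSTM m >>= k = TSTM (m >>= runTSTM . k)

-- Reading is allowed under any tag.
readT :: TVar a -> TSTM e a
readT = TSTM . readTVar

-- Writing forces the tag to Full, so it cannot appear in a ReadOnly action.
writeT :: TVar a -> a -> TSTM Full ()
writeT t v = TSTM (writeTVar t v)

-- Same shape as the first type given for old: it accepts only a ReadOnly
-- action and can be used in any context.
liftReadOnly :: TSTM ReadOnly a -> TSTM e a
liftReadOnly (TSTM m) = TSTM m

demo :: IO (Int, Int)
demo = do
  r <- newTVarIO (1 :: Int)
  seen <- atomically . runTSTM $ do
    writeT r 2
    liftReadOnly (readT r)
    -- liftReadOnly (writeT r 3) would be rejected by the type checker,
    -- because writeT is tagged Full rather than ReadOnly.
  final <- readTVarIO r
  pure (seen, final)
```

Changing the result type of liftReadOnly to TSTM ReadOnly a would give the more restrictive variant, usable only where a ReadOnly action is already expected.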
It is technically straightforward to add old to the semantics of
Figure 4 but we omit the details because it is syntactically verbose: 
the state carried into and between STM transitions would have to 
include the pre-transactional state (0) captured in the A l U 3  rules. 
7. Related work 
This paper builds on two main areas of existing work: (i) incorpo-
rating invariants in programming languages, and (ii) incorporating
invariants in databases. We discuss these in Sections 7.1 and 7.2 
respectively. 
7.1 Invariants in programming languages 
Many languages and tools have provided ways to express invariants 
over data. Gypsy and Alphard programs can include specifications
for use by formal methods [8, 26]. CLU [18], ESC/Modula-3 [4],
ESC/Java [5] and JML [17] include specifications in stylized com-
ments for processing by tools.
Euclid, Eiffel and Spec# are notable for embedding specifica- 
tions in the same language that is used for programming. An impor- 
tant design decision in all of these languages is how to generalize 
invariants to be able to refer to multiple objects in the presence of 
aliasing. For instance, suppose that an invariant on a list states that 
it only contains positive-valued integers. It is insufficient to check 
this each time a node is added to the list because, in general, the 
contents of a node may subsequently be updated via another refer- 
ence to it. 
Euclid, Eiffel, Spec# and our own work all take different ap-
proaches to this problem. As we introduced in Section 1, a contri-
bution of our approach is that we allow invariants to be defined dy-
namically (rather than, say, associated with class definitions), and
that we allow them to depend on arbitrary mutable state (rather
than, say, only on the fields of the current object).
Euclid includes explicit assert statements, pre- and post-
conditions on routines, and invariants on modules⁵ [16]. An in-
variant must remain true during the module's lifetime, except for
when routines exported from the module are executing. Although
these invariants could be written as boolean-typed Euclid expres-
sions, they were generally expected to be checked by verification
rather than checked at runtime [22] and so language mechanisms
to control updates to data that an invariant depends on are not re-
quired.
The Eiffel language supports class-based invariants which must 
be satisfied by every instance of the class whenever the instance is 
externally accessible; that is, immediately after creation, and before 
and after any call to an exported routine of the class [13]. Invari-
ants are boolean-typed Eiffel expressions. Note that invariants are
explicitly checked before calls as well as after them: this will de- 
tect changes that may have been made to objects that the invariant 
depends on. 
Spec# extends C# with several features to encourage robust
programming [1]. These include class invariants that are required
to hold on every instance of the class while it is not "exposed". A
new construct expose (o) { S } allows the invariant of o to be
temporarily broken within the statements S, but it must be restored
by the end of those statements; objects can only be updated while
exposed in this way. Furthermore, a hierarchical object-ownership
discipline is used to ensure that the invariant of one object depends
only on the state of that object and objects that it (transitively)
owns. This means that an object's invariant cannot be broken by
uncontrolled updates to objects that it depends on. In concurrent
settings, the same hierarchy can be used to associate locks with
aggregate objects [15].
7.2 Invariants in databases 
Stonebraker introduced the idea of defining integrity constraints
for a database independently from the basic requirements of its
schema [23]. He described simple constraints on individual fields
("Employee salaries must be positive"), constraints on fields in
the same row of a table ("Everyone in the toy department must
make more than $8000"), and more complex constraints involving
joins across tables ("Employees must earn less than two times the 
sales volume of their department if their department has a positive 
sales"). These constraints were expressed as a special form of 
query, and then enforced by combining them with database updates 
in such a way that an update cannot change data in a way that 
violates a constraint. 
In the POSTQUEL query language, Stonebraker et al. intro-
duced a more general system that supported integrity constraints
and computation triggered by database updates [24]. Their system
allowed existing commands to be tagged "always" or "refuse". An
"always" command can be used to trigger updates when related
data is modified, e.g. "Always replace Mike's salary with Bill's".
Conceptually they run continuously: when first executed, the com-
mand runs until it ceases to have an effect, whereupon it is re-run
whenever data that it has read or written to is updated. A "refuse"
command can be used to enforce integrity constraints ("refuse to
add an employee whose salary is more than $30k") or for security
("refuse to retrieve Mike's salary when logged in as Bill").
⁵ In Euclid, module is a type constructor; many instances of a module can
exist dynamically.
Cohen introduced "consistency rules" in the transactional lisp-
derived query language AP5 [2]. This design is the closest to our
own: all accepted transactions had to satisfy all of the constraints
that were defined. Transactions were defined by series of queries
grouped by an atomic [ ... ] construct; constraints could be
violated within the atomic block, but had to be restored by the end
of the block. Cohen's design allowed a user to specify whether or 
not a constraint had to be true at the point at which it was declared. 
The SQL:92 query language supports various kinds of con-
straint definition [3]. In particular, assertions can be general con-
straints involving an arbitrary collection of columns from an arbi-
trary collection of tables. For instance, "no supplier with status less
than 20 can supply any part in a quantity greater than 500":
CREATE ASSERTION supply CHECK
  ( NOT EXISTS ( SELECT * FROM S
                 WHERE S.STATUS < 20
                 AND EXISTS
                   ( SELECT * FROM SP
                     WHERE SP.SNO = S.SNO
                     AND SP.QTY > 500 ) ) )
Checking of constraints can be deferred within transactions and 
performed upon commit: if any constraint fails then the transaction 
fails and is rolled back. 
8. Conclusion 
The key ideas of this paper are to extend atomic blocks with a 
mechanism to dynamically define an invariant over arbitrary mu- 
table state and to re-use the STM machinery to track the depen- 
dence between transactions and that state. The result is that the sys- 
tem provides the appearance that every committed atomic block 
preserves every invariant, while only re-evaluating invariants that a 
given block actually appears to have changed. 
Some concluding observations: 
Erasure. A frequent point of discussion about this work is 
whether invariants should be used to detect operations that are 
attempted when the system is 'not ready' for them, either in-
dicating this explicitly by using retry within an invariant (as in
Section 3.4), or by catching an exception raised by an invariant
failure. 
A possible benefit of this approach is code brevity: perhaps an 
application would include duplicate checks, one within the imple- 
mentation of a transaction to check whether or not it is ready to 
run, and the second within an invariant attached to the data struc- 
tures that are being modified. 
Conversely, relying on invariants to control execution in this
way makes it impossible to disable invariant-checking once a pro- 
gram has been debugged, and harms modularity because there is no 
external indication of whether or not a library operation requires in- 
variant checking to be enabled. 
This, we feel, provides a strong argument for keeping invari-
ants for bug detection clearly distinct from similar operations that
form part of the application's logic. An interesting approach (sug-
gested by an anonymous reviewer) is to follow the database dis-
tinction between assertions and triggers: triggers are considered
part of the application logic and may be used to maintain invariants

ine a trigger-like construct that could also use retry to defer the
commit of a transaction when the system is not ready for it. 
Expressiveness. We have shown how STM lets us extend invari- 
ant checks to include executable predicates over the before and af- 
ter memory states of the transaction, rather than just the after state. 
This does raise the question of whether there are further kinds of 
invariant that would be useful to programmers but which cannot be 
expressed in our system. In principle there are some: nothing de- 
pending on three or more successive states can be expressed solely 
using invariant checks because any side effects incurred by check- 
ing invariants are rolled back. 
We have considered one further possible design that increases 
the expressiveness of the properties that can be described solely by 
checks. The idea is to allow check statements to add new invari- 
ants to the system, even though we roll back ordinary updates that 
checks make to the heap. For instance, a 'non repeating TVar' that 
cannot take the same value more than once could be implemented 
by one invariant check that adds further checks each time a new 
value is seen. This is more expressive, but perhaps ultimately im- 
practicable in many cases. There is one subtlety: any new invariants 
must themselves be checked against the post-transactional state as 
well as the state when check was called. This ensures that the com- 
plete set of invariants holds at the end of the transaction and that the 
set is closed under the re-execution of any invariant. 
We have held back from actually implementing this more com- 
plicated design because, in practice, we think it is an open question 
as to whether there are useful properties that cannot be captured
by our current design while still being suitable for expressing by 
executable specifications. 
Application to other languages. It is easy to see how these ideas 
could be applied to a language other than STM Haskell. However, 
there are two issues that we would like to highlight. Firstly, our use 
of dynamically-defined invariants benefits from Haskell's support 
for closures: our examples in Section 3 showed how concise invari- 
ants depended on variables from enclosing scopes. Secondly, STM 
Haskell is notable in that the type system constrains where mutable 
state can be accessed: it is guaranteed that the only updates to trans- 
actional variables occur within atomic blocks. This lets us ensure 
that invariants are re-evaluated when necessary. In other languages 
it will be necessary to consider whether such a segregation is valu- 
able. 
Acknowledgments 
The ideas in this paper have benefited greatly from discussion with 
the Spec# group and, in particular, we thank Daan Leijen, Mike 
Barnett and Ben Zorn for the ideas of readTVarOld, old, and
the use of phantom types. 
References 
[1] BARNETT, M., LEINO, K. R. M., AND SCHULTE, W. The Spec#
programming system. In Proceedings of CASSIS 2004 (2004).
[2] COHEN, D. Compiling complex database transition triggers. In
SIGMOD '89: Proceedings of the 1989 ACM SIGMOD international
conference on Management of data (New York, NY, USA, 1989),
ACM Press, pp. 225-234.
[3] DATE, C. J., AND DARWEN, H. A guide to the SQL standard, 4th ed.
Addison-Wesley, 2000.
[4] DETLEFS, D. L., LEINO, K. R. M., NELSON, G., AND SAXE, J. B.
Extended static checking. Tech. Rep. Research Report 159, Compaq
SRC, Dec. 1998.
[5] FLANAGAN, C., LEINO, K. R. M., LILLIBRIDGE, M., NELSON,
G., SAXE, J. B., AND STATA, R. Extended static checking for Java.
In PLDI '02: Proceedings of the ACM SIGPLAN 2002 Conference on
Programming language design and implementation (New York, NY,
USA, 2002), ACM Press, pp. 234-245.
[6] FRASER, K. Practical lock freedom. PhD thesis, University of
Cambridge Computer Laboratory, 2003.
[7] FRASER, K., AND HARRIS, T. Concurrent programming without
locks. Under submission.
[8] GOOD, D. I., COHEN, R. M., AND HUNTER, L. W. A report on the
development of Gypsy. In ACM 78: Proceedings of the 1978 annual
conference (New York, NY, USA, 1978), ACM Press, pp. 116-122.
[9] HARRIS, T., AND FRASER, K. Language support for lightweight
transactions. In Object-Oriented Programming, Systems, Languages
& Applications (OOPSLA '03) (Oct. 2003), pp. 388-402.
[10] HARRIS, T., HERLIHY, M., MARLOW, S., AND PEYTON JONES,
S. Composable memory transactions. In Proceedings of the ACM
Symposium on Principles and Practice of Parallel Programming, to
appear (June 2005).
[11] HARRIS, T., MARLOW, S., AND PEYTON JONES, S. Haskell on a
shared-memory multiprocessor. In Haskell '05: Proceedings of the
2005 ACM SIGPLAN workshop on Haskell (Sept. 2005), pp. 49-61.
[12] HERLIHY, M., LUCHANGCO, V., MOIR, M., AND SCHERER, III,
W. N. Software transactional memory for dynamic-sized data
structures. In Proceedings of the 22nd Annual ACM Symposium on
Principles of distributed computing (2003), ACM Press, pp. 92-101.
[13] INTERNATIONAL STANDARD ECMA-367, E. Eiffel analysis, design
and programming language, June 2005.
[14] ISRAELI, A., AND RAPPOPORT, L. Disjoint-access-parallel
implementations of strong shared memory primitives. In Proceedings
of the 13th Annual ACM Symposium on Principles of Distributed
Computing (Aug. 1994), pp. 151-160.
[15] JACOBS, B., LEINO, R., AND SCHULTE, W. Safe concurrency for
aggregate objects with invariants. In Proceedings of SEFM 2005.
[16] LAMPSON, B. W., HORNING, J. J., LONDON, R. L., MITCHELL,
J. G., AND POPEK, G. J. Report on the programming language
Euclid. SIGPLAN Not. 12, 2 (1977), 1-79.
[17] LEAVENS, G. T., RUBY, C., RUSTAN, K., LEINO, M., POLL,
E., AND JACOBS, B. JML (poster session): notations and tools
supporting detailed design in Java. In OOPSLA '00: Addendum to the
2000 proceedings of the conference on Object-oriented programming,
systems, languages, and applications (Addendum) (New York, NY,
USA, 2000), ACM Press, pp. 105-106.
[18] LISKOV, B. A history of CLU. In HOPL-II: The second ACM
SIGPLAN conference on History of programming languages (New
York, NY, USA, 1993), ACM Press, pp. 133-147.
[19] MARATHE, V. J., SCHERER III, W. N., AND SCOTT, M. L. Adaptive
software transactional memory. Technical report TR-868, Department
of Computer Science, University of Rochester, May 2005.
[20] MARLOW, S., PEYTON JONES, S., AND THALLER, W. Extending
the Haskell Foreign Function Interface with concurrency. In
Proceedings of the ACM SIGPLAN workshop on Haskell (Snowbird,
Utah, USA, September 2004), pp. 57-68.
[21] MEYER, B. Systematic concurrent object-oriented programming.
Commun. ACM 36, 9 (1993), 56-80.
[22] POPEK, G. J., HORNING, J. J., LAMPSON, B. W., MITCHELL,
J. G., AND LONDON, R. L. Notes on the design of Euclid. In
Proceedings of an ACM conference on Language design for reliable
software (1977), pp. 11-18.
[23] STONEBRAKER, M. Implementation of integrity constraints and
views by query modification. In SIGMOD '75: Proceedings of the
1975 ACM SIGMOD international conference on Management of
data (New York, NY, USA, 1975), ACM Press, pp. 65-78.
[24] STONEBRAKER, M., AND ROWE, L. A. The POSTGRES papers.
Tech. rep., Berkeley, CA, USA, 1987.
[25] WADLER, P. The essence of functional programming. In Conference
record of the Nineteenth Annual ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages: papers presented at the
symposium, Albuquerque, New Mexico, January 19-22, 1992 (New
York, NY, USA, 1992), ACM, Ed., ACM Press, pp. 1-14.
[26] WULF, W. A., LONDON, R. L., AND SHAW, M. An introduction
to the construction and verification of Alphard programs. In ICSE
'76: Proceedings of the 2nd international conference on Software
engineering (Los Alamitos, CA, USA, 1976), IEEE Computer Society

Sequential Specification of Transactional Memory Semantics *
Michael L. Scott
Department of Computer Science
University of Rochester
scott@cs.rochester.edu
* Presented at TRANSACT, the First ACM SIGPLAN Workshop on Languages,
Compilers, and Hardware Support for Transactional Computing, held in
conjunction with PLDI, Ottawa, Ontario, Canada, June 2006.
This work was supported in part by NSF grants CCR-0204344 and
CNS-0411127, financial and equipment support from Sun Microsystems
Laboratories, and financial support from Intel.
Abstract
Transactional memory (TM) provides a general-purpose mechanism with
which to construct concurrent objects. Transactional memory can also
be thought of as a concurrent object, but its semantics are less clear
than those of the objects typically constructed on top of it. In
particular, commit operations in a transactional memory may fail when
transactions conflict. Under what circumstances, exactly, is such
behavior permissible?
We offer candidate sequential specifications to capture the semantics
of transactional memory. In all cases, we require that reads return
consistent values in any transaction that succeeds. Each specification
embodies a conflict function, which specifies when two transactions
cannot both succeed. Optionally, a specification may also embody an
arbitration function, which specifies which of two conflicting
transactions must fail. In the terminology of the STM literature,
arbitration functions correspond to the concept of contention
management.
We identify TM implementations from the literature corresponding to
several specific conflict and arbitration functions. We note that the
specifications facilitate not only correctness (i.e., linearizability)
proofs for nonblocking TM implementations, but also formal comparisons
of the degree to which different implementations admit
inter-transaction concurrency. In at least one case (eager detection
of write-write conflicts and lazy detection of read-write conflicts),
the formalization exercise has led us to semantics that are arguably
desirable, but not, to the best of our knowledge, provided by any
current TM system.
1. Modeling STM
We can model a transactional memory as a mapping from objects to
values. Initially all values are undefined. The memory supports the
following operations:
start(t) Begin transaction t. No return value.
read(o, t) Return the current value of object o in the context of
transaction t. Return the distinguished value $\bot$ if o is
uninitialized.
write(o, d, t) Write d to o in the context of transaction t. No
return value.
commit(t) Attempt to commit transaction t and return a Boolean
indication of success. The call is said to succeed iff it returns
true.
abort(t) Abandon transaction t. No return value.
These definitions are intended to simplify correctness arguments,
not to simplify programming. The richer interfaces typical
of object-oriented software TM can be implemented in terms of 
these more basic primitives, without changing the underlying se- 
mantics. We defer discussion of such interfaces to Section 6. 
Following the terminology of Herlihy and Wing [8], a history is 
a finite sequence of operation invocation and response events, each 
of which is tagged with its arguments and return values, and with 
the id of the calling thread. In a sequential history, each invocation 
is immediately followed by its matching response, with no events 
in between. A sequential history H thus induces a total order $<_H$
on its operations. Throughout the rest of the paper we will consider 
only sequential histories. We define the semantics of transactional 
memory on these histories. 
A transaction is a sequence of operations, performed by a single
thread, of the form (start (read | write)* (commit | abort)),
where t is a unique transaction descriptor passed to start, to the
commit or abort, and to every read or write in between. Transactions
S and T in history H are said to overlap if
$\mathit{start}_S <_H \mathit{end}_T$ and
$\mathit{start}_T <_H \mathit{end}_S$, where $\mathit{end}_T$ is T's
commit or abort operation. Transaction T is said to be isolated in H
if for all transactions S $\neq$ T in H, S and T do not overlap. We
say a history H is serial if it consists of a sequence of isolated
transactions, optionally followed by a single uncompleted transaction
(i.e., a transaction prefix). For convenience, we associate
$\mathit{end}_T$ with the end of H if T is uncompleted (i.e., all
operations in H precede the end of an uncompleted transaction). If S
and T are both uncompleted, $\mathit{end}_S$ and $\mathit{end}_T$ are
incomparable under $<_H$.
We assume throughout this note that all histories are well- 
formed, meaning that every thread subhistory is serial (we do not 
currently consider nested or overlapped transactions within a single 
thread). Well-formedness implies, among other things, a one-one 
correspondence between transactions and their descriptors. We also 
assume, for simplicity, that write is called no more than once for 
a given object within a given transaction. A transaction is said to 
succeed if it ends with a commit that succeeds. It is said to fail if
it ends with a commit that fails. We use successful(H) to represent
the history obtained by deleting from H all operations of failed, 
aborted, or uncompleted transactions. 
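The definitions in this section can be made concrete with a small model. The following is a minimal sketch, assuming a sequential history is represented as a Python list whose positions encode the order $<_H$; the `Op` record, its field names, and the helper names `span`, `overlap`, `isolated`, `succeeded`, and `successful` are all illustrative assumptions, not part of the formalism above.

```python
from typing import List, NamedTuple, Optional, Tuple


class Op(NamedTuple):
    kind: str                  # "start", "read", "write", "commit", or "abort"
    tx: str                    # transaction descriptor
    obj: Optional[str] = None  # object accessed by a read or write
    val: object = None         # value read/written, or the Boolean result of a commit


History = List[Op]             # list position encodes the total order <_H


def span(h: History, t: str) -> Tuple[int, int]:
    """Indices of t's start and end; an uncompleted transaction ends at len(h)."""
    start = next(i for i, op in enumerate(h) if op.tx == t and op.kind == "start")
    end = next((i for i, op in enumerate(h)
                if op.tx == t and op.kind in ("commit", "abort")), len(h))
    return start, end


def overlap(h: History, s: str, t: str) -> bool:
    """start_S <_H end_T and start_T <_H end_S."""
    (s_start, s_end), (t_start, t_end) = span(h, s), span(h, t)
    return s_start < t_end and t_start < s_end


def isolated(h: History, t: str) -> bool:
    """True if no other transaction in h overlaps t."""
    others = [op.tx for op in h if op.kind == "start" and op.tx != t]
    return not any(overlap(h, s, t) for s in others)


def succeeded(h: History, t: str) -> bool:
    """True if t ends with a commit that returned true."""
    return any(op.tx == t and op.kind == "commit" and op.val is True for op in h)


def successful(h: History) -> History:
    """Delete all operations of failed, aborted, or uncompleted transactions."""
    return [op for op in h if succeeded(h, op.tx)]
```

In this encoding, two uncompleted transactions are treated as overlapping, since each starts before the end of the history.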
As defined by Herlihy and Wing, a sequential specification S of 
a concurrent object O is a prefix-closed set of sequential histories
on O. For most kinds of objects it is intuitively clear which histories
should be in S. Intuition is less clear for transactional memory. 
Certainly we must insist that reads return the "right" value in any 
transaction that succeeds. It also seems reasonable, at least in a 
preliminary study, to insist that a commit succeed if it ends an 
isolated transaction. But under what circumstances may a commit 
operation fail? 
To answer this question we first define, in Sections 2 and 3, a se- 
quential specification that embodies the two minimal requirements 
just suggested. Our definition is driven by the notion of a conflict 
function, which specifies the circumstances in which two transac- 
tions cannot both succeed. In Section 4 we introduce a variety of 
conflict functions, leading to a rich structure of sequential specifi- 
cations, several of which capture the semantics of published TM 
systems. We also identify an arguably attractive sequential speci- 
fication that is not, to our knowledge, embodied in any published 
system. In Section 5, we consider the notions of blocking and live- 
lock, and the extent to which they may be permitted or precluded 
by a sequential specification of TM. In particular, we introduce the 
notion of an arbitration function, which specifies, when two trans- 
actions conflict, which of them must fail. Section 6 explains how 
our model can accommodate an object-oriented API. We conclude 
in Section 7 with a summary and a list of open questions. 
2. Consistency 
We say a read operation r = read(o, t) in history H is consistent
if it returns the most recent committed value of o; that is, r returns
d if there exists an operation w = write(o, d, s) in a successful
transaction S such that (1) s $\neq$ t, (2)
$\mathit{commit}_S <_H r$, and (3) for all operations
x = write(o, e, u) in transactions U $\neq$ S, if U is successful,
then $\mathit{commit}_U <_H \mathit{commit}_S$ or
$r <_H \mathit{commit}_U$. If there is no such w, then o is
uninitialized, and r returns $\bot$. Our definition does not make
writes in T visible to subsequent reads in T, but this restriction is
easily relaxed at a higher level of abstraction (we do so in
Section 6).
We say a history H is consistent if (1) every read in every
successful transaction is consistent, and (2) every such read is
still valid when its transaction commits; that is, if r = read(o, t)
appears in a successful transaction T, then there exists an operation
w = write(o, d, s) in a successful transaction S such that (a) r
returns d, (b) s $\neq$ t, (c)
$\mathit{commit}_S <_H \mathit{commit}_T$, and (d) for all operations
x = write(o, e, u) in transactions U $\notin$ {S, T}, if U is
successful, then $\mathit{commit}_U <_H \mathit{commit}_S$ or
$\mathit{commit}_T <_H \mathit{commit}_U$. Note that this definition
permits an implementation to ignore the ABA problem: a read is still
considered valid at commit time if its value has been overwritten and
then restored.
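The two clauses of this definition can be checked mechanically on small histories. The sketch below assumes the same list-of-operations encoding used for illustration elsewhere in this rewrite (the `Op` record and the function names are hypothetical), with `None` standing in for $\bot$. It also assumes that a read of an uninitialized object remains valid at commit time as long as the object is still unwritten; the definition above does not spell out that case.

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str                  # "start", "read", "write", "commit", or "abort"
    tx: str                    # transaction descriptor
    obj: Optional[str] = None  # object accessed by a read or write
    val: object = None         # value read/written, or the Boolean result of a commit


def committed_value(h, upto, obj, reader):
    """Value of obj written by the successful transaction (other than reader)
    whose commit most recently precedes index upto; None if there is none."""
    value = None
    for j in range(upto):
        c = h[j]
        if c.kind == "commit" and c.val is True and c.tx != reader:
            for w in h[:j]:
                if w.kind == "write" and w.tx == c.tx and w.obj == obj:
                    value = w.val          # later commits override earlier ones
    return value


def consistent_history(h):
    """Clause (1): each read in a successful transaction returns the most recent
    committed value. Clause (2): that value is still current at commit time."""
    for i, r in enumerate(h):
        if r.kind != "read":
            continue
        commit = next((j for j, c in enumerate(h)
                       if c.tx == r.tx and c.kind == "commit" and c.val is True), None)
        if commit is None:
            continue                       # only successful transactions are constrained
        if r.val != committed_value(h, i, r.obj, r.tx):
            return False                   # read was not consistent when performed
        if r.val != committed_value(h, commit, r.obj, r.tx):
            return False                   # read was invalidated before the commit
    return True
```

Because validity compares values rather than writer identities, a history in which an object is overwritten and then restored before the reader commits passes the check, matching the remark about the ABA problem.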
Lemma 1. In any consistent history, all reads of the same object 
in the same successful transaction return the same value. 
Proof: Immediate consequence of the validity of reads. 
Theorem 1 (Fundamental theorem of TM). If H is a consistent 
history, then so is the serial history J consisting of all and only 
the transactions in successful(H), ordered according to the order of 
their commit operations in H. 
Proof: Consider history I = successful(H). Clearly I is consistent,
since the definition of consistency makes no reference to unsuccessful
transactions. Now consider serial history J, consisting of all
transactions of I, ordered according to the order of their commit
operations in I. All of J's transactions remain successful, and its
commit operations appear in the same order they did in I. Moreover,
because I's reads are valid at commit time, they remain consistent
in J. Thus J as a whole is consistent.
In the terminology of the database community [11, Sections
16.3 and 17.1], any history in which all reads are consistent avoids
cascading aborts: when a transaction fails or aborts, an implemen- 
tation never has to cause other transactions to fail in order to en- 
sure consistency. Theorem 1, moreover, is equivalent to saying that 
consistent histories are strictly serializable or, equivalently, lin- 
earizable (since we never consider more than a single concurrent 
object-the transactional memory itself) [8]. There exist more re- 
laxed notions of consistency in which transactions can read stale 
values that force them to "commit in the past" or, conversely, read 
speculative values from writes that have not yet been committed; 
we do not consider such extensions here. 
3. Conflict 
Consistency alone does not capture intuition regarding transac- 
tional semantics. A history in which no transaction ever succeeds 
is certainly consistent, but the set of all such histories is not an 
appealing sequential specification. It seems reasonable to require
a commit operation to succeed unless its transaction T conflicts
with some other transaction S ,  in which case at most one of them 
can succeed. 
Let $\mathcal{H}$ be the set of all (well-formed) histories,
$\mathcal{D}$ be the set of all transaction descriptors, and
$H_{[s,t)}$ be the history obtained by removing from H all operations
that specify a transaction descriptor other than s or t, or that
follow commit(t), abort(t), commit(s), or abort(s) in H. (The notation
is meant to suggest a half-open interval: $H_{[s,t)}$ includes the
initial portions of both s's and t's transactions, but is missing a
suffix of the one that finishes last.) A conflict function C is then a
mapping from $\mathcal{H} \times \mathcal{D} \times \mathcal{D}$ to
{true, false} such that (1) C(H, s, t) = C(H, t, s); (2) if s = t
or if the transactions corresponding to s and t do not overlap, then
C(H, s, t) = false; and (3) if $H_{[s,t)} = I_{[s,t)}$, then
C(H, s, t) = C(I, s, t). In other words, for overlapping transactions
S and T, C makes its decision solely on the basis of the operations of
those two transactions (and their interleaving) prior to the earlier
of $\mathit{end}_S$ and $\mathit{end}_T$.
For convenience, we use $H_{[S,T)}$ and C(H, S, T) as shorthand
for $H_{[s,t)}$ and C(H, s, t), respectively, where s and t are the
descriptors of S and T, respectively. If C(H, S, T) = true, we also
say that "S and T have a C conflict."
Lemma 2. Given any conflict function C, history H, and isolated
transaction T in H, there is no transaction S that conflicts with T.
Proof: Immediate consequence of the definition of conflict.
A history H is said to be C-respecting, for some conflict function C,
if (1) for every pair of transactions S and T in H, if
C(H, S, T) = true, then at most one of S and T succeeds; and
(2) for every transaction T in H, if T ends with a commit operation,
then that operation succeeds unless there exists a transaction
S in H such that C(H, S, T) = true. Put another way, if there is
no S that conflicts with T, then T's commit succeeds.
For any given function C, we use the term C-based transactional
memory to denote the set of all consistent, C-respecting histories.
It seems reasonable to define conflict functions in a way that
forces any C-respecting history to be consistent, but nothing about
the definition of conflict requires this. We say that C is
validity-ensuring if C(H, S, T) = true whenever there exists an
object o and operations r = read(o, t) in T and w = write(o, d, s)
in S such that S ends with a commit and
$r <_H \mathit{commit}_S <_H \mathit{end}_T$.
Lemma 3. If C is a validity-ensuring conflict function and H is 
a C-respecting history in which every read is consistent, then H is 
a consistent history. 
Proof: Immediate consequence of definitions. 
Given the ABA problem, a validity-ensuring conflict function 
is sufficient but not necessary to ensure that all reads in successful 
transactions are still valid at commit time. 



















Overlap conflict: Transactions S and T in history H conflict if 
S and T overlap. Overlap-based TM thus consists of all histories 
in which every isolated transaction is successful and no two over- 
lapping transactions are both successful. 
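As a rough illustration, the two clauses of "C-respecting" can be checked over a finite history for any candidate conflict function, here instantiated with overlap conflict. This is a sketch under assumptions: the `Op` record, the list encoding of histories, and the names `end`, `overlap_conflict`, and `c_respecting` are hypothetical helpers introduced for this example.

```python
from typing import Callable, List, NamedTuple, Optional


class Op(NamedTuple):
    kind: str                  # "start", "read", "write", "commit", or "abort"
    tx: str                    # transaction descriptor
    obj: Optional[str] = None
    val: object = None         # for a commit: True (succeeded) or False (failed)


def end(h: List[Op], t: str) -> int:
    """Index of t's commit or abort; len(h) if t is uncompleted."""
    return next((i for i, op in enumerate(h)
                 if op.tx == t and op.kind in ("commit", "abort")), len(h))


def overlap_conflict(h: List[Op], s: str, t: str) -> bool:
    """S and T conflict iff they are distinct and overlap."""
    start = {op.tx: i for i, op in enumerate(h) if op.kind == "start"}
    return s != t and start[s] < end(h, t) and start[t] < end(h, s)


def c_respecting(h: List[Op], conflict: Callable[[List[Op], str, str], bool]) -> bool:
    txs = [op.tx for op in h if op.kind == "start"]
    result = {op.tx: op.val for op in h if op.kind == "commit"}
    for t in txs:
        rivals = [s for s in txs if s != t and conflict(h, s, t)]
        # Clause (1): two conflicting transactions may not both succeed.
        if result.get(t) is True and any(result.get(s) is True for s in rivals):
            return False
        # Clause (2): a commit may fail only if some transaction conflicts with it.
        if result.get(t) is False and not rivals:
            return False
    return True
```

With `overlap_conflict`, the check accepts histories in which overlapping transactions all fail, which is exactly the weakness discussed in Section 4.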
Lemma 4. For any conflict function C, history H, and transactions
S and T in H, if S and T have a C conflict, they also have an
overlap conflict.
Proof: Immediate consequence of the definition of conflict function.
Theorem 2. For any conflict function C, C-based TM is a sequential
specification.
Proof: By the definition of sequential specification, we need only
show that C-based TM is prefix-closed. Suppose the contrary:
there exists some history H $\in$ C-based TM and some prefix P of H
such that P $\notin$ C-based TM. There are two cases to consider.
First, suppose there exist two successful transactions S and T that
conflict in P but not in H. Since T is successful in P, P must include
$\mathit{commit}_T$, which implies that $P_{[S,T)} = H_{[S,T)}$. But
this implies that C(P, S, T) = C(H, S, T), a contradiction. Second,
suppose there exists some failed transaction T that has an excuse to
fail in H but not in P. There must exist some transaction S in H such
that C(H, S, T) = true but C(P, S, T) = false. Since T fails in P, P
must include $\mathit{commit}_T$, which implies that
$P_{[S,T)} = H_{[S,T)}$. But this implies that
C(P, S, T) = C(H, S, T), a contradiction.
4. Requiring concurrency 
Overlap-based TM is a very weak specification; it admits an imple- 
mentation in which overlapping transactions are never successful. 
An implementation might, for example, employ global counts of 
the number of started and active transactions. Operation start(t) 
would increment both counts and remember the started count; 
commit(t) would decrement the active count and return true iff 
the result were zero and the started count were equal to the remem- 
bered value. 
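The counting scheme just described might look as follows. This is a minimal single-threaded sketch: it models only the bookkeeping that decides commit outcomes, not the memory itself. The class name and the handling of abort (which the text does not describe) are assumptions of the example.

```python
class CountingOverlapTM:
    """Commit succeeds only for a transaction that overlapped no other one."""

    def __init__(self):
        self.started = 0        # total number of transactions ever started
        self.active = 0         # number of transactions started but not yet ended
        self.remembered = {}    # descriptor -> value of `started` at its start

    def start(self, t):
        self.started += 1
        self.active += 1
        self.remembered[t] = self.started

    def commit(self, t):
        self.active -= 1
        seen = self.remembered.pop(t)
        # Succeed iff nobody else is active and nobody started after t did.
        return self.active == 0 and self.started == seen

    def abort(self, t):
        # Assumption: an abort releases the slot exactly as a commit does.
        self.active -= 1
        self.remembered.pop(t)


tm = CountingOverlapTM()
tm.start("a")
print(tm.commit("a"))   # an isolated transaction commits successfully
```

Note that when two transactions overlap, the one that started last and commits last can still succeed under this scheme; overlap-based TM only forbids both from succeeding.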
To require that certain non-isolated transactions succeed, we 
must refine our definition of conflict, so more transactions are seen 
to be conflict-free. As a first step, we might insist that readers be 
permitted to proceed concurrently. (Remember here that we are still 
talking about sequential histories. Our goal is to increase concur- 
rency among transactions, not [in this note] among individual op- 
erations.) 
Writer overlap conflict: Transactions S and T conflict in history 
H if they overlap and one performs a write before the other ends. 
Most TM systems go further, allowing transactions to proceed 
concurrently if they do not perform conflicting accesses to the same 
object: 
Lazy invalidation conflict: Transactions S and T conflict in
history H if there exist operations r = read(o, t) in T and w =
write(o, d, s) in S such that S ends with a commit operation and
$r <_H \mathit{commit}_S <_H \mathit{end}_T$. In other words, S and T
conflict if S attempts to commit, and allowing it to succeed would
invalidate a read in T.
Eager W-R conflict: Transactions S and T conflict in history H
if (1) S and T have a lazy invalidation conflict or (2) there exist
operations r = read(o, t) in T and w = write(o, d, s) in S such
that $w <_H r <_H \mathit{end}_S$. In other words, beyond the
requirements of lazy invalidation conflicts, S and T conflict if a
read in T is "threatened" by a previous write in S; that is, if w
precedes r and the prefix of H that ends at r can be extended to
create a history in which r is invalidated by w.
[Figure 1: timeline diagrams for panels A (lazy invalidation),
B (eager W-R), C (mixed invalidation), and D (eager invalidation).]
Figure 1. Alternative definitions of conflict. Arrows indicate
history order. A straight terminator indicates a commit operation. A
curved terminator indicates that a transaction may optionally be
uncompleted.
Eager invalidation conflict: Transactions S and T conflict in
history H if (1) S and T have an eager W-R conflict or (2) there
exist operations r = read(o, t) in T and w = write(o, d, s)
in S such that $r <_H w <_H \mathit{end}_T$. In other words, beyond
the requirements of eager W-R conflicts, S and T conflict if a read in
T is threatened by a subsequent write in S; that is, if w follows
r and the prefix of H that ends at w can be extended to create a
history in which r is invalidated by w.
These definitions of conflict are illustrated graphically in Fig- 
ure 1. None of them defines writes to the same object as conflict- 
ing: writes do not become visible to other transactions until com- 
mit time, and the fact that some other transaction is planning to 
update an object at some point in the future is harmless. Of course 
if a transaction updates an object-reading its value before writing 
it-then a concurrent write is indeed a conflict. Under the object- 
oriented API of Section 6, every write will be an update. 
Note the asymmetry of eager W-R conflict: w would also
threaten r if $r <_H w <_H \mathit{end}_T$, but we do not define this
as a conflict. The rationale for this asymmetry is that in a practical
implementation a transaction must detect conflict with previous
activity in some other transaction. The "other half" of eager
invalidation, shown in Figure 1D, requires that readers be visible to
writers.
In practice, this in turn requires that readers modify some sort of 
metadata, inducing cache conflicts among readers that would not 
otherwise occur. 
Lemma 5. Lazy invalidation conflict is the weakest consistency- 
ensuring conflict function. 
Proof: Immediate consequence of definitions. 
Claim (Proof omitted). The OSTM of Harris and Fraser [1], with
appropriate API adjustments (see Section 6) is an implementation 
of lazy invalidation-based TM. The DSTM of Herlihy et al. [7], 
with appropriate API adjustments and visible readers, is an imple- 
mentation of eager invalidation-based TM. If it were augmented 
to permit validation of reads whose objects were subsequently ac- 



















































"validating through"), DSTM with invisible readers would be an 
implementation of eager W-R-based TM.¹
Note that the sets of histories induced by different conflict 
functions are generally incomparable. Consider, for example, the 
sequence of operations start(s) start(t) write(o, d ,  t )  read(o, s)  
commit(s) commit(t). If this sequence is executed in isolation, the 
read must return $\bot$. The return values of the commits, however,
will depend on the choice of conflict function: the transactions 
with descriptors s and t have an eager W-R conflict, but not a 
lazy invalidation conflict. The set of all lazy invalidation-respecting 
histories will include exactly one history corresponding to this 
sequence of operations: one in which both commits return true. 
The set of all eager W-R-respecting histories will include one in 
which both commits fail and two in which one succeeds but the 
other fails. 
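The example can be replayed against executable versions of the two conflict definitions. The sketch below is an illustration under assumptions: histories are Python lists in history order, the `Op` record and all function names are hypothetical, and each definition is applied with both assignments of the S and T roles, which is one way to satisfy the symmetry requirement on conflict functions.

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str                  # "start", "read", "write", "commit", or "abort"
    tx: str                    # transaction descriptor
    obj: Optional[str] = None
    val: object = None


def end(h, t):
    """Index of t's commit or abort; len(h) if t is uncompleted."""
    return next((i for i, op in enumerate(h)
                 if op.tx == t and op.kind in ("commit", "abort")), len(h))


def lazy_half(h, writer, reader):
    """r <_H commit_writer <_H end_reader for a read r of an object writer wrote."""
    commit = next((i for i, op in enumerate(h)
                   if op.tx == writer and op.kind == "commit"), None)
    if commit is None:
        return False
    written = {op.obj for op in h if op.tx == writer and op.kind == "write"}
    return any(op.kind == "read" and op.tx == reader and op.obj in written
               and i < commit < end(h, reader) for i, op in enumerate(h))


def wr_half(h, writer, reader):
    """w <_H r <_H end_writer for a write w and a read r of the same object."""
    return any(w.kind == "write" and w.tx == writer
               and r.kind == "read" and r.tx == reader
               and w.obj == r.obj and i < j < end(h, writer)
               for i, w in enumerate(h) for j, r in enumerate(h))


def lazy_invalidation(h, s, t):
    return s != t and (lazy_half(h, s, t) or lazy_half(h, t, s))


def eager_wr(h, s, t):
    return s != t and (lazy_invalidation(h, s, t)
                       or wr_half(h, s, t) or wr_half(h, t, s))


# start(s) start(t) write(o, d, t) read(o, s) commit(s) commit(t)
example = [Op("start", "s"), Op("start", "t"), Op("write", "t", "o", "d"),
           Op("read", "s", "o", None), Op("commit", "s"), Op("commit", "t")]
print(lazy_invalidation(example, "s", "t"), eager_wr(example, "s", "t"))
```

The reader s ends before the writer t commits, so no read is invalidated (no lazy invalidation conflict), but the read follows the write while t is still running, which is the eager W-R pattern.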
Eager W-R conflict gives transactions more excuses to fail than 
lazy invalidation conflict does (and eager invalidation conflict gives 
still more). In a practical implementation these extra excuses may
or may not be a good thing. They are good if they allow the im- 
plementation to improve performance by heuristically abandoning 
work on transactions that are likely to fail (but see Section 5 below); 
they are bad if they allow the implementation to neglect opportuni- 
ties for parallel speedup. 
An implementation that uses a hash function h to locate
transaction metadata might introduce the notion of h-conflicting
transactions: transactions that perform conflicting accesses to
objects in the same hash-induced equivalence class. Given a function
h, assume some arbitrary total order on objects, and let g(a), for
any object a, be the smallest object b such that h(a) = h(b). Then
for any conflict function C, history H, and transactions S and T
in H, S and T would be said to have an hC conflict if the
transactions S' and T' have a C conflict, where S' and T' are obtained
from S and T by replacing every object o in a read or write
operation with its image g(o). Definitions of hC-respecting histories
and hC-based TM would follow accordingly.
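The construction of g and the rewriting of a history can be sketched in a few lines. In this illustration, operations are plain `(kind, tx, obj)` tuples, objects are strings ordered lexicographically, and the names `canonical_map` and `rewrite`, as well as the toy hash function in the usage lines, are assumptions of the example rather than anything prescribed above.

```python
def canonical_map(objects, h):
    """Return g as a dict: g[a] is the smallest object b with h(b) == h(a)."""
    smallest = {}
    for a in sorted(objects):            # the assumed total order on objects
        smallest.setdefault(h(a), a)     # first object seen in each class wins
    return {a: smallest[h(a)] for a in objects}


def rewrite(history, g):
    """Replace the object of every read or write with its image under g."""
    return [(kind, tx, g[obj] if kind in ("read", "write") else obj)
            for kind, tx, obj in history]


# Toy hash: two equivalence classes, keyed by character-code parity.
toy_hash = lambda name: ord(name) % 2
g = canonical_map(["a", "b", "c", "d"], toy_hash)
history = [("start", "s", None), ("read", "s", "c"),
           ("start", "t", None), ("write", "t", "a"),
           ("commit", "t", None), ("commit", "s", None)]
print(rewrite(history, g))
```

After rewriting, the read of c and the write of a refer to the same representative object, so any conflict function applied to the rewritten history sees them as accesses to one object, even though the original transactions touched different objects.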
Claim (Proof omitted). The WSTM of Harris and Fraser [5]
is an implementation of h-lazy invalidation-based TM for some 
appropriate hash function h. 
If overlapping transactions S and T both read and then write 
the same object o, the argument for allowing S and T to proceed 
concurrently (as lazy invalidation does) is that any history in which 
both are uncompleted can be extended to abort either and commit 
the other; there is no way for an implementation to tell, a priori, 
which transaction "ought" to fail. This is a weak argument, how- 
ever, since S and T cannot both succeed. 
If, however, one of S and T writes o but the other merely reads 
it, there is a stronger argument for allowing them to proceed con- 
currently: both can succeed if the writer commits last. To capture 
this form of concurrency we can define the following: 
Mixed invalidation conflict: Transactions S and T conflict in
history H if (1) S and T have a lazy invalidation conflict or (2)
there exist operations r = read(o, s) in S, $w_S$ = write(o, d, s) in
S, and $w_T$ = write(o, e, t) in T such that
$r <_H w_S <_H \mathit{end}_T$ and $r <_H w_T <_H \mathit{end}_S$.
In other words, beyond the requirements of lazy invalidation
conflicts, S and T conflict if (a) a read in S is threatened by a
subsequent write in T, (b) the read is followed by a write in S, and
(c) both writes happen before either transaction ends.
¹ As implemented, DSTM with invisible readers realizes semantics only
subtly different from eager invalidation conflict: it admits histories in which 
both S and T are uncompleted, the last operation in T reads some object o, 
and there is a subsequent write of o in S. 
Mixed invalidation conflict falls between lazy invalidation con- 
flict and eager invalidation conflict, but is incomparable to eager 
W-R conflict. More formally and completely: 
Theorem 3. The sets of transactions that have lazy invalidation, 
eager W-R, eager invalidation, and mixed invalidation conflicts are 
nested as shown on the left side of Figure 2, with each of the 
containments non-trivial. 
Proof: Simple containment is an immediate consequence of the 
definitions of the respective conflict functions. Proper containment 
is illustrated by the examples on the right side of Figure 2. 
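To make the mixed invalidation definition concrete, the following sketch evaluates it on two small histories: one in which both transactions read and then write the same object, and one in which a reader and a writer overlap but the writer commits last. The list encoding, the `Op` record, and the function names are assumptions of this illustration; the definition is applied with both assignments of the S and T roles.

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str                  # "start", "read", "write", "commit", or "abort"
    tx: str                    # transaction descriptor
    obj: Optional[str] = None
    val: object = None


def end(h, t):
    """Index of t's commit or abort; len(h) if t is uncompleted."""
    return next((i for i, op in enumerate(h)
                 if op.tx == t and op.kind in ("commit", "abort")), len(h))


def lazy_half(h, writer, reader):
    """r <_H commit_writer <_H end_reader for a read r of an object writer wrote."""
    commit = next((i for i, op in enumerate(h)
                   if op.tx == writer and op.kind == "commit"), None)
    if commit is None:
        return False
    written = {op.obj for op in h if op.tx == writer and op.kind == "write"}
    return any(op.kind == "read" and op.tx == reader and op.obj in written
               and i < commit < end(h, reader) for i, op in enumerate(h))


def mixed_half(h, s, t):
    """Clause (2): r <_H w_S <_H end_T and r <_H w_T <_H end_S, with r, w_S in s."""
    for i, r in enumerate(h):
        if r.kind != "read" or r.tx != s:
            continue
        own = any(w.kind == "write" and w.tx == s and w.obj == r.obj
                  and i < j < end(h, t) for j, w in enumerate(h))
        other = any(w.kind == "write" and w.tx == t and w.obj == r.obj
                    and i < j < end(h, s) for j, w in enumerate(h))
        if own and other:
            return True
    return False


def mixed_invalidation(h, s, t):
    return s != t and (lazy_half(h, s, t) or lazy_half(h, t, s)
                       or mixed_half(h, s, t) or mixed_half(h, t, s))


both_update = [Op("start", "s"), Op("start", "t"),
               Op("read", "s", "o"), Op("read", "t", "o"),
               Op("write", "s", "o", 1), Op("write", "t", "o", 2)]
writer_last = [Op("start", "s"), Op("start", "t"),
               Op("read", "s", "o"), Op("write", "t", "o", 1),
               Op("commit", "s"), Op("commit", "t")]
print(mixed_invalidation(both_update, "s", "t"),
      mixed_invalidation(writer_last, "s", "t"))
```

The first history is flagged as soon as both writes have occurred, before either transaction tries to commit; the second is conflict-free because the reader finishes before the writer commits.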
We are currently experimenting with mixed invalidation-respecting
histories in our RSTM system [10]. To the best of our
knowledge, no other existing system currently implements these 
semantics (without also being eager W-R-respecting). 
5. Progress and arbitration 
So far our discussion has addressed only correctness: what are the 
legal histories that may be realized by an implementation? One is 
also usually interested in progress: under what circumstances, if 
any, may a thread be blocked by the state of other threads? Tradi- 
tionally progress has been discussed in the context of concurrent 
histories: when, if ever, can the response to an invocation be arbi- 
trarily delayed? For transactional memory, however, we may also 
be interested in transaction-level progress in sequential histories: 
when, if ever, can a thread suffer an arbitrarily long string of failed 
transactions? 
Consider, for example, the trivial implementation of overlap- 
based TM mentioned at the beginning of Section 4. This imple- 
mentation clearly admits blocking at the level of transactions: given 
any history H in which transaction T is uncompleted, any exten- 
sion of H in which T remains uncompleted will contain no suc- 
cessful transactions beyond the end of H .  The implementation also 
admits livelock: we can easily construct a history in which every 
thread performs an arbitrary number of commits, none of which 
succeeds. 
We define these conditions in the usual way: 
Starvation: A sequential specification S is said to be
starvation-free if for any thread a and any history H in S there
exists an n > 0 such that in any extension H' $\in$ S of H, if a
performs more than n commit operations in H' after H, at least one of
them will succeed.
Livelock: A sequential specification S is said to be livelock-free
if for any thread a and any history H in S there exists an n > 0
such that in any extension H' $\in$ S of H, if a performs more than
n commit operations in H' after H, some commit operation will
succeed in H' after H (not necessarily one of a's).
Blocking: A sequential specification S is said to be nonblocking
if for any thread a and any history H in S there exists an n > 0
such that in any extension H' $\in$ S of H, if all operations in H'
after H are performed by a, and they include at least n commit
operations, at least one of those commits will succeed.
Note that these conditions are defined here at the level of trans- 
actions. If extended in the obvious way to concurrent histories of 
implementations, they yield, respectively, the familiar notions of 
wait freedom, lock freedom, and obstruction freedom [6, 8].
Lemma 6. For any validity-ensuring conflict function C, C-based TM
admits blocking.
Proof: Consider histories of the form
$H_k = R\,W_1 W_2 \ldots W_k$, where R is the 2-operation sequence
start(r) read(o, r), performed by some thread a, and $W_i$ is the
3-operation sequence start($w_i$) write(o, i, $w_i$) commit($w_i$),
performed by some thread b. Since
































[Figure 2: containment diagram (left) and three separating timelines
(right), labeled "eager W-R but not lazy invalidation or mixed
invalidation", "mixed invalidation but not lazy invalidation or eager
W-R", and "eager invalidation but not eager W-R or mixed
invalidation".]
Figure 2. Left: containment relationships among sets of conflicting
transactions. Smaller sets provide fewer excuses for a transaction to
fail. Right: timelines illustrating histories that separate the inner
sets. Arrows indicate history order.
$W_i$. Thus given any n > 0, C-based TM contains a version of $H_n$
in which b performs all operations after R, including n commits,
all of which fail.
Note that an implementation is not required to fail all the writes 
in this example; the point is that C-based TM permits it to do so.
Corollary 1. For any validity-ensuring conflict function C, C-based
TM admits livelock and starvation.
If we want to ensure progress, clearly we need to insist that 
some transactions succeed even in the presence of conflicts. To do 
so, we introduce a function to arbitrate between pairs of conflicting 
transactions. We can then insist that a transaction succeed if there 
is no conflicting transaction to which it loses at arbitration. 
Where conflict is a purely local phenomenon, based only on
the operations of the conflicting transactions, we allow arbitration
to consider a broader context. Let $H_{[\![s,t)}$ be the prefix of H
extending through the earlier of commit(t), abort(t), commit(s), or
abort(s) in H. We define an arbitration function A to be a mapping
from $\mathcal{H} \times \mathcal{D} \times \mathcal{D}$ to
{true, false} such that (1) A(H, s, t) is undefined if s = t;
(2) $\neg A(H, s, t) \rightarrow A(H, t, s)$ if s $\neq$ t; and (3)
if $H_{[\![s,t)} = I_{[\![s,t)}$, then A(H, s, t) = A(I, s, t).
If transactions S and T conflict in H and A(H, S, T) = true,
transaction S must fail. It seems likely that many arbitration
functions will satisfy
$\neg A(H, s, t) \leftrightarrow A(H, t, s)$, but our definitions do
not require this. A history H is said to be AC-respecting, for some
conflict function C and arbitration function A, if (1) for every pair
of transactions S and T in H, if C(H, S, T) = true, then S fails
if A(H, S, T) = true, and T fails if A(H, T, S) = true; and (2)
for every transaction T in H, if T ends with a commit operation,
then that operation succeeds unless there exists a transaction S in
H such that C(H, T, S) = true and A(H, T, S) = true. AC-based
transactional memory denotes the set of all consistent,
AC-respecting histories.
Theorem 4. For any conflict function C and arbitration function 
A, AC-based TM is a sequential specification. 
Proof: Analogous to that of Theorem 2. 
As a simple example, we can extend the semantics of overlap- 
respecting histories with an arbitration function that chooses as 
victim the transaction that started first: 
Eagerly aggressive arbitration: For transactions S and T in
history H, A(H, S, T) = true if
$\mathit{start}_S <_H \mathit{start}_T$.
A trivial implementation of eagerly aggressive, overlap-based 
TM might keep the descriptor of the most recently started transac- 
tion in a global variable. Operation start(t) would store t in this 
variable; commit(t) would return true iff the variable were still t .  
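The single-variable scheme just described can be written down directly. This is a minimal sketch that models only the decision logic, in a single-threaded setting; the class and attribute names are assumptions of the example.

```python
class LatestStarterTM:
    """Eagerly aggressive, overlap-based commit decisions via one global variable."""

    def __init__(self):
        self.latest = None      # descriptor of the most recently started transaction

    def start(self, t):
        self.latest = t         # any transaction still running has now lost

    def commit(self, t):
        return self.latest == t  # succeed iff nothing has started since t did


tm = LatestStarterTM()
tm.start("s")
tm.start("t")                   # t starts while s is running, so s must fail
print(tm.commit("s"), tm.commit("t"))
```

A thread running alone always succeeds on its next complete transaction, since nothing can overwrite the variable between its start and its commit; this is the intuition behind the nonblocking property stated next.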
Lemma 7. Eagerly aggressive, overlap-based TM is nonblock- 
ing. 
Proof: Given any history H $\in$ eagerly aggressive, overlap-based
TM and any thread a, consider any extension H' of H composed
entirely of operations of a after H. If H' contains two commit 
operations after H then H' contains a full transaction T of a after 
H ,  during which no other transaction starts. By the definition of 
eagerly aggressive, overlap-based TM, T must be successful. 
Eagerly aggressive, overlap-based TM retains, trivially, the vul- 
nerability to livelock of ordinary overlap-based TM. One way to 
eliminate this problem is to resolve conflicts in favor of the trans- 
action that attempts to commit first: 
Lazily aggressive arbitration: For transactions S and T in
history H, A(H, T, S) = true if
$\mathit{commit}_S <_H \mathit{end}_T$ and for all transactions U
such that $\mathit{commit}_U <_H \mathit{commit}_S$,
C(H, U, S) = false or A(H, U, S) = true. That is, T must fail if it
conflicts with S, S commits first, and S is not itself forced to fail
by some earlier transaction.
Eagerly and lazily aggressive arbitration both resolve conflicts 
in favor of the thread that "discovers" the conflict. More precisely, 
in both cases the shortest history prefix in which the value of 
the arbitration function is defined ends with an operation of the 
"winning" thread. 
Theorem 5. For any conflict function C, lazily aggressive C-based
TM is livelock free.
Proof: Suppose the contrary: there exists a history H $\in$ lazily
aggressive C-based TM, a thread a, and a prefix P of H such
that a performs two commit operations after P in H, neither of
which succeeds. Consider the second commit. Call its transaction
T. How can T fail? By the definition of lazily aggressive
arbitration, there must be some conflicting transaction S in H such
that $\mathit{commit}_S <_H \mathit{commit}_T$ and S is not forced to
fail by any earlier transaction U. Moreover, since C(H, U, S)
considers only operations prior to the earlier of $\mathit{end}_U$
and $\mathit{end}_S$, S cannot be forced to fail by any later
transaction. By the definition of arbitration, S must succeed.
Moreover, since T starts after P, S commits after P, contradicting
our assumption.
NB: since sequential specifications say nothing about concur- 
rent histories, it is still possible for a concurrent implementation of 
a nonblocking, livelock-free specification to have operations that 
block or livelock. 
Theorem 6. For any validity-ensuring conflict function C, lazily
aggressive C-based TM admits starvation.
Proof: Consider histories of the form $H_k = W_1 W_2 \ldots W_k$, where
$W_i$ is the 6-operation sequence start($a_i$) start($b_i$) read($o$, $a_i$)
write($o$, $i$, $b_i$) commit($b_i$) commit($a_i$), and where all the $a$ transactions
are performed by the same thread $a$. Since $C$ ensures consistency,
each $b$ transaction conflicts with the corresponding $a$ transaction.
And by the definition of lazily aggressive arbitration, the $a$
transaction always loses. Thus given any $n > 0$, lazily aggressive
$C$-based TM contains exactly one version of $H_n$, in which thread
$a$ is never successful.
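The histories used in this proof can be replayed mechanically. The following Python sketch is illustrative only. It assumes a toy TM in which a commit fails exactly when some object the transaction read has since been overwritten by a committed writer, which is one simple way to let the earlier committer win. The names `ToyTM` and `run_history` are hypothetical and do not come from any TM system discussed here.

```python
class ToyTM:
    """Toy TM: reads are validated at commit time; the first committer wins."""

    def __init__(self):
        self.value = {}    # object -> last committed value
        self.version = {}  # object -> number of committed writes
        self.reads = {}    # transaction -> {object: version observed}
        self.writes = {}   # transaction -> {object: value to install}

    def start(self, t):
        self.reads[t] = {}
        self.writes[t] = {}

    def read(self, o, t):
        self.reads[t][o] = self.version.get(o, 0)
        return self.value.get(o)

    def write(self, o, d, t):
        self.writes[t][o] = d

    def commit(self, t):
        # t fails if any object it read was overwritten by an earlier committer.
        ok = all(self.version.get(o, 0) == seen
                 for o, seen in self.reads[t].items())
        if ok:
            for o, d in self.writes[t].items():
                self.value[o] = d
                self.version[o] = self.version.get(o, 0) + 1
        del self.reads[t], self.writes[t]
        return ok


def run_history(k):
    """Replay W_1 ... W_k and return [(a_i succeeded, b_i succeeded), ...]."""
    tm = ToyTM()
    outcomes = []
    for i in range(1, k + 1):
        a, b = ("a", i), ("b", i)
        tm.start(a)
        tm.start(b)
        tm.read("o", a)
        tm.write("o", i, b)
        ok_b = tm.commit(b)
        ok_a = tm.commit(a)
        outcomes.append((ok_a, ok_b))
    return outcomes


outcomes = run_history(4)
```

Under these assumptions every `b` transaction commits and every `a` transaction fails, however large `k` is chosen.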
Claim (Proof omitted). OSTM is an implementation of lazily 
aggressive, lazy invalidation-based TM. Even in the absence of 
adversarial scheduling, it admits the possibility that a thread will 
starve if it tries, repeatedly, to execute a long, complex transaction 
in the face of a continual stream of short conflicting transactions in 
other threads. 
Contention management. While it may seem natural for a se- 
quential specification to specify the outcome of conflicts, there are 
two potentially serious disadvantages to doing so. First, if we at- 
tempt to capture some nontrivial notion of fairness in our arbitra- 
tion function (based, perhaps, on how often the threads in question 
have lost at arbitration in the past), we may end up with an un- 
desirably complicated specification, or one that over-constrains the 
implementation (e.g., by requiring guarantees where heuristic or 
probabilistic assurances might be acceptable in practice). Second, 
we may preclude decisions based on factors outside the purview of 
the specification (e.g., thread priorities, processor load, or run-time 
cache performance).
An attractive alternative strategy is to couple a blocking or 
livelock-admitting sequential specification with an implementation 
that avoids the histories in which blocking or livelock occurs. In effect,
this is the suggestion of Herlihy, Luchangco, and Moir [6, 7],
who argue for obstruction-free algorithms. In such an algorithm 
the implementation subsumes the role of an arbitration function, 
which can then be realized as a self-contained contention man- 
agement module. So long as it follows certain minimal rules, a 
contention manager can guarantee forward progress without the 
design and verification complexity that would be required for di- 
rect implementation of a comparable arbitration function embed- 
ded in the specification. A variety of sophisticated contention man- 
agers, several of them quite subtle, have been developed in recent 
years [2, 3, 4, 12, 13, 14].
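To make the separation of concerns concrete, the sketch below shows a contention manager as a self-contained module that the TM runtime consults when one transaction finds another in its way. It is a hypothetical illustration in Python, not a rendering of any of the cited managers: the names `Txn`, `OldestFirstManager`, and `on_conflict` are assumptions, and the policy (the younger transaction aborts) is chosen only because it is short.

```python
import itertools

_clock = itertools.count()


class Txn:
    """Transaction descriptor; `birth` is assumed to survive aborts and retries."""

    def __init__(self, name):
        self.name = name
        self.birth = next(_clock)
        self.aborts = 0


class OldestFirstManager:
    """Hypothetical policy: on conflict, the younger transaction must abort."""

    def loser(self, attacker, victim):
        return victim if attacker.birth < victim.birth else attacker


def on_conflict(manager, attacker, victim):
    """Hook an (assumed) TM runtime would call when attacker meets victim."""
    who = manager.loser(attacker, victim)
    who.aborts += 1
    return who


old, young = Txn("old"), Txn("young")
manager = OldestFirstManager()
first = on_conflict(manager, young, old)   # young discovers the conflict
second = on_conflict(manager, old, young)  # old discovers the conflict
```

In both calls the younger transaction is chosen to abort, so the decision depends on the manager's policy rather than on which thread discovers the conflict. Swapping in a different policy requires no change to the specification.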
6. Object-based API 
As noted in Section 1, our model of transactional memory is intended
to simplify correctness arguments, not to simplify programming.
Several extensions are useful in practice, and indeed are embodied
in extant TM systems. We focus in this section on object-oriented
software TM systems such as DSTM [7], OSTM [1],
ASTM [9], SXM [2], and RSTM [10]. Our extensions are straightforward
optimizations and wrappers for the TM operations used in
Sections 1 through 5; they do not change the underlying semantics.
Simpler extensions, not presented here, would adapt our TM model
to hardware TM proposals.
We use each object in the TM model to represent a reference to a 
higher-level object, and require that (1) the pointer value passed to 
write is always new (created in the current transaction), and (2) the 
data to which it refers is never modified after the writing transaction 
commits or aborts. 
To avoid wasting work in a transaction that is doomed to fail,
we provide an acquire(o, d, t) operation that does what write
does, but returns a Boolean status. If the status is false, the TM has
determined (via eager conflict detection) that a subsequent commit
is guaranteed to fail. The transaction may then choose to call abort
immediately, rather than proceeding. In a similar vein, open(o, t)
takes the place of read, and returns nil (distinct from $\bot$) if a
subsequent commit is doomed to fail.
To eliminate the prohibition against multiple calls to write in a 
single transaction, we implement an open-w(o) operation: 
    if open-w has already been called on o in this transaction
        return what it returned last time
    else
        d1 := read(o)
        d2 := pointer to new data
              initialized to be a copy of *d1
        if !acquire(o, d2, t) then d2 := nil
        return d2
The intent here is that changes to program data will be made indi- 
rectly through the reference returned by open-w. The penultimate 
line eliminates the need for explicit calls to acquire. 
By analogy to open-w, we provide a memoizing open-r(o): 
    if open-r or open-w has already been called on o in this transaction
        return what it returned last time
    else return read(o)
Clearly, calls to open-r always return the same value in the same 
transaction. 
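The two wrappers above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `ToyTM` is a hypothetical stand-in whose `acquire` performs eager write-write conflict detection only, `Txn` merely holds the per-transaction memo table, and the hyphenated names open-w and open-r are written `open_w` and `open_r` because Python identifiers cannot contain hyphens.

```python
import copy


class Txn:
    def __init__(self):
        self.opened = {}  # object id -> reference returned by the first open


class ToyTM:
    """Hypothetical TM with eager write-write conflict detection."""

    def __init__(self, initial):
        self.current = dict(initial)  # object id -> committed data
        self.writer = {}              # object id -> transaction that acquired it
        self.pending = {}             # (transaction, object id) -> private copy

    def read(self, o, t):
        return self.current[o]

    def acquire(self, o, d, t):
        holder = self.writer.get(o)
        if holder is not None and holder is not t:
            return False              # a later commit by t would be doomed
        self.writer[o] = t
        self.pending[(t, o)] = d
        return True


def open_w(tm, o, t):
    if o in t.opened:                 # memoized: same answer as last time
        return t.opened[o]
    d1 = tm.read(o, t)
    d2 = copy.deepcopy(d1)            # new data, initialized as a copy of *d1
    if not tm.acquire(o, d2, t):
        d2 = None                     # nil: the transaction is doomed
    t.opened[o] = d2
    return d2


def open_r(tm, o, t):
    if o in t.opened:                 # covers earlier open_r and open_w calls
        return t.opened[o]
    d = tm.read(o, t)
    t.opened[o] = d
    return d


tm = ToyTM({"o": [1, 2]})
t1, t2, t3 = Txn(), Txn(), Txn()
w = open_w(tm, "o", t1)
```

Here `w` is a private copy that `t1` may modify freely, a second writer such as `t2` receives `None` from `open_w`, and a pure reader such as `t3` sees the committed data through `open_r`.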
Validation. While Theorem 1 ensures that successful transac- 
tions see a sequentially consistent view of memory, it does not en- 
sure that values read from different objects in a failed transaction 
will be mutually consistent: there may be no point in the serialized
history at which those values were simultaneously valid. Absent 
complete sandboxing of transactional operations (implemented via 
compiler support or binary rewriting), inter-object inconsistency 
can compromise program correctness in potentially catastrophic 
ways. In particular, use of an invalid code or data pointer can lead 
to modification of an arbitrary (nontransactional) data location, or 
execution of arbitrary code. 
We posit a validate(o, d) operation, implemented as return 
(read(o) = d), that can be used to verify that a value is still 
valid. DSTM, ASTM, and RSTM ensure consistency automatically 
and incrementally, by having open-r and open-w call validate for 
every previously-opened object. OSTM requires the programmer 
to insert such calls by hand whenever the use of inconsistent data 
might lead to unacceptable behavior. 
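A minimal Python sketch of incremental validation follows. It assumes a hypothetical `Store` that replaces an object's data pointer on every committed write, so that validate can be a pointer comparison, and an `open_r` that re-validates every previously opened object before returning. None of these names are taken from DSTM, ASTM, RSTM, or OSTM.

```python
class Inconsistent(Exception):
    """Raised when a previously opened object is no longer valid."""


class Store:
    def __init__(self, **objects):
        self.current = dict(objects)   # object id -> committed data pointer

    def read(self, o):
        return self.current[o]

    def commit(self, **updates):
        self.current.update(updates)   # each write installs a new pointer


def validate(store, o, d):
    return store.read(o) is d          # pointer comparison, as in read(o) = d


def open_r(store, opened, o):
    if o in opened:
        return opened[o]
    d = store.read(o)
    # Incremental validation: every earlier read must still be current.
    for prev_o, prev_d in opened.items():
        if not validate(store, prev_o, prev_d):
            raise Inconsistent(prev_o)
    opened[o] = d
    return d


store = Store(x=[0], y=[0])
opened = {}
open_r(store, opened, "x")
store.commit(x=[1], y=[1])             # another transaction commits in between
try:
    open_r(store, opened, "y")
    detected = False
except Inconsistent:
    detected = True
```

Without the validation loop the reader would observe the old `x` together with the new `y`, a pair of values that was never simultaneously valid.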
7. Conclusions 
In this note we have suggested that transactional memory be viewed 
not merely as a means of implementing concurrent objects, but as 
a concurrent object in its own right. Toward that end we consid- 
ered the sequential specification of transactional memory seman- 
tics. We suggested that any intuitively acceptable specification of 
TM consist of all and only those histories in which all read op- 
erations of successful transactions return the "right" value, and no 
commit operation fails unless provided an excuse to do so by some 
well-defined conflict function, optionally augmented with an arbitration
function. We presented a collection of conflict functions that
overlap in nontrivial ways, inducing a rich collection of sequential 
specifications. We noted that deferring the work of an arbitration 
function to the implementation corresponds to the notion of con- 
tention management in obstruction-free STM. 
Several of our sequential specifications capture the semantics 
of published TM systems. The formalization exercise also leads us 
to suggest that mixed invalidation-based TM (eager detection of 
write-write conflicts, lazy detection of read-write conflicts) might 
be an option worth exploring in future TM systems. Regarding the 
formalization itself, our work suggests a variety of open questions, 
among them: 
• Should we extend the notion of consistency to allow a read in a
  successful transaction to return a stale or, conversely, a not-yet-committed
  value?
• Can we characterize the circumstances under which a read in
  a failed or aborted transaction is permitted to return an "incorrect"
  value?
• How sophisticated an arbitration function can realistically be
  embedded in a sequential specification? Are there any advantages
  to including it there, rather than leaving it to the implementation?
• Can we characterize the conflict and arbitration functions that
  do or do not lead to blocking or livelock-admitting specifications?
• Can we develop a meaningful notion of probabilistic arbitration
  functions?
• Can we create an arbitration function that precludes starvation,
  or would this require extensions to the model of Section 1 (e.g.,
  to allow the specification of continuations)?
• Is there any potential benefit to extending the definition of
  conflict function to allow two non-overlapping transactions to
  conflict? This might, among other things, allow certain isolated
  transactions to fail.
• Is there any call for a weaker notion of "validity-ensuring conflict
  function" that would exploit value-restoring (ABA) writes?
Acknowledgments 
The ideas in this paper benefited greatly from the comments of the 
anonymous referees, and from discussions with Bill Scherer, David 
Eisenstat, Virendra Marathe, Mike Spear, and Mitsu Ogihara. 
References 
[1] K. Fraser and T. Harris. Concurrent Programming Without Locks.
Submitted for publication, 2004. Available as research.microsoft.com/
~tharris/drafts/cpwl-submission.pdf.
[2] R. Guerraoui, M. Herlihy, and B. Pochon. Polymorphic Contention
Management in SXM. In Proceedings of the Nineteenth International
Symposium on Distributed Computing, Cracow, Poland, September
2005.
[3] R. Guerraoui, M. Herlihy, M. Kapalka, and B. Pochon. Robust
Contention Management in Software Transactional Memory. In
Proceedings, Workshop on Synchronization and Concurrency in
Object-Oriented Languages, San Diego, CA, October 2005. In
conjunction with OOPSLA'05.
[4] R. Guerraoui, M. Herlihy, and B. Pochon. Toward a Theory of
Transactional Contention Managers. In Proceedings of the Twenty-Fourth
ACM Symposium on Principles of Distributed Computing, Las
Vegas, Nevada, August 2005.
[5] T. Harris and K. Fraser. Language Support for Lightweight
Transactions. In OOPSLA 2003 Conference Proceedings, Anaheim,
CA, October 2003.
[6] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-Free Synchronization:
Double-Ended Queues as an Example. In Proceedings of
the Twenty-Third International Conference on Distributed Computing
Systems, Providence, RI, May 2003.
[7] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software
Transactional Memory for Dynamic-sized Data Structures. In
Proceedings of the Twenty-Second ACM Symposium on Principles
of Distributed Computing, pages 92-101, Boston, MA, July 2003.
[8] M. P. Herlihy and J. M. Wing. Linearizability: A Correctness Condition
for Concurrent Objects. ACM Transactions on Programming
Languages and Systems, 12(3):463-492, July 1990.
[9] V. J. Marathe, W. N. Scherer III, and M. L. Scott. Adaptive Software
Transactional Memory. In Proceedings of the Nineteenth International
Symposium on Distributed Computing, Cracow, Poland, September
2005.
[10] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat,
W. N. Scherer III, and M. L. Scott. Lowering the Overhead of
Software Transactional Memory. In ACM SIGPLAN Workshop on
Languages, Compilers, and Hardware Support for Transactional
Computing, Ottawa, ON, Canada, July 2006. Held in conjunction
with PLDI 2006. Expanded version available as TR 893, Department
of Computer Science, University of Rochester, March 2006.
[11] R. Ramakrishnan and J. Gehrke. Database Management Systems.
McGraw-Hill, Third edition, 2003.
[12] W. N. Scherer III and M. L. Scott. Contention Management in
Dynamic Software Transactional Memory. In Proceedings of the
ACM PODC Workshop on Concurrency and Synchronization in Java
Programs, St. John's, NL, Canada, July 2004.
[13] W. N. Scherer III and M. L. Scott. Advanced Contention Management
for Dynamic Software Transactional Memory. In Proceedings of
the Twenty-Fourth ACM Symposium on Principles of Distributed
Computing, Las Vegas, NV, July 2005.
[14] W. N. Scherer III and M. L. Scott. Randomization in STM Contention
Management (poster paper). In Proceedings of the Twenty-Fourth
ACM Symposium on Principles of Distributed Computing, Las Vegas,



































Lock Inference for Atomic Sections 
Michael Hicks Jeffrey S. Foster Polyvios Pratikakis 
University of Maryland, College Park University of Maryland, College Park University of Maryland, College Park 
mwh@cs.umd.edu jfoster@cs.umd.edu polyvios@cs.umd.edu
Abstract 
To prevent unwanted interactions in multithreaded programs, pro- 
grammers have traditionally employed pessimistic, blocking con- 
currency primitives. Using such primitives correctly and efficiently 
is notoriously difficult. To simplify the problem, recent research 
proposes that programmers specify atomic sections of code whose 
executions should be atomic with respect to one another, without 
dictating exactly how atomicity is enforced. Much work has explored
using optimistic concurrency, or software transactions, as a means 
to implement atomic sections. 
This paper proposes to implement atomic sections using a static 
whole-program analysis to insert necessary uses of pessimistic concurrency
primitives. Given a program that contains programmer-specified
atomic sections and thread creations, our mutex inference
algorithm efficiently infers a set of locks for each atomic
section that should be acquired (released) upon entering (exiting) 
the atomic section. The key part of this algorithm is determining 
which memory locations in the program could be shared between 
threads, and using this information to generate the necessary locks. 
To determine sharing, our analysis uses the notion of continuation 
effects to track the locations accessed after each program point. As 
continuation effects are flow sensitive, a memory location may be 
thread-local before a thread creation and thread-shared afterward. 
We prove that our algorithm is correct, and provides parallelism 
according to the precision of the points-to analysis. While our algorithm
also attempts to reduce the number of locks while preserving
parallelism, we show that minimizing the number of locks is NP-hard.
1. Introduction 
Concurrent programs strive to balance safety and liveness. Pro- 
grammers typically ensure safety by, among other things, using 
blocking synchronization primitives such as mutual exclusion locks 
to restrict concurrent accesses to data. Programmers ensure liveness 
by reducing waiting and blocking as much as possible, for example 
by using more mutual exclusion locks at a finer granularity. Thus 
these two properties are in tension: ensuring safety can result in 
reduced or no parallelism, compromising liveness, while ensuring 
liveness could permit concurrent access to an object (a data race) 
potentially compromising safety. Balancing this tension manually 
can be quite difficult¹, particularly since traditional uses of blocking
synchronization are not modular, and thus the programmer must
reason about the entire program's behavior.
¹ As of the time this paper is written, Google returns 13,000 pdf documents
containing the phrase "notoriously difficult", the word "software", and one
of the words "multithreaded" or "concurrent."
Software transactions promise to improve this situation. A
transaction is a programmer-designated section of code that should
be serializable, so that its execution appears to be atomic² with
respect to all other transactions in the program. Assuming all
concurrently-shared data is accessed within atomic sections, the
compiler and runtime system guarantee freedom from data races
and deadlocks automatically. Thus, transactions are composable:
they can be reasoned about in isolation, without worry that an
ill-fated combination of atomic sections could deadlock. This characteristic
clearly makes transactions easier to use than having to
manipulate low-level mutexes directly in the program.
Recent research proposes implementing atomic sections using 
optimistic concurrency techniques [5, 6, 7, 12, 13]. Roughly speaking,
memory accesses within a transaction are logged. At the conclusion
of the transaction, if the log is consistent with the current
state of memory, then the writes are committed; if not, the transaction
is rolled back and restarted. The main drawbacks with this
approach are that first, it does not interact well with I/O, which cannot
always be rolled back; second, performance can be worse than
traditional pessimistic techniques due to the costs of logging and
rollback [9].
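The log, check, and retry cycle described above can be illustrated with a small Python sketch. It is a single-threaded simulation under stated assumptions, not the mechanism of any cited system: `Memory`, `Log`, and `run_atomic` are hypothetical names, consistency is checked with per-cell version counters, and interference from another thread is simulated by a direct store during the first attempt.

```python
class Memory:
    def __init__(self):
        self.cells = {}
        self.versions = {}

    def load(self, key):
        return self.cells.get(key, 0), self.versions.get(key, 0)

    def store(self, key, value):
        self.cells[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1


class Log:
    """Per-attempt log of the versions read and the values to be written."""

    def __init__(self, mem):
        self.mem = mem
        self.read_versions = {}
        self.write_buffer = {}

    def read(self, key):
        if key in self.write_buffer:
            return self.write_buffer[key]
        value, version = self.mem.load(key)
        self.read_versions.setdefault(key, version)
        return value

    def write(self, key, value):
        self.write_buffer[key] = value

    def try_commit(self):
        # The log is consistent if nothing we read has changed since.
        for key, version in self.read_versions.items():
            if self.mem.load(key)[1] != version:
                return False
        for key, value in self.write_buffer.items():
            self.mem.store(key, value)
        return True


def run_atomic(mem, body, max_attempts=10):
    """Run body until its log commits; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        log = Log(mem)
        body(log, attempt)
        if log.try_commit():
            return attempt
    raise RuntimeError("transaction did not commit")


mem = Memory()
mem.store("counter", 10)


def increment(log, attempt):
    value = log.read("counter")
    if attempt == 1:
        mem.store("counter", 100)  # simulated write by another thread
    log.write("counter", value + 1)


attempts = run_atomic(mem, increment)
```

The first attempt is rolled back simply by discarding its log. An action such as I/O performed inside `increment` could not be discarded in this way, which is the first drawback noted above.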
In this paper, we explore the use of pessimistic synchronization 
techniques to implement atomic sections. We assume that a program
contains occurrences of fork e for creating multiple threads
and programmer-annotated atomic sections atomic e for protecting
shared data. For such a program, our algorithm automatically
constructs a set of locks and inserts the necessary lock acquires and 
releases before and after the body of each marked atomic section. 
A trivial implementation would be to begin and end all atomic sec- 
tions by, respectively, acquiring and releasing a single global lock. 
However, an important goal of our algorithm is to maximize par- 
allelism. We present an improved algorithm that uses much finer 
locking but still enforces atomicity, without introducing deadlock. 
We implement this algorithm in a tool called LOCKPICK, using the 
sharedness analysis performed by our race detection tool for C programs,
LOCKSMITH [10]. We present an overview of our algorithm
next, and describe it in detail in the rest of the paper. 
1.1 Overview 
The main idea of our approach is simple. We begin by performing a 
points-to analysis on the program, which maps each pointer in the 
program to an abstract name that represents the memory pointed 
to at run time. Then we can create one mutual exclusion lock 
for each abstract name from the points-to analysis and use it to 
guard accesses to the corresponding run-time memory locations. 
At the start of each atomic section, the compiler inserts code to 
acquire all locks that correspond to the abstract locations accessed 
within the atomic section. The locks are released when the section 
concludes. To avoid deadlock, locks are always acquired according 
to a statically-assigned total order. Since atomic sections might be 
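nested, locks must also be reentrant. Moreover, locations accessed
within an inner section are considered accessed in its surrounding
sections, to ensure that the global order is preserved.
² For the remainder of the paper, we use the term "atomic" liberally, to mean
$$
\begin{array}{llcl}
\text{expressions} & e & ::= & x \mid v \mid e_1\ e_2 \mid \mathtt{ref}\ e \mid\ !e \mid e_1 := e_2 \\
 & & \mid & \mathtt{if0}\ e_0\ \mathtt{then}\ e_1\ \mathtt{else}\ e_2 \\
 & & \mid & \mathtt{fork}^i\ e \mid \mathtt{atomic}^i\ e \\
\text{values} & v & ::= & n \mid \lambda x.e \\
\text{types} & \tau & ::= & \mathit{int} \mid \mathit{ref}^{\rho}\ \tau \mid (\tau, \varepsilon) \rightarrow^{\chi} (\tau', \varepsilon') \\
\text{labels} & l & ::= & \rho \mid \varepsilon \mid \chi \\
\text{constraints} & C & ::= & \emptyset \mid \{l \le l'\} \mid C \cup C
\end{array}
$$
Figure 1. Source Language, Types, and Constraints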
This approach ensures that no locations are accessed without 
holding their associated lock. Moreover, locks are not released 
during execution of an atomic section, and hence all accesses to 
locations within that section will be atomic with respect to other 
atomic sections [4]. Our algorithm assumes that shared locations 
are only accessed within atomic sections; this can be enforced with 
a small modification of our algorithm, or by using a race detection 
tool such as LOCKSMITH as a post-pass. 
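The basic scheme described above (one reentrant lock per abstract location, acquired in a fixed total order on entry and released on exit) can be sketched in Python. This is an illustration under stated assumptions rather than the code LOCKPICK generates: `LockTable`, its `atomic` method, and the location names `rho_a` and `rho_b` are hypothetical, and the total order is simply the sorted order of the names.

```python
import threading
from contextlib import contextmanager


class LockTable:
    def __init__(self, abstract_locations):
        names = sorted(abstract_locations)
        self.rank = {name: i for i, name in enumerate(names)}  # static total order
        self.locks = {name: threading.RLock() for name in names}  # reentrant

    @contextmanager
    def atomic(self, accessed):
        ordered = sorted(set(accessed), key=lambda name: self.rank[name])
        for name in ordered:              # acquire everything up front, in order
            self.locks[name].acquire()
        try:
            yield
        finally:
            for name in reversed(ordered):
                self.locks[name].release()


def run_transfers(n=1000):
    table = LockTable({"rho_a", "rho_b"})
    balance = {"a": 100, "b": 100}

    def transfer(src, dst):
        for _ in range(n):
            # The outer section lists the locations of its nested section too.
            with table.atomic({"rho_" + src, "rho_" + dst}):
                with table.atomic({"rho_" + src}):  # nested: lock already held
                    balance[src] -= 1
                balance[dst] += 1

    threads = [threading.Thread(target=transfer, args=pair)
               for pair in (("a", "b"), ("b", "a"))]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return balance


final = run_transfers(500)
```

The two threads name the same pair of locations in opposite roles, yet both acquire `rho_a` before `rho_b`, so neither can hold one lock while waiting for the other.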
Our algorithm performs two optimizations over the basic ap- 
proach. First, we reduce our consideration to only those abstract 
locations that may be shared between threads, since thread-local 
locations need not be protected by synchronization. Second, we observe
that some locks may be coalesced. In particular, if lock $\ell$ is
always held with lock $\ell'$, then lock $\ell'$ can safely be discarded.
We implement this approach in two main steps. First, we use
a context-sensitive points-to and effect analysis to determine the
shared abstract locations as well as the locations accessed within
an atomic section (Section 2.2). The points-to analysis is flow-insensitive,
but the effect analysis calculates per-program point
continuation effects that track the effect of the continuation of an
expression. Continuation effects let us model that only locations
that are used after a call to fork are shared. The sharing analysis
presented here is essentially unchanged from LOCKSMITH's
sharing analysis (with only the exception of context sensitivity for
simplicity), which has not been presented formally before.
Second, given the set of shared locations, we perform mutex inference
to determine an appropriate set of locks to guard accesses
to the shared locations (Section 3). This phase includes a straightforward
algorithm that performs mutex coalescence, to reduce the
number of locks while retaining the maximal amount of parallelism.
Our algorithm starts by assuming one lock per shared location
and iteratively coarsens this assignment, dropping unneeded
locks. The algorithm runs in time $O(mn^2)$, where $n$ is the number
of shared locations in the program and $m$ is the number of atomic
sections. We show that the resulting locking discipline provides exactly
the same amount of parallelism as the original, non-coalesced
locking discipline, while using fewer locks. Our
algorithm is not optimal, because it does not always reach the minimum
number of locks possible. Indeed, in Section 3.2 we prove
that using the minimum number of locks is an NP-hard problem.
2. Shared Location Inference 
Figure 1 shows the source language we use to illustrate our inference
system. Our language is a lambda calculus extended with integers,
comparisons, updatable references, thread creation $\mathtt{fork}^i\ e$,
and atomic sections $\mathtt{atomic}^i\ e$; in the latter two cases the $i$ is an
index used to refer to the analysis results. The expression $\mathtt{fork}^i\ e$
creates a new child thread that evaluates $e$ and discards the result,
continuing with normal evaluation in the parent thread. Our approach
can easily be extended to support polymorphism and polymorphic
recursion for labels in a standard way [11], as LOCKSMITH
does [10], but we omit rules for polymorphism because they
add complication but raise no important issues.
We use a type-based analysis to determine the set of abstract
locations $\rho$, created by ref, that could be shared between threads
in some program $e$. We compute this using a modified label flow
analysis [10, 11]. Our system uses three kinds of labels: location
labels $\rho$, effects $\chi$, and continuation effects $\varepsilon$. Effects of both kinds
represent those locations $\rho$ dereferenced or assigned to during a
computation. Typing a program generates label flow constraints of
the form $l \le l'$. Afterwards, these constraints are solved to learn the
desired information. The constraint $l \le l'$ is read "label $l$ flows to
label $l'$." For example, if $x$ has type $\mathit{ref}^{\rho}\ \tau$, and we have constraints
$\rho' \le \rho$ and $\rho'' \le \rho$, then $x$ may point to the locations $\rho'$ or $\rho''$.
Labels also flow to effects $\chi$ or $\varepsilon$, so for example if $\rho \le \chi$ then an
expression with effect $\chi$ may access location $\rho$.
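Solving such constraints amounts to graph reachability. The Python sketch below is one simple way to do it, offered as an illustration rather than as the solver used by LOCKSMITH: the function name `solve` and the string label names are assumptions, and each constraint $l \le l'$ is encoded as a pair `(l, l_prime)`.

```python
from collections import defaultdict


def solve(constraints, locations):
    """For each label, compute the set of location labels that flow to it."""
    successors = defaultdict(set)
    for lower, upper in constraints:
        successors[lower].add(upper)       # edge for "lower flows to upper"

    flows_into = defaultdict(set)
    for loc in locations:
        stack, seen = [loc], {loc}
        while stack:                       # depth-first search from loc
            label = stack.pop()
            flows_into[label].add(loc)
            for nxt in successors[label]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return dict(flows_into)


# rho' <= rho, rho'' <= rho, and rho <= chi, as in the example in the text.
constraints = [("rho1", "rho"), ("rho2", "rho"), ("rho", "chi")]
solution = solve(constraints, {"rho", "rho1", "rho2"})
```

In the result, `solution["rho"]` contains `rho1` and `rho2`, matching the statement that $x$ may point to either location, and `solution["chi"]` lists every location that an expression with effect $\chi$ may access.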
The typing judgment has the following form:
$$C; \varepsilon; \Gamma \vdash e : \tau^{\chi}; \varepsilon'$$
This means that in type environment $\Gamma$, expression $e$ has effect
type $\tau^{\chi}$ given constraints $C$. Effect types $\tau^{\chi}$ consist of a type $\tau$
annotated with the effect $\chi$ of $e$. Within the type rules, the judgment
$C \vdash l \le l'$ indicates that $l \le l'$ can be proven by the constraint
set $C$. In an implementation, such judgments cause us to generate
constraint $l \le l'$ and add it to $C$. Types include standard integer types;
updatable reference types $\mathit{ref}^{\rho}\ \tau$, each of which is decorated with a
location label $\rho$; and function types of the form $(\tau, \varepsilon) \rightarrow^{\chi} (\tau', \varepsilon')$,
where $\tau$ and $\tau'$ are the domain and range types, and $\chi$ is the effect
of calling the function. We explain $\varepsilon$ and $\varepsilon'$ on function types
momentarily.
The judgment $C; \varepsilon; \Gamma \vdash e : \tau^{\chi}; \varepsilon'$ is standard for effect inference
except for $\varepsilon$ and $\varepsilon'$, which express continuation effects. Here,
$\varepsilon$ is the input effect, which denotes locations that may be accessed
during or after evaluation of $e$. The output effect $\varepsilon'$ contains locations
that may be accessed after evaluation of $e$ (thus all locations
in $\varepsilon'$ will be in $\varepsilon$). We use continuation effects in the rule for fork e
to determine sharing. In particular, we infer that a location is shared
if it is in the input effect of the child thread and the output effect
of the fork (and thus may be accessed subsequently in the parent
thread).
In addition to continuation effects $\varepsilon$, we also compute the effects
$\chi$ of a lexical expression, stored as an annotation on the expression's
type. We use effects $\chi$ to compute all dereferences and assignments
that occur within the body of an atomic transaction. We
cannot simply use continuation effects $\varepsilon$, since those also include
all dereferences that happen in the continuation of the program after
the atomic section. Note that we cannot compute standard effects
given continuation effects $\varepsilon$. The effect of an expression $e$ is not
simply its input continuation effect minus the output continuation
effect, since that could remove locations accessed both within $e$ and
after it.
Returning to the explanation of function types, the effect label
$\varepsilon'$ denotes the set of locations accessed after the function returns,
while $\varepsilon$ denotes those locations accessed after the function is called,
including any locations in $\varepsilon'$.
Example. Consider the following program:
    let x = ref 0 in
    let y = ref 1 in
    x := 4;
    fork^1 (!x; !y);
    /* (1) */
    y := 5
In this program two variables x and y refer to memory locations. x 
is initialized and updated, but then is handed off to the child thread 
and no longer used by the parent thread. Hence x can be treated as 
thread-local. On the other hand, y is used both by the parent and 
child thread, and hence must be modeled as shared. 
Because we use continuation effects, we model this situation 
precisely. In particular, the input effect of the child thread is $\{x, y\}$.
The output effect of the fork (i.e., starting at (1)) is $\{y\}$. Since
$\{x, y\} \cap \{y\} = \{y\}$, we determine that only y is shared. If instead
we had used regular effects, and we simply intersected the effect 
of the parent thread with the child thread, we would think that x 
was shared even though it is handed off and never used again by 
the parent thread. 
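The computation in this example can be reproduced with a small backward pass over a straight-line program. The Python sketch below is a simplification for illustration, not the type-based inference itself: programs are hypothetical lists of `("access", locations)` and `("fork", child_statements)` entries, and `continuation_effects` is an assumed name.

```python
def continuation_effects(stmts, after=frozenset(), shared=None):
    """Walk a statement list backwards; return (input effect, shared set)."""
    if shared is None:
        shared = set()
    effect = set(after)                   # output effect of the last statement
    for kind, payload in reversed(stmts):
        if kind == "access":              # !x or x := ... touches these locations
            effect |= payload
        elif kind == "fork":              # payload is the child's statement list
            child_input, _ = continuation_effects(payload, frozenset(), shared)
            shared |= child_input & effect    # child input effect ∩ fork output effect
            effect |= child_input             # the fork's input includes the child's
        else:
            raise ValueError(kind)
    return effect, shared


program = [
    ("access", {"x"}),                    # x := 4
    ("fork", [("access", {"x", "y"})]),   # fork^1 (!x; !y)
    ("access", {"y"}),                    # y := 5
]
input_effect, shared = continuation_effects(program)
```

At the fork the output effect is `{"y"}` and the child's input effect is `{"x", "y"}`, so only `y` is reported as shared, as in the text.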
Moreover, the system that we present in this paper does not 
differentiate between read and write accesses, hence it will infer 
that read-only variables are shared. In practice, we wish to allow 
read-only values to be accessed freely by all threads. To do that, we 
differentiate between read and write effects, and do not consider 
values that only appear in the read effects of both threads to be 
shared. 
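One way to express this refinement is to track read and write effects separately and to treat a location as shared only if both threads access it and at least one of them writes it. The following Python sketch assumes the four effect sets have already been computed. The function `shared_locations` and the location name `cfg` are hypothetical.

```python
def shared_locations(child_reads, child_writes, parent_reads, parent_writes):
    """Locations accessed by both threads, excluding those that are only read."""
    child_all = child_reads | child_writes
    parent_all = parent_reads | parent_writes
    return (child_writes & parent_all) | (parent_writes & child_all)


# Child: !x; !y; !cfg.  Parent after the fork: y := 5; !cfg.
result = shared_locations(
    child_reads={"x", "y", "cfg"},
    child_writes=set(),
    parent_reads={"cfg"},
    parent_writes={"y"},
)
```

Here `cfg` is read by both threads but written by neither, so it is not reported, while `y` is still reported because the parent writes it.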
2.1 Type Rules
Figure 2 gives the type inference rules for sharing inference. We
discuss the rules briefly. [Id] and [Int] are straightforward. Notice
that since neither accesses any locations, the input and output
effects are the same, and their effect $\chi$ is unconstrained (and hence
will be empty during constraint resolution). In [Lam], the labels
$\varepsilon_{in}$ and $\varepsilon_{out}$ that are bound in the type correspond to the input
and output effects of the function. Notice that the input and output
effects of $\lambda x.e$ are both just $\varepsilon$, since the definition itself does not
access any locations: the code in $e$ will only be evaluated when
the function is applied. Finally, the effect $\chi$ of the function is drawn
from the effect of $e$.
In [App], the output effect $\varepsilon_1$ of evaluating $e_1$ becomes the input
effect of evaluating $e_2$. This implies a left-to-right order of evaluation:
any locations that may be accessed during or after evaluating
$e_2$ also may be accessed after evaluating $e_1$. The function is invoked
after $e_2$ is evaluated, and hence $e_2$'s output effect must be $\varepsilon_{in}$ from
the function signature. [Sub], described below, can always be used
to achieve this. Finally, notice that the effect of the application is
the effect $\chi$ of evaluating $e_1$, evaluating $e_2$, and calling the function.
[Sub] can be used to make these effects the same.
[Cond] is similar to [App], where one of $e_1$ or $e_2$ is evaluated
after $e_0$. We require both branches to have the same output effect
$\varepsilon'$ and regular effect $\chi$, and again we can use [Sub] to achieve this.
[Ref] creates and initializes a fresh location but does not have
any effect itself. This is safe because we know that location $\rho$
cannot possibly be shared yet.
[Deref] accesses location $\rho$ after $e$ is evaluated, and hence we
require that $\rho$ is in the continuation effect $\varepsilon'$ of $e$, expressed by the
judgment $C \vdash \rho \le \varepsilon'$. In addition, we require that the dereferenced
location is in the effects: $\rho \le \chi$. Note that [Sub] can be applied
before applying [Deref] so that this does not constrain the effect
of $e$. The rule for [Assign] is similar. Notice that the output effect
of $!e$ is the same as the effect $\varepsilon'$ of $e$. This is conservative because $\rho$
must be included in $\varepsilon'$ but may not be accessed again following the
evaluation of $!e$. However, in this case we can always apply [Sub]
to remove it.
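$$
\begin{array}{ll}
[\text{Id}] & C; \varepsilon; \Gamma, x : \tau \vdash x : \tau^{\chi}; \varepsilon \\[6pt]
[\text{Int}] & C; \varepsilon; \Gamma \vdash n : \mathit{int}^{\chi}; \varepsilon \\[6pt]
[\text{App}] & \dfrac{C; \varepsilon; \Gamma \vdash e_1 : \tau_{fun}^{\chi}; \varepsilon_1 \quad \tau_{fun} = (\tau_{in}, \varepsilon_{in}) \rightarrow^{\chi} (\tau_{out}, \varepsilon_{out}) \quad C; \varepsilon_1; \Gamma \vdash e_2 : \tau_{in}^{\chi}; \varepsilon_{in}}{C; \varepsilon; \Gamma \vdash e_1\ e_2 : \tau_{out}^{\chi}; \varepsilon_{out}} \\[6pt]
[\text{Cond}] & \dfrac{C; \varepsilon; \Gamma \vdash e_0 : \mathit{int}^{\chi}; \varepsilon_0 \quad C; \varepsilon_0; \Gamma \vdash e_1 : \tau^{\chi}; \varepsilon' \quad C; \varepsilon_0; \Gamma \vdash e_2 : \tau^{\chi}; \varepsilon'}{C; \varepsilon; \Gamma \vdash \mathtt{if0}\ e_0\ \mathtt{then}\ e_1\ \mathtt{else}\ e_2 : \tau^{\chi}; \varepsilon'} \\[6pt]
[\text{Ref}] & \dfrac{C; \varepsilon; \Gamma \vdash e : \tau^{\chi}; \varepsilon'}{C; \varepsilon; \Gamma \vdash \mathtt{ref}\ e : (\mathit{ref}^{\rho}\ \tau)^{\chi}; \varepsilon'} \\[6pt]
[\text{Deref}] & \dfrac{C; \varepsilon; \Gamma \vdash e : (\mathit{ref}^{\rho}\ \tau)^{\chi}; \varepsilon' \quad C \vdash \rho \le \varepsilon' \quad C \vdash \rho \le \chi}{C; \varepsilon; \Gamma \vdash\ !e : \tau^{\chi}; \varepsilon'} \\[6pt]
[\text{Assign}] & \dfrac{C; \varepsilon; \Gamma \vdash e_1 : (\mathit{ref}^{\rho}\ \tau)^{\chi}; \varepsilon_1 \quad C; \varepsilon_1; \Gamma \vdash e_2 : \tau^{\chi}; \varepsilon_2 \quad C \vdash \rho \le \varepsilon_2 \quad C \vdash \rho \le \chi}{C; \varepsilon; \Gamma \vdash e_1 := e_2 : \tau^{\chi}; \varepsilon_2} \\[6pt]
[\text{Fork}] & \dfrac{C; \varepsilon'_i; \Gamma \vdash e : \tau^{\chi}; \varepsilon''_i \quad C \vdash \varepsilon'_i \le \varepsilon \quad C \vdash \varepsilon_i \le \varepsilon}{C; \varepsilon; \Gamma \vdash \mathtt{fork}^i\ e : \mathit{int}^{\chi'}; \varepsilon_i} \\[6pt]
[\text{Atomic}] & \dfrac{C; \varepsilon; \Gamma \vdash e : \tau^{\chi_i}; \varepsilon'}{C; \varepsilon; \Gamma \vdash \mathtt{atomic}^i\ e : \tau^{\chi_i}; \varepsilon'} \\[6pt]
[\text{Sub}] & \dfrac{C; \varepsilon; \Gamma \vdash e : \tau^{\chi}; \varepsilon' \quad C \vdash \tau \le \tau' \quad C \vdash \chi \le \chi' \quad C \vdash \varepsilon'' \le \varepsilon'}{C; \varepsilon; \Gamma \vdash e : \tau'^{\chi'}; \varepsilon''}
\end{array}
$$
Figure 2. Type Inference Rules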
[Sub] introduces sub-effecting to the system. In this rule, we
implicitly allow $\chi'$ and $\varepsilon''$ to be fresh labels. In this way we can
always match the effects of subexpressions, e.g., of $e_1$ and $e_2$ in
[Assign], by creating a fresh variable $\chi$ and letting $\chi_1 \le \chi$ and
$\chi_2 \le \chi$ by [Sub], where $\chi_1$ and $\chi_2$ are effects of $e_1$ and $e_2$.
Notice that subsumption on continuation effects is contravariant:
whatever output effect $\varepsilon''$ we give to $e$, it must be included in its
original effect $\varepsilon'$. [Sub] also introduces subtyping via the judgment
$C \vdash \tau \le \tau'$, as shown in Figure 3. The subtyping rules are standard
except for the addition of effects in [Sub-Fun]. Continuation effects
are contravariant to the direction of flow of regular types, similarly
to the output effects in [Sub].
[Fork] models thread creation. The regular effect $\chi'$ of the
fork is unconstrained, since in the parent thread there is no effect.
The continuation effect $\varepsilon'_i$ captures the effect of the child thread
evaluating $e$, and the effect $\varepsilon_i$ captures the effect of the rest of the
parent thread's evaluation. To infer sharing (discussed in Section
2.2) we will compute $\varepsilon'_i \cap \varepsilon_i$; this is the set of locations that could
be accessed by both the parent and child thread after the fork.
$$
\begin{array}{ll}
 & C \vdash \mathit{int} \le \mathit{int} \\[6pt]
[\text{Sub-Ref}] & \dfrac{C \vdash \rho_1 \le \rho_2 \quad C \vdash \tau_1 \le \tau_2 \quad C \vdash \tau_2 \le \tau_1}{C \vdash \mathit{ref}^{\rho_1}\ \tau_1 \le \mathit{ref}^{\rho_2}\ \tau_2}
\end{array}
$$
Figure 3. Subtyping Rules
Notice that the input effect E: of the child thread is included in 
the input effect of the f o r k  itself. This effectively causes a parent to 
"inherit" its child's effects, which is importantfor capturing sharing 
between two child threads. Consider, for example, the following 
program: 
    let x = ref 0 in
      fork^1 (!x);
      /* (1) */
      fork^2 (x := 2)
Notice that while x is created in the parent thread, it is only accessed
in the two child threads. Let $\rho$ be the location of x. Then $\rho$
is included in the continuation effect at point (1), because the effect
of the child thread fork^2 (x := 2) is included in the effect of the call
at (1). Thus when we compute the intersection of the input effect
of fork^1 (!x) with the output effect of the parent (which starts at
(1)), the result will contain $\rho$, which we will hence determine to be
shared.
Finally, [Atomic] models atomic sections, which have no effect
on sharing. During mutex inference, we will use the solution to the
effect $\chi_i$ of each atomic section to infer the needed locks. Notice
that the effect of atomic^i e is the same as the effect of e; this will
ensure that atomic sections compose properly and do not introduce
deadlock.
Soundness. Standard label flow and effect inference has been
shown to be sound [8, 11], including polymorphic label flow
inference. We believe it is straightforward to show that continuation
effects are a sound approximation of the locations accessed by the
continuation of an expression.
2.2 Computing Sharing
Similarly to standard type-based label flow analysis, we apply the
type inference rules in Figures 2 and 3, which produce a set of label
flow constraints $C$. One can think of these constraints as forming a
directed graph, where each label forms a node and every constraint
$l \le l'$ is represented as a directed edge from $l$ to $l'$. Then for each
label $l$, we compute the set $S(l)$ of location labels $\rho$ that "flow" to $l$
by transitively closing the graph. The total time to transitively close
the graph is $O(n^2)$, where $n$ is the number of nodes in the graph.
(Given a polymorphic inference system, we could compute label
flow using context-free language reachability in time cubic in the
size of the type-annotated program.)
Unlike standard type-based label flow analysis, our label flow
graph includes labels $\varepsilon$ to encode continuation effects. Recall that
we define input and output continuation effects $\varepsilon, \varepsilon'$ for every
expression $e$ in the program. In the solved points-to graph, the flow
solutions of $\varepsilon, \varepsilon'$ include all location labels that are accessed by the
continuation of the program after the expression $e$; the solution of
$\varepsilon$ moreover includes the effect of $e$.
Once we have computed $S(\varepsilon)$ for all effect labels $\varepsilon$, we visit
each fork^i in the program. Then the set of shared locations for the
program, $\mathit{shared}$, is given by
\[ \mathit{shared} = \bigcup_i \left( S(\varepsilon_i^1) \cap S(\varepsilon_i^2) \right) \]
In other words, any locations accessed in the continuation of a
parent and its child threads at a fork are shared.
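The sharing computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the constraint graph is given as a list of edges, the label names (`rho_x`, `child1_in`, `parent1_out`, and so on) are hypothetical stand-ins for the labels the type inference would generate for the two-fork example, and reachability is computed by a plain depth-first search from each location rather than by any particular closure algorithm from the paper.

```python
from collections import defaultdict

def flow_sets(edges, locations):
    """Compute S(l) for every label l: the location labels that flow to l."""
    succ = defaultdict(set)
    for src, dst in edges:              # an edge src -> dst encodes src <= dst
        succ[src].add(dst)
    S = defaultdict(set)
    for rho in locations:               # push each location along the edges
        stack, seen = [rho], {rho}
        while stack:
            label = stack.pop()
            S[label].add(rho)
            for nxt in succ[label]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return S

def shared_locations(S, forks):
    """forks: (child input effect, parent output effect) label pairs, one per fork."""
    shared = set()
    for child_in, parent_out in forks:
        shared |= S[child_in] & S[parent_out]
    return shared

# Hypothetical constraints for the two-fork example (names are assumptions).
edges = [
    ("rho_x", "child1_in"),        # fork^1 reads x
    ("rho_x", "child2_in"),        # fork^2 writes x
    ("child2_in", "parent1_out"),  # the parent at (1) inherits fork^2's effect
]
forks = [("child1_in", "parent1_out"), ("child2_in", "parent2_out")]
S = flow_sets(edges, ["rho_x"])
shared = shared_locations(S, forks)
```

In this sketch the location of x ends up in `shared` because it reaches both the input effect of the first child and the parent's continuation after the first fork.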
3. Mutex Inference
Given the set of shared locations, the next step is to compute a
set of locks used to guard all of the shared locations. A simple
and correct solution is to associate a lock $\ell_\rho$ with each shared
location $\rho \in \mathit{shared}$. Then at the beginning of a section atomic^i e,
we acquire all locks associated with locations in $\chi_i$. To prevent
deadlock, we also impose a total ordering on all the locks, acquiring
the locks in that order.
This approach is sound and in general allows more parallelism
than the naive approach of using a single lock for all atomic
sections.^1 However, a program of size $n$ may have $O(n)$ locations,
and acquiring that many locks would introduce unwanted overhead,
particularly on a multi-processor machine. Thus we would like to
use fewer locks while maintaining the same level of parallelism.
Computing a minimum set of locks is NP-hard, as shown in
Section 3.2. We propose an efficient but non-optimal algorithm based
on the following observation: if two locations are always accessed
together, then they can be protected by the same mutex without any
loss of parallelism.
DEFINITION 1 (Dominates). We say that accesses to location $\rho$
dominate accesses to location $\rho'$, written $\rho \ge \rho'$, if every atomic
section containing an access to $\rho'$ also contains an access to $\rho$.
We write $\rho > \rho'$ for strict domination, i.e., $\rho \ge \rho'$ and $\rho \ne \rho'$.
Thus, whenever $\rho > \rho'$ we can use $\rho$'s mutex to protect both $\rho$
and $\rho'$. Notice that the dominates relationship is not symmetric. For
example, we might have a program containing two atomic sections,
atomic (!x; !y) and atomic !x. In this program, the location of
x dominates the location of y but not vice versa. Domination is
transitive, however.
Computing the dominates relationship is straightforward. For
each location $\rho$, we initially assume $\rho \ge \rho'$ for all locations
$\rho'$. Then for each atomic^i e in the program, if $\rho' \in S(\chi_i)$ but
$\rho \notin S(\chi_i)$, then we remove our assumption $\rho \ge \rho'$. This takes
time $O(m\,|\mathit{shared}|)$ for each $\rho$, where $m$ is the number of atomic
sections. Thus in total this takes time $O(m\,|\mathit{shared}|^2)$ for all
locations.
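The refinement procedure just described can be sketched as follows. The representation is an assumption made for illustration: each atomic section is given directly as the set of locations in its solved effect, and the function name `dominates` is hypothetical.

```python
def dominates(access_sets, shared):
    """Return dom, where dom[p] is the set of locations p2 with p >= p2."""
    # Initially assume every location dominates every location.
    dom = {p: set(shared) for p in shared}
    for section in access_sets:
        accessed = section & shared
        for p in shared:
            if p not in accessed:
                # p is missing from a section that accesses `accessed`,
                # so p cannot dominate any of those locations.
                dom[p] -= accessed
    return dom

# The two sections from the text: atomic (!x; !y) and atomic !x.
sections = [{"x", "y"}, {"x"}]
dom = dominates(sections, {"x", "y"})
```

Here `dom["x"]` contains `y`, while `dom["y"]` does not contain `x`, matching the observation that the relation is not symmetric.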
Given the dominates relationship, we then compute a set of
locks to guard shared locations using the following algorithm:
ALGORITHM 2 (Mutex Selection). Computes a mapping $L : \rho \to \ell$
from locations $\rho$ to lock names $\ell$. We call $L$ a mutex selection
function.
1. For each $\rho \in \mathit{shared}$, set $L(\rho) = \ell_\rho$
2. For each $\rho \in \mathit{shared}$
3.   If there exists $\rho' > \rho$, then
4.     For each $\rho''$ such that $L(\rho'') = \ell_\rho$
5.       $L(\rho'') := \ell_{\rho'}$
^1 If we had a more discerning points-to analysis, or if we acquired the locks
piecemeal within the atomic section, rather than all at the start [9], we would
do even better. We consider this issue at the end of the next section.
In each step of the algorithm, we pick a location $\rho$ and replace all
occurrences of its lock by a lock of any of its dominators. Notice
that the order in which we visit the set of locks is unspecified,
as is the particular dominator to pick. We prove below that this
algorithm maintains maximum parallelism, no matter the ordering.
Mutex selection takes time $O(|\mathit{shared}|^2)$, since for each location $\rho$
we must examine $L$ for every other shared location.
The combination of computing the dominates relationship and
mutex selection yields mutex inference. We pick a total ordering on
all the locks in $\mathit{range}(L)$. Then we replace each atomic^i e in the
program with code that first acquires all the locks in $L(S(\chi_i))$ in
order, performs the actions in $e$, and then releases all the locks. Put
together, computing the dominates relationship and mutex selection
takes $O(m\,|\mathit{shared}|^2)$ time.
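As a rough illustration of the generated code, the following Python sketch acquires a section's locks in a fixed total order and releases them afterwards. It is not the paper's code generator: the helper `atomic`, the lock names, and the choice of sorting lock names alphabetically as the total order are all assumptions made for the example.

```python
import threading
from contextlib import contextmanager

@contextmanager
def atomic(lock_names, lock_table):
    """Hold every named lock for the duration of the block, acquired in sorted order."""
    ordered = [lock_table[name] for name in sorted(set(lock_names))]
    acquired = []
    try:
        for lock in ordered:            # same global order in every section
            lock.acquire()
            acquired.append(lock)
        yield
    finally:
        for lock in reversed(acquired):
            lock.release()

lock_table = {"l_a": threading.Lock(), "l_b": threading.Lock()}
counter = {"a": 0, "b": 0}

def worker():
    for _ in range(1000):
        # The set is unordered; the helper imposes the total order.
        with atomic({"l_b", "l_a"}, lock_table):
            counter["a"] += 1
            counter["b"] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every section takes its locks in the same order, two sections that need overlapping lock sets cannot wait on each other in a cycle.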
Examples. To illustrate the algorithm, consider the set of accesses
of the atomic sections in the program. For clarity we simply list
the accesses, using English letters to stand for locations. For
illustration purposes we also assume all locations are shared. For a first
example, suppose there are three atomic sections with the following
pattern of accesses
Then we have $a > b$, $a > c$, and $b > c$. Initially $L(a) = \ell_a$,
$L(b) = \ell_b$, and $L(c) = \ell_c$. Suppose in the first iteration of
the algorithm location $c$ is chosen, and we pick $b > c$ as the
dominates relationship to use. Then after one iteration, we will
have $L(c) = \ell_b$. On a subsequent iteration, we will eventually pick
location $b$ with $a > b$, and set $L(b) = L(c) = L(a) = \ell_a$. It is
easy to see that this same solution will be computed no matter the
choices made by the algorithm. And this solution is what we want:
since $b$ and $c$ are always accessed along with $a$, we can eliminate
$b$'s lock and $c$'s lock.
As another example, suppose we have the following access
pattern:
Then we have $a > c$ and $b > c$. The only interesting step of the
algorithm is when it visits node $c$. In this case, the algorithm can
either set $L(c) = \ell_a$ or $L(c) = \ell_b$. However, $\ell_a$ and $\ell_b$ are still kept
disjoint. Hence upon entering the left-most section $\ell_a$ is acquired,
and upon entering the right-most section $\ell_b$ is acquired. Thus the
left- and right-most sections can run concurrently with each other.
Upon entering the middle section we must acquire both $\ell_a$ and $\ell_b$,
and hence no matter what choice the algorithm made for $L(c)$, the
lock guarding it will be held.
This second example shows why we do not use a naive approach
such as unifying the locks of all locations accessed within an atomic
section. If we did so here, we would choose $L(a) = L(b) =
L(c)$. This answer would be safe but we could not concurrently
execute the left-most and right-most sections.
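The following Python sketch runs mutex selection on two small inputs. The access sets are assumptions chosen to be consistent with the relations stated in the two examples ($a > b > c$ in the first; $a > c$ and $b > c$ in the second); they are not taken from the original figures. The function names, the `l_` naming of locks, and the fixed visiting order are likewise illustrative choices, since the algorithm leaves the order and the dominator unspecified.

```python
def strict_dominators(access_sets, shared):
    """dom[p] = locations q != p such that every section accessing p also accesses q."""
    dom = {}
    for p in shared:
        containing = [s for s in access_sets if p in s]
        dom[p] = {q for q in shared
                  if q != p and all(q in s for s in containing)}
    return dom

def select_mutexes(access_sets, shared):
    """Coalesce each location's lock into the lock of one of its strict dominators."""
    dom = strict_dominators(access_sets, shared)
    L = {p: "l_" + p for p in shared}       # step 1: one lock per location
    for p in sorted(shared):                # step 2: any visiting order works
        if dom[p]:                          # step 3: some dominator exists
            target = "l_" + min(dom[p])     # any dominator may be picked
            old = "l_" + p
            for q in shared:                # steps 4-5: rename l_p everywhere
                if L[q] == old:
                    L[q] = target
    return L

locs = {"a", "b", "c"}
# Assumed access sets giving a > b, a > c, b > c.
L1 = select_mutexes([{"a"}, {"a", "b"}, {"a", "b", "c"}], locs)
# Assumed access sets giving a > c and b > c (left, middle, right sections).
L2 = select_mutexes([{"a"}, {"a", "b", "c"}, {"b"}], locs)
```

In the first case all three locations end up guarded by a single lock; in the second, the locks of `a` and `b` stay distinct and `c` shares one of them.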
3.1 Correctness
First, we formalize the problem of mutex inference with respect to
the points-to analysis, and prove that our mutex inference algorithm
produces a correct solution. Let $S_i = S(\chi_i)$, where $\chi_i$ is the effect
of atomic section atomic^i e.
DEFINITION 3 (Parallelism). The parallelism of a program is a set
\[ P = \{ (i, j) \mid S_i \cap S_j = \emptyset \} \]
In other words, the parallelism of a program is the set of all pairs of
atomic sections that could safely execute in parallel, because they
access no common locations.
We define the parallelism allowed by a given mutex selection
function $L$ similarly, where we overload the meaning of $L$ to apply
to sets of locations and return sets of mutexes: $L(S_i) = \{ L(\rho) \mid
\rho \in S_i \}$.
DEFINITION 4 (Parallelism of L). The parallelism of a mutex
selection function $L : \rho \to \ell$, written $P(L)$, is defined as
\[ P(L) = \{ (i, j) \mid L(S_i) \cap L(S_j) = \emptyset \} \]
The parallelism $P(L)$ is the set of all possible pairs of atomic
sections that could execute in parallel because they have no common
associated locks. Let $L$ be the mutex selection function calculated
by our algorithm. The objective of mutex inference is to compute
a solution $L$ that allows the maximum parallelism possible without
breaking atomicity.
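The two definitions translate directly into set computations. The sketch below is illustrative only: it represents pairs as unordered index pairs with $i < j$, and the access sets and the two candidate mappings (`fine` and `coarse`) are assumed examples rather than data from the paper.

```python
from itertools import combinations

def parallelism(sets):
    """Index pairs (i, j), i < j, whose sets are disjoint."""
    return {(i, j) for i, j in combinations(range(len(sets)), 2)
            if not (sets[i] & sets[j])}

def lock_sets(access_sets, L):
    """Lift L to sets of locations: L(S_i) = { L(p) | p in S_i }."""
    return [{L[p] for p in s} for s in access_sets]

sections = [{"a"}, {"a", "b", "c"}, {"b"}]       # assumed access sets
fine = {"a": "l_a", "b": "l_b", "c": "l_a"}      # c shares a's lock
coarse = {"a": "l", "b": "l", "c": "l"}          # a single global lock

P = parallelism(sections)
P_fine = parallelism(lock_sets(sections, fine))
P_coarse = parallelism(lock_sets(sections, coarse))
```

With these inputs `P_fine` equals `P`, whereas the single-lock mapping loses the one pair of sections that could have run in parallel.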
LEMMA 1. If $L(\rho) = \ell_{\rho'}$, then $\rho' \ge \rho$.
PROOF. We prove this by induction on the number of iterations
of step 2 of the algorithm. Clearly this holds for the initial mutex
selection function $L_0(\rho) = \ell_\rho$, where we mark the function $L$
that the algorithm has computed so far with a subscript denoting
the current iteration. Then suppose it holds for $L_k$, the selection
function after $k$ iterations of step 2. For an arbitrary $\rho_1 \in \mathit{shared}$,
there are two cases:
1. If $L_k(\rho_1) = \ell_\rho$ then $L_{k+1}(\rho_1) = \ell_{\rho'}$. By induction $\rho \ge
\rho_1$, and since $\rho' > \rho$ by assumption, we have $\rho' \ge \rho_1$ by
transitivity.
2. Otherwise, there exists some $\rho_2$ such that $L_k(\rho_1) = L_{k+1}(\rho_1) =
\ell_{\rho_2}$, and hence by induction $\rho_2 \ge \rho_1$.
LEMMA 2 (Correctness). If $L$ is the mutex selection function
computed by the above algorithm, then $P(L) = P$.
In other words, the algorithm will not let more sections execute
in parallel than allowed, and it allows as much parallelism as the
uncoalesced, one-lock-per-location approach.
PROOF. We prove this by induction on the number of iterations of
step 2 of the algorithm. For the base case, the initial mutex selection
function $L_0(\rho) = \ell_\rho$ clearly satisfies this property, because there is
a one-to-one mapping between each location and each lock. For the
induction step, assume $P = P(L_k)$ and for step 2 we have $\rho' > \rho$.
Let $L_{k+1}$ be the mutex selection function after this step. Pick any $i$
and $j$. Then there are two directions to show.
$P(L_{k+1}) \subseteq P$. Assume this is not the case. Then there exist
$i, j$ such that $(i, j) \in P(L_{k+1})$ and $(i, j) \notin P$. From the latter
we get $S_i \cap S_j \ne \emptyset$. Then clearly there exists a $\rho'' \in S_i \cap S_j$,
and since $L_{k+1}$ is a total function, there must exist an $\ell$ such that
$L_{k+1}(\rho'') = \ell$. But then $(i, j) \notin P(L_{k+1})$ since $L_{k+1}(S_i) \cap
L_{k+1}(S_j) \ne \emptyset$. Therefore $P(L_{k+1}) \subseteq P$.
$P(L_{k+1}) \supseteq P$. Assume this is not the case. Then there exist
$i, j$ such that $(i, j) \notin P(L_{k+1})$ and $(i, j) \in P$. From the latter
we get $S_i \cap S_j = \emptyset$. Also, from the induction hypothesis
$L_k(S_i) \cap L_k(S_j) = \emptyset$, and we have $L_{k+1}(S_i) = L_k(S_i)[\ell_\rho \mapsto
\ell_{\rho'}]$, and similarly for $L_{k+1}(S_j)$. Suppose that $\ell_\rho \notin L_k(S_i)$ and
$\ell_\rho \notin L_k(S_j)$. Then clearly $L_{k+1}(S_i) \cap L_{k+1}(S_j) = \emptyset$, which
contradicts $(i, j) \notin P(L_{k+1})$.
Otherwise suppose without loss of generality that $\ell_\rho \in L_k(S_i)$.
Then by assumption $\ell_\rho \notin L_k(S_j)$. So clearly the renaming $[\ell_\rho \mapsto
\ell_{\rho'}]$ cannot add $\ell_{\rho'}$ to $L_{k+1}(S_j)$. Thus in order to show $L_{k+1}(S_i) \cap
L_{k+1}(S_j) = \emptyset$, we need to show $\ell_{\rho'} \notin L_k(S_j)$. Since $\ell_\rho \in
L_k(S_i)$, we know there exists a $\rho'' \in S_i$ such that $L_k(\rho'') = \ell_\rho$,
which by Lemma 1 implies $\rho \ge \rho''$. But then from $\rho' > \rho$ we have
$\rho' \in S_i$. Also, since $S_i \cap S_j = \emptyset$, we have $\rho' \notin S_j$. So suppose for
contradiction that $\ell_{\rho'} \in L_k(S_j)$. Then there must exist a $\rho''' \in S_j$
such that $L_k(\rho''') = \ell_{\rho'}$. But then by Lemma 1, we have $\rho' \ge \rho'''$.
Then $\rho' \in S_j$, a contradiction. Hence we must have $\ell_{\rho'} \notin L_k(S_j)$,
and therefore $L_{k+1}(S_i) \cap L_{k+1}(S_j) = \emptyset$, which again contradicts
$(i, j) \notin P(L_{k+1})$. Therefore $P(L_{k+1}) \supseteq P$.
Although our algorithm maintains the maximum amount of parallelism,
it may use more than the minimum number of locks. Ideally,
we would like to solve the following problem:
DEFINITION 5 (k-Mutex Inference). Given a parallel program $e$
and an integer $k$, is there a mutex selection function $L$ for which
$|\mathit{range}(L)| = k$ and $P(L) = P$?
From this, we can state the minimum mutex inference problem.
DEFINITION 6 (Minimum Mutex Inference). Given a parallel program
$e$, find the minimum $k$ for which there is a mutex selection
function $L$ having $|\mathit{range}(L)| = k$ and $P(L) = P$.
However, it turns out that the above problem is NP-hard. We
prove this by reducing minimum edge clique cover to the mutex
inference problem.
DEFINITION 7 (Edge Clique Cover of size k). Given a graph $G =
(V, E)$ and a number $k$, is there a set of cliques $W_1, \ldots, W_k \subseteq V$
such that for every edge $(v, v') \in E$, there exists some $W_i$ that
contains both $v$ and $v'$?
DEFINITION 8 (Minimum Edge Clique Cover). Given a graph
$G = (V, E)$, find the minimum $k$ for which there is an edge clique
cover of size $k$ for $G$.
LEMMA 3. Minimum Mutex Inference is NP-hard.
PROOF. The proof is by reduction from the Minimum Edge Clique
Cover problem. Specifically, given a graph $G = (V, E)$, we can
construct in polynomial time a program $e$ such that there exists a
mutex selection function $L$ for $e$ for which $|\mathit{range}(L)| = k$ and
$P(L) = P$ if and only if there exists an edge clique cover of size $k$
for $G$.
The construction algorithm is:
For every vertex $v_i \in V$, create an atomic transaction $a_i$.
For every edge $(v_i, v_j) \in E$, create a fresh global location $\rho_{ij}$,
and add a dereference of $\rho_{ij}$ in the body of both $a_i$ and $a_j$.
Note that the only location that can be accessed in both of two
atomic transactions $a_i$ and $a_j$ is $\rho_{ij}$, since there can be only one
edge between $v_i$ and $v_j$. Figure 4(b) shows the program created for
the graph in Figure 4(a).
(a) A simple graph (vertices a, b, c, d; edges a-b, a-c, b-c, c-d).
    atomic^a {x_ab := 1; x_ac := 2}
    atomic^b {x_ab := 3; x_bc := 4}
    atomic^c {x_ac := 6; x_bc := 7; x_cd := 5}
    atomic^d {x_cd := 8}
(b) The corresponding atomic transactions.
Figure 4. Reduction Example
Case $\Rightarrow$. Suppose that there exists a selection function $L$ and an
integer $k$, such that $|\mathit{range}(L)| = k$. Then we can construct an edge
clique cover $W_1, \ldots, W_k$ for $G$, where $W_i \subseteq V$ for $1 \le i \le k$. We
construct these sets as follows. For every lock $\ell_i \in \mathit{range}(L)$, we
construct the set $W_i \subseteq V$ by adding to $W_i$ all vertices $v_j$ such that
$\ell_i \in L(a_j)$. Here by $L(a_j)$ we mean the set of locks computed by
applying $L$ to every $\rho$ dereferenced in $a_j$. To prove $W_1, \ldots, W_k$ is
an edge clique cover, we must show that each $W_i$ is a clique on $G$,
and that all cliques cover $E$.
The first claim is easily proved by contradiction: assume $W_i$ is
not a clique on $G = (V, E)$; then there exists a pair of vertices
$v_m, v_n \in W_i$ such that the edge $(v_m, v_n) \notin E$. In that case,
there is no location $\rho_{mn}$ created by the reduction algorithm that
is accessed in both $a_m$ and $a_n$. We then have by definition
that $(m, n) \in P$, i.e., $a_m$ and $a_n$ can be executed in parallel. But,
since $v_m, v_n \in W_i$, we get by construction of $W_i$ that there must
exist a lock $\ell_i$ such that $\ell_i \in L(a_m)$ and $\ell_i \in L(a_n)$. This would
mean that $(m, n) \notin P(L)$, because both $a_m$ and $a_n$ acquire $\ell_i$.
Hence, we get $P(L) \ne P$, a contradiction.
We also claim that the set of cliques $W_i$, $1 \le i \le k$, covers all
the edges in $E$. To prove this, assume that it does not: then there
exists an edge $(v_m, v_n) \in E$, but there is no clique $W_i$ covering
that edge, i.e., there is no $W_i$ such that $v_m \in W_i$ and $v_n \in W_i$,
for $1 \le i \le k$. By construction we have that the location $\rho_{mn}$ is
accessed in both atomic transactions $a_m$ and $a_n$. By the definition
of $L$, there must be a lock $\ell_i$ such that $L(\rho_{mn}) = \ell_i$. Since both
$a_m$ and $a_n$ access $\rho_{mn}$, the lock $\ell_i$ is held during both. In that
case, there exists a clique $W_i$ that contains both $v_m$ and $v_n$. This
contradicts the assumption, therefore all edges in $E$ are covered by
the cliques $W_1, \ldots, W_k$.
To illustrate, suppose the lock selection function $L$ for the program
of Figure 4(b) uses 3 locks to synchronize this program, as
follows:
Then the clique cover we construct for the graph for this mutex
selection will include 3 cliques, one per lock in the range of $L$. $W_1$
will include all the atomic sections that must acquire $\ell_1$, which is
a, b and c; $W_2$ will include a, b, and c; and $W_3$ will include c and d.
Together, $W_1$, $W_2$, and $W_3$ form an edge clique cover of size 3.
Case $\Leftarrow$. Suppose there exists an edge clique cover $W_1, \ldots, W_k$ for
the graph $G$. Then we can construct a mutex selection function $L$
for $e$ such that $|\mathit{range}(L)| = k$ and $P(L) = P$. We do this as
follows. For every clique $W_i$ we create a lock $\ell_i$. Then for every
$v_m, v_n \in W_i$ we set $L(\rho_{mn}) = \ell_i$.
Clearly, $|\mathit{range}(L)| = k$. It remains to show $P(L) = P$. First,
we show $P \subseteq P(L)$. Let $(m, n) \in P$, meaning that two atomic
blocks $a_m$ and $a_n$ in the constructed program $e$ can run in parallel,
i.e., $a_m$ and $a_n$ do not access any variable in common. Therefore,
by construction of the program $e$, graph $G$ cannot include the edge
$(v_m, v_n)$. This means that there is no clique $W_i$ containing both
$v_m$ and $v_n$. Then, there is no lock $\ell_i$ that is held during both $a_m$
and $a_n$, which gives $(m, n) \in P(L)$. Now we show $P(L) \subseteq P$.
If $(m, n) \in P(L)$ then there is no lock $\ell_i$ that is held for both $a_m$
and $a_n$. From the construction of $L$ we get that there is no clique
$W_i$ that contains both $v_m$ and $v_n$, therefore there is no edge in $G$
between $v_m$ and $v_n$. So, there is no common location $\rho_{mn}$ accessed
by $a_m$ and $a_n$, which means $(m, n) \in P$.
For example, the graph of Figure 4(a) has a 2-clique cover
(which is also the minimum): $W_1 = \{a, b, c\}$ and $W_2 = \{c, d\}$.
The corresponding mutex selection function
would use 2 mutexes: $\ell_1$ to protect $x_{ab}$, $x_{bc}$ and $x_{ac}$, and $\ell_2$ to
protect $x_{cd}$.
Finally, the complexity of constructing a mutex inference problem
$e$ given a graph $G = (V, E)$ is obviously $O(|V| + |E|)$, and
the complexity of constructing an edge clique cover given a mutex
selection function $L$ on $e$ is obviously $O(k \cdot |V|)$.
To sum up, we have shown that edge clique cover is polyno- 
mially reducible to mutex inference. Since Minimum Edge Clique 
Cover is NP-hard, we have proved that Minimum Mutex Inference 
is also NP-hard. 
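The reduction can be exercised on the graph of Figure 4 with a short Python sketch. Everything here is an illustrative assumption layered on the construction in the proof: atomic sections are modelled as sets of location names, locations are named `x_uv` after their edge, and when an edge lies in more than one clique the sketch simply uses the first clique that contains it.

```python
from itertools import combinations

def edge_location(u, v):
    """Name of the fresh location created for edge (u, v)."""
    return "x_" + "".join(sorted((u, v)))

def program_from_graph(vertices, edges):
    """One atomic section per vertex, accessing one location per incident edge."""
    access = {v: set() for v in vertices}
    for u, v in edges:
        access[u].add(edge_location(u, v))
        access[v].add(edge_location(u, v))
    return access

def selection_from_cover(cover, edges):
    """Guard each edge location with the lock of a clique containing both endpoints."""
    L = {}
    for u, v in edges:
        L[edge_location(u, v)] = next(
            "l%d" % (i + 1) for i, W in enumerate(cover) if u in W and v in W)
    return L

def parallel_pairs(sets_by_section):
    """Pairs of sections whose associated sets are disjoint."""
    return {(m, n) for m, n in combinations(sorted(sets_by_section), 2)
            if not (sets_by_section[m] & sets_by_section[n])}

V = ["a", "b", "c", "d"]
E = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
access = program_from_graph(V, E)
L = selection_from_cover([{"a", "b", "c"}, {"c", "d"}], E)
locks = {v: {L[p] for p in access[v]} for v in V}
```

For this graph the two-clique cover yields two locks, and the pairs of sections that may run in parallel under `locks` are the same as those with disjoint access sets.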
4. Discussion 
One restriction of our analysis is that it always produces a finite 
set of locks, even though programs may use an unbounded amount 
of memory. Consider the case of a linked list in which atomic 
sections only access the data in one node of the list at a time. In 
this case, we could potentially add per-node locks plus one lock 
for the list backbone. In our current algorithm, however, since 
all the lock nodes are aliased, we would instead infer only the 
list backbone lock and use it to guard all accesses to the nodes. 
LOCKSMITH [10] provides special support for the per-node lock
case by using existential types, and we have found it improves 
precision in a number of cases. It would be useful to adapt our 
approach to infer these kinds of locks within data structures. One 
challenge in this case is maintaining lock ordering, since locks 
would be dynamically generated. A simple solution would be to 
use the run-time address of the lock as part of the order. 
Our algorithm is correct only if all accesses to shared locations 
occur within atomic sections [4]. Otherwise, some location could 
be accessed simultaneously by concurrent threads, creating a data 
race and violating atomicity. We could address this problem in two 
ways. The simplest thing to do would be to run LOCKSMITH on 
the generated code to detect whether any races exist. Alternatively, 
we could modify the sharing analysis to distinguish two kinds of 
effects: those within an atomic section, and those outside of one. If 
some location p is in the latter category, and p E shared, then we 
have a potential data race we can signal to the programmer. 
Our work is closely related to McCloskey et al.'s Autolocker
[9], which also seeks to use locks to enforce atomic sections. There 
are two main differences between our work and theirs. First, Au- 
tolocker requires programmers to annotate potentially shared data 
with the lock that guards that location. In our approach, such a 
lock is inferred automatically. However, in Autolocker, program- 
mers may specify per-node locks, as in the above list example, 
whereas in our case such fine granularity is not possible. Second, 
Autolocker may not acquire all locks at the beginning of an atomic 
section, as we do, but rather delay until the protected data is actually
dereferenced for the first time. This admits better parallelism,
but makes it harder to ensure the lack of deadlock. Our approaches 
are complementary: our algorithm could generate the needed locks 
and annotations, and then use Autolocker for code generation. 
Flanagan et al. [3] have studied how to infer sections of Java
programs that behave atomically, assuming that all synchronization
has been inserted manually. Conversely, we assume the programmer
designates the atomic section, and we infer the synchronization.
Later work by Flanagan and Freund [2] looks at adding
missing synchronization operations to eliminate data races or atom- 
icity violations. However, this approach only works when a small 
number of synchronization operations are missing. 
We are in the process of implementing our mutex inference 
algorithm as part of a tool called LOCKPICK, which inserts locking 
operations in a given program with marked atomic transactions. 
LOCKPICK uses the points-to and effect analysis of LOCKSMITH 
to find all shared locations. The analysis extends the formal system 
described earlier to include label polymorphism, adding context 
sensitivity. LOCKPICK uses a C type attribute to mark a function 
as atomic. For example, in the following code: 
    int foo(int arg) __attribute__((atomic)) {
      // atomic code
    }
the function foo is assumed to contain an atomic section.
We expect LOCKPICK will be a good fit for handling concurrency
in Flux [1], a component language for building server applications.
Flux defines concurrency at the granularity of individual
components, which are essentially a kind of function. The programmer
can then specify which components (or compositions of
components) must execute atomically, and our tool will do the rest.
Right now, programmers have to specify locking manually. We plan 
to integrate LOCKPICK with Flux in the near future. 
5. Conclusion 
We have presented a system for inferring locks to support atomic 
sections in concurrent programs. Our approach uses points-to and 
effects analysis to infer those locations that are shared between 
threads. We then use mutex inference to determine an appropriate 
set of locks for protecting accesses to shared data within an atomic 
section. We have proven that mutex inference provides the same 
amount of parallelism as if we had one lock per location. 
In addition to the aforementioned ideas for making our approach 
more efficient, it would be interesting to understand how optimistic 
and pessimistic concurrency controls could be combined. In particular,
the former is much better at handling deadlock, while the latter
seems to perform better in many cases [9]. Using our algorithm
could help reduce the overhead and limitations (e.g., handling I/O)
of an optimistic scheme while retaining its liveness benefits.
References 
[1] B. Burns, K. Grimaldi, A. Kostadinov, E. D. Berger, and M. D. Corner.
Flux: A Language for Programming High-Performance Servers. In
Proceedings of the Usenix Annual Technical Conference, 2006. To
appear.
[2] C. Flanagan and S. N. Freund. Automatic synchronization correction.
In Synchronization and Concurrency in Object-Oriented Languages
(SCOOL), Oct. 2005.
[3] C. Flanagan, S. N. Freund, and M. Lifshin. Type Inference for
Atomicity. In TLDI, 2005.
[4] C. Flanagan and S. Qadeer. A Type and Effect System for Atomicity.
In PLDI, 2003.
[5] T. Harris and K. Fraser. Language support for lightweight
transactions. In OOPSLA '03, pages 388-402, Oct. 2003.
[6] T. Harris, S. Marlow, S. P. Jones, and M. Herlihy. Composable
memory transactions. In PPoPP '05, June 2005.
[7] M. Herlihy, V. Luchangco, M. Moir, and W. N. S. III. Software
transactional memory for dynamic-sized data structures. In PODC
'03, pages 92-101, July 2003.
[8] J. M. Lucassen and D. K. Gifford. Polymorphic Effect Systems. In
POPL, 1988.
[9] B. McCloskey, F. Zhou, D. Gay, and E. Brewer. Autolocker:
synchronization inference for atomic sections. In POPL '06, pages
346-358. ACM Press, 2006.
[10] P. Pratikakis, J. S. Foster, and M. Hicks. Locksmith: Context-Sensitive
Correlation Analysis for Race Detection. In Proceedings of the 2006
PLDI, Ottawa, Canada, June 2006. To appear.
[11] J. Rehof and M. Fähndrich. Type-Based Flow Analysis: From
Polymorphic Subtyping to CFL-Reachability. In POPL, 2001.
[12] M. F. Ringenburg and D. Grossman. AtomCaml: First-class atomicity
via rollback. In ICFP '05, pages 92-104, Sept. 2005.
[13] A. Welc, S. Jagannathan, and A. L. Hosking. Transactional monitors




Higher Order Combinators for Join Patterns using STM 
Satnam Singh 
Microsoft 
One Microsoft Way 
Redmond, WA 98052, USA
+14257058208
ABSTRACT 
Join patterns provide a higher level concurrent programming 
construct than the explicit use of threads and locks and have 
typically been implemented with special syntax and run-time 
support. This paper presents a strikingly simple design for a small 
number of higher order combinators which can be composed 
together to realize a powerful set of join patterns as a library in 
an existing language. The higher order combinators enjoy a lock 
free implementation that uses software transactional memory 
(STM). This allows join patterns to be implemented simply as a
library and provides a transformational semantics for join 
patterns. 
1. INTRODUCTION 
Join patterns provide a way to write concurrent programs using
a programming model which is higher level than the
direct invocation of threads and the explicit use of locks in a
specific order. This programming model has at its heart the notion
of atomically consuming messages from a group of channels and
then executing some code that can use the consumed message
values. Join patterns can be used to easily encode related
concurrency idioms like actors and active objects [1][14] as
shown by Benton et al. in [4]. Join patterns typically occur as
language-level constructs with special syntax along with a
sophisticated implementation for a state machine which governs
the atomic consumption of messages. The contribution of this
paper is to show how join patterns can be modeled using a small
but powerful collection of higher order combinators which can
be implemented in a lock free style using software transactional
memory. The combinators are higher order because they take
functions (programs) as arguments and return functions (programs)
as results, which glue together the input programs to form a
resulting composite program. This allows us to make a domain
specific language for join patterns. All of this is achieved as a
library in an existing language without requiring any special 
syntax or run-time code. The complete implementation appears in 
this paper. 
Join patterns emerged from a desire to find higher level
concurrency and communication constructs than locks and threads
for concurrent and distributed programs [13][6]. For example, the
work of Fournet and Gonthier on join calculus [10][11] provides a
process calculus which is amenable to direct implementation in a
distributed setting. Related work on JoCaml [8] and Funnel [20]
presents similar ideas in a functional setting. An adaptation of
join-calculus to an object-oriented setting is found in Comega
(previously known as Polyphonic C#) [4] and similar extensions
have also been reported for Java [16].
Concurrent programming using join patterns promises to provide 
useful higher level abstractions compared with asynchronous 
message passing programs that directly manipulate ports. Comega 
adds new language features to C# to implement join patterns. 
Adding concurrency features as language extensions has many 
advantages including allowing the compiler to analyze and 
optimize programs and detect problems at compile time. This 
paper presents a method of introducing a flexible collection of 
join operations which are implemented solely as a library. We do 
assume the availability of software transactional memories (STM) 
which may be implemented as syntactic language extensions or 
introduced just as a library. In this paper we use the lazy 
functional programming language Haskell as our host language 
for join patterns implemented in terms of STM because of the 
robust implementation which provides composable memory 
transactions [13] which also exploits the type system to statically 
forbid side effecting operations inside STM. In Haskell the STM 
functionality is made available through a regular library. We 
make extensive use of the composable nature of Haskell's STM 
implementation to help define join pattern elements which also 
possess good composability properties. Other reasons for using 
Haskell include its support for very lightweight threads which 
allows us to experiment with join pattern programs with vastly 
more threads than is practical using a language in which threads 
are implemented directly with operating system threads. 
The remainder of this paper briefly presents the salient features of 
Comega and STM in Haskell and then goes on to show how join 
patterns can be added as a library using STM. This paper contains 
listings for several complete Comega and Haskell programs and 
the reader is encouraged to compile and execute these programs. 
2. JOIN PATTERNS IN COMEGA 
The polyphonic extensions to C# comprise just two new concepts: 
(i) asynchronous methods which return control to the caller 
immediately and execute the body of the method concurrently; 
and (ii) chords (also known as 'synchronization patterns' or 'join 
patterns') which are methods whose execution is predicated by 
the prior invocation of some null-bodied asynchronous methods. 
2.1 ASYNCHRONOUS METHODS 
The code below is a complete Comega program that demonstrates 
an asynchronous method. 
using System ; 
public class MainProgram 
{ public class ArraySummer 
  { public async sumArray (int[] intArray) 
    { int sum = 0 ; 
      foreach (int value in intArray) 
        sum += value ; 
      Console.WriteLine ("Sum = " + sum) ; 
    } 
  } 
  static void Main () 
  { ArraySummer Summer = new ArraySummer () ; 
    Summer.sumArray (new int[] {1, 0, 6, 3, 5}) ; 
    Summer.sumArray (new int[] {3, 1, 4, 1, 2}) ; 
    Console.WriteLine ("Main method done.") ; 
  } 
} 
Comega introduces the async keyword to identify an 
asynchronous method. Calls to an asynchronous method return 
immediately and asynchronous methods do not have a return type 
(they behave as if their return type is void). The sumArray 
asynchronous method captures an array from the caller and its 
body is run concurrently with respect to the caller's context. The 
compiler may choose a variety of schemes for implementing the 
concurrency. For example, a separate thread could be created for 
the body of the asynchronous method or a work item could be 
created for a thread pool or, on a multi-processor or multi-core 
machine, the body may execute in parallel with the calling 
context. The second call to sumArray does not need to wait 
until the body of the sumArray method finishes executing from 
the first call to sumArray. 
In this program the two calls to the sumArray method of the 
Summer object behave as if the body of sumArray was forked off 
as a separate thread and control returns immediately to the main 
program. When this program is compiled and run it will in general 
write out the results of the two summations and the "Main method 
done." text in arbitrary order. The Comega compiler can be 
downloaded from: http://research.microsoft.com/Comega/ 
2.2 CHORDS 
The code below is a complete Comega program that demonstrates 
how a chord can be used to make a buffer. 
using System ; 
public class MainProgram 
{ public class Buffer 
  { public async Put (int value) ; 
    public int Get () & Put (int value) 
    { return value ; } 
  } 
  static void Main () 
  { Buffer buf = new Buffer () ; 
    buf.Put (42) ; 
    buf.Put (66) ; 
    Console.WriteLine (buf.Get() + " " + buf.Get()) ; 
  } 
} 
The & operator groups together methods that form a join pattern in 
Comega. A join pattern that contains only asynchronous methods 
will concurrently execute its body when all of the constituent 
methods have been called. A join pattern may have one (but not 
more) synchronous method which is identified by a return type 
other than async. The body for a synchronous join pattern fires 
when all the constituent methods (including the synchronous 
method) are called. The body is executed in the caller's context 
(thread). The Comega join pattern behaves like a join operation 
over a collection of ports (e.g. in JoCaml) with the methods taking 
on a role similar to ports. The calls to the Put method are similar 
in spirit to performing an asynchronous message send (or post) to 
a port. In this case the port is identified by a method name (i.e. 
Put). Although the asynchronous posts to the Put 'port' occur in 
series in the main body the values will arrive in the Put 'port' in 
an arbitrary order. Consequently the program shown above will 
have a non-deterministic output writing either "42 66" or "66 
42". 
3. STM IN CONCURRENT HASKELL 
Software Transactional Memory (STM) is a mechanism for 
coordinating concurrent threads. We believe that STM offers a 
much higher level of abstraction than the traditional combination 
of locks and condition variables, a claim that this paper should 
substantiate. The material in this section is largely borrowed 
directly from [2]. We briefly review the STM idea, and especially 
its realization in concurrent Haskell; the interested reader should 
consult [9] for much more background and details. 
Concurrent Haskell [21] is an extension to Haskell 98, a pure, 
lazy, functional programming language. It provides explicitly- 
forked threads, and abstractions for communicating between 
them. These constructs naturally involve side effects and so, 
given the lazy evaluation strategy, it is necessary to be able to 
control exactly when they occur. The big breakthrough came 
from using a mechanism called monads [22]. Here is the key 
idea: a value of type IO a is an "I/O action" that, when 
performed, may do some input/output before yielding a value of 
type a. For example, the functions putChar and getChar have 
types: 
putChar :: Char -> IO () 
getChar :: IO Char 
That is, putChar takes a Char and delivers an I/O action that, 
when performed, prints the character on the standard output; while 
getChar is an action that, when performed, reads a character 
from the console and delivers it as the result of the action. A 
complete program must define an I/O action called main; 
executing the program means performing that action. For 
example: 
main :: IO () 
main = putChar 'x' 
I/O actions can be glued together by a monadic bind combinator. 
This is normally used through some syntactic sugar, allowing a 
C-like syntax. Here, for example, is a complete program that reads a 
character and then prints it twice: 
main = do { c <- getChar; putChar c; putChar c } 
Threads in Haskell communicate by reading and writing 
transactional variables, or TVars. The operations on TVars are 
as follows: 
data TVar a 
newTVar   :: a -> STM (TVar a) 
readTVar  :: TVar a -> STM a 
writeTVar :: TVar a -> a -> STM () 
All these operations make use of the STM monad, which 
supports a carefully-designed set of transactional operations, 
including allocating, reading and writing transactional variables. 
The readTVar and writeTVar operations both return STM 
actions, but Haskell allows us to use the same do {...} syntax 
to compose STM actions as we did for I/O actions. These STM 
actions remain tentative during their execution: in order to expose 
an STM action to the rest of the system, it can be passed to a new 
function atomically, with type 
atomically :: STM a -> IO a 
It takes a memory transaction, of type STM a, and delivers an I/O 
action that, when performed, runs the transaction atomically with 
respect to all other memory transactions. For example, one might 
say: 
main = do { ...; atomically (getR r 3); ... } 
Operationally, atomically takes the tentative updates and actually 
applies them to the TVars involved, thereby making these effects 
visible to other transactions. The atomically function and all of 
the STM-typed operations are built over the software 
transactional memory. This deals with maintaining a per-thread 
transaction log that records the tentative accesses made to TVars. 
When atomically is invoked the STM checks that the logged 
accesses are valid - i.e. no concurrent transaction has committed 
conflicting updates. If the log is valid then the STM commits it 
atomically to the heap. Otherwise the memory transaction is re- 
executed with a fresh log. 
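As a minimal sketch of how tentative updates become visible only on commit, the following program composes two TVar updates into one STM action and runs it with atomically. The names transfer and demoTransfer and the account balances are hypothetical and chosen only for illustration; the STM calls are those of the Control.Concurrent.STM library.

```haskell
import Control.Concurrent.STM

-- Move `amount` from one balance to the other as a single STM action.
-- Nothing is visible to other threads until atomically commits it.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount
  = do x <- readTVar from
       y <- readTVar to
       writeTVar from (x - amount)
       writeTVar to   (y + amount)

-- Commit the transfer, then read both balances back in one transaction.
demoTransfer :: IO (Int, Int)
demoTransfer
  = do a <- atomically (newTVar 100)
       b <- atomically (newTVar 0)
       atomically (transfer a b 30)
       atomically (do x <- readTVar a
                      y <- readTVar b
                      return (x, y))

main :: IO ()
main = demoTransfer >>= print   -- prints (70,30)
```

Because both writes sit in the same transaction, no other transaction can observe a state in which the amount has left one balance but not yet reached the other.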
Splitting the world into STM actions and I/O actions provides two 
valuable guarantees: (i) only STM actions and pure computation 
can be performed inside a memory transaction; in particular I/O 
actions cannot; (ii) no STM actions can be performed outside a 
transaction, so the programmer cannot accidentally read or write a 
TVar without the protection of atomically. Of course, one can 
always write atomically (readTVar v) to read a TVar in a trivial 
transaction, but the call to atomically cannot be omitted. As an 
example, here is a procedure that atomically increments a TVar: 
incT :: TVar Int -> IO () 
incT v = atomically (do x <- readTVar v 
                        writeTVar v (x+1)) 
The implementation guarantees that the body of a call to 
atomically runs atomically with respect to every other thread; for 
example, there is no possibility that another thread can read v 
between the readTVar and writeTVar of incT. 
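The guarantee can be exercised with a small experiment, sketched here under the assumption that completion is signalled through ordinary MVars (a choice made for this illustration, not something the paper prescribes). The helper names countConcurrently and worker are hypothetical.

```haskell
import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad

incT :: TVar Int -> IO ()
incT v = atomically (do x <- readTVar v
                        writeTVar v (x+1))

-- Fork `threads` workers that each call incT `perThread` times,
-- wait for all of them, and return the final counter value.
countConcurrently :: Int -> Int -> IO Int
countConcurrently threads perThread
  = do counter <- atomically (newTVar 0)
       dones <- forM [1 .. threads] (\_ -> worker counter)
       mapM_ takeMVar dones
       atomically (readTVar counter)
  where
    -- Each worker increments the shared counter, then signals completion.
    worker counter
      = do done <- newEmptyMVar
           _ <- forkIO (replicateM_ perThread (incT counter) >> putMVar done ())
           return done

main :: IO ()
main = countConcurrently 8 1000 >>= print   -- prints 8000
```

No increment is lost, because a transaction whose read of v has been invalidated by another commit is re-executed rather than committed.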
A transaction can block using retry: 
retry :: STM a 
The semantics of retry is to abort the current atomic transaction, 
and re-run it after one of the transactional variables has been 
updated. For example, here is a procedure that decrements a 
TVar, but blocks if the variable is already zero: 
decT :: TVar Int -> IO () 
decT v = atomically (do x <- readTVar v 
                        when (x == 0) 
                             retry 
                        writeTVar v (x-1)) 
The when function examines a predicate (here the test to see if x 
is 0) and if it is true it executes a monadic calculation (here 
retry). 
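The blocking behaviour of retry can be observed with the following sketch, in which a forked thread calls decT on a zero counter and is released once the main thread writes a positive value. The name demoBlocking and the use of an MVar to wait for the forked thread are assumptions made for this illustration.

```haskell
import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad (when)

decT :: TVar Int -> IO ()
decT v = atomically (do x <- readTVar v
                        when (x == 0) retry
                        writeTVar v (x-1))

-- The forked decT cannot commit while v is 0; it is re-run after the
-- main thread updates v, and then brings the counter back to 0.
demoBlocking :: IO Int
demoBlocking
  = do v <- atomically (newTVar 0)
       done <- newEmptyMVar
       _ <- forkIO (decT v >> putMVar done ())
       atomically (writeTVar v 1)
       takeMVar done
       atomically (readTVar v)

main :: IO ()
main = demoBlocking >>= print   -- prints 0
```

The result is the same whichever thread runs first: either decT blocks and is woken by the write, or it finds the counter already at 1 and commits immediately.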
Finally, the orElse function allows two transactions to be tried in 
sequence: (s1 `orElse` s2) is a transaction that first attempts 
s1; if it calls retry, then s2 is tried instead; if that retries as well, 
then the entire call to orElse retries. For example, this 
procedure will decrement v1 unless v1 is already zero, in which 
case it will decrement v2. If both are zero, the thread will block: 
decPair :: TVar Int -> TVar Int -> IO () 
decPair v1 v2 
  = atomically (dec v1 `orElse` dec v2) 
  where -- STM body of decT, since orElse composes STM (not IO) actions 
        dec v = do { x <- readTVar v; when (x == 0) retry; writeTVar v (x-1) } 
In addition, the STM code needs no modifications at all to be 
robust to exceptions. The semantics of atomically is that if an 
exception is raised inside the transaction, then no globally visible 
state change whatsoever is made. 
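The exception behaviour can be checked with a short sketch: a transaction writes to a TVar and then throws, and the write is not visible afterwards. The name demoRollback is hypothetical; throwSTM comes from the STM library and try from Control.Exception.

```haskell
import Control.Concurrent.STM
import Control.Exception (IOException, try)

-- The transaction writes 99 and then raises an exception, so it never
-- commits and the TVar keeps its original value.
demoRollback :: IO Int
demoRollback
  = do v <- atomically (newTVar 10)
       r <- try (atomically (do writeTVar v 99
                                throwSTM (userError "abort")))
       case r of
         Left e   -> putStrLn ("transaction aborted: " ++ show (e :: IOException))
         Right () -> return ()
       atomically (readTVar v)

main :: IO ()
main = demoRollback >>= print   -- the final line printed is 10
```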
An example of how a concurrent data structure from the Java 
JSR-166 [18] collection can be written using STM in Haskell 
appears in [2]. 
4. IMPLEMENTING JOINS WITH STM 
4.1 TRANSACTED CHANNELS 
To help make join patterns out of the STM mechanism in Haskell 
we shall make use of an existing library which provides 
transacted channels: 
data TChan a 
newTChan   :: STM (TChan a) 
readTChan  :: TChan a -> STM a 
writeTChan :: TChan a -> a -> STM () 
A new transacted channel is created with a call to newTChan. A 
value is read from a channel by readTChan and a value is written 
by writeTChan. These are tentative operations which occur 
inside the STM monad and they have to be part of an STM 
expression which is the subject of a call to atomically in order 
to actually execute and commit. 
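A minimal sketch of these operations in use: several writes are grouped into one transaction and the values are then read back in first-in, first-out order. The name demoChannel is hypothetical.

```haskell
import Control.Concurrent.STM
import Control.Monad (replicateM)

-- Write three values in one transaction, then read them back in another.
demoChannel :: IO [Int]
demoChannel
  = do chan <- atomically newTChan
       atomically (mapM_ (writeTChan chan) [1, 2, 3])
       atomically (replicateM 3 (readTChan chan))

main :: IO ()
main = demoChannel >>= print   -- prints [1,2,3]
```

Reading from an empty TChan retries, so the second transaction would block if fewer than three values had been written.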
4.2 SYNCHRONOUS JOIN PATTERNS 
A first step towards a join-pattern-like feature 
of Comega is to capture the notion of a synchronous join 
pattern. We choose to model the methods in Comega as channels 
in Haskell. We can then model a join pattern by atomically 
reading from multiple channels. This feature can be trivially 
implemented using STM, as shown in the definition of join2 below. 
import Control.Concurrent 
import Control.Concurrent.STM 

join2 :: TChan a -> TChan b -> IO (a, b) 
join2 chanA chanB 
  = atomically (do a <- readTChan chanA 
                   b <- readTChan chanB 
                   return (a, b) 
               ) 

taskA :: TChan Int -> TChan Int -> IO () 
taskA chan1 chan2 
  = do (v1, v2) <- join2 chan1 chan2 
       putStrLn ("taskA got: " ++ show (v1, v2)) 

main 
  = do chanA <- atomically newTChan 
       chanB <- atomically newTChan 
       atomically (writeTChan chanA 42) 
       atomically (writeTChan chanB 75) 
       taskA chanA chanB 
Assuming this program is saved in a file called Join2.hs it can 
be compiled using the commands shown below. The Glasgow 
Haskell compiler can be downloaded from 
http://www.haskell.org/ghc/ 
$ ghc --make -fglasgow-exts Join2.hs -o join2.exe 
Chasing modules from: Join2.hs 
Compiling Main ( Join2.hs, Join2.o ) 
Linking ... 
$ ./join2.exe 
taskA got: (42,75) 
In this program the join2 function takes two channels and 
returns a pair of values which have been read from each channel. 
If either or both of the channels are empty then the STM aborts 
and retries. Using this definition of join2 we still do not have a 
full chord yet and we have to piece together the notion of 
synchronizing on the arrival of data on several channels with the 
code to execute when the synchronization fires. This is done in 
the function taskA. 
The implementation of the join mechanism in other languages 
might involve creating a state machine which monitors the arrival 
of messages on several ports and then decides which handler to 
run. The complexity of such an implementation is proportional to 
the number of ports being joined. Exploiting the STM mechanism 
in Haskell gives a join style synchronization almost for free but 
the cost of this implementation also depends on the size of the 
values being joined because these values are copied into a 
transaction log. 
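The all-or-nothing nature of join2 can be illustrated with the following sketch, in which a forked thread waits on the join while the main thread fills the channels one at a time. The name demoJoin and the MVar used to hand back the result are assumptions made for this illustration.

```haskell
import Control.Concurrent
import Control.Concurrent.STM

join2 :: TChan a -> TChan b -> IO (a, b)
join2 chanA chanB
  = atomically (do a <- readTChan chanA
                   b <- readTChan chanB
                   return (a, b))

-- The joiner can only commit once both channels hold a value; until
-- then its tentative read of chanA is discarded and retried.
demoJoin :: IO ((Int, String), Bool)
demoJoin
  = do chanA <- atomically newTChan
       chanB <- atomically newTChan
       result <- newEmptyMVar
       _ <- forkIO (join2 chanA chanB >>= putMVar result)
       atomically (writeTChan chanA 42)
       atomically (writeTChan chanB "wombat")
       pair <- takeMVar result
       -- After the join has fired, both channels are empty again.
       drained <- atomically ((&&) <$> isEmptyTChan chanA <*> isEmptyTChan chanB)
       return (pair, drained)

main :: IO ()
main = demoJoin >>= print   -- prints ((42,"wombat"),True)
```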
4.3 ASYNCHRONOUS JOIN PATTERNS 
In the code above taskA is an example of a synchronous join 
pattern which runs in the context of the caller. We can also 
define an asynchronous join pattern, in which the handler runs in a 
separate thread, as in the following program. 
import Control.Concurrent 
import Control.Concurrent.STM 

join2 :: TChan a -> TChan b -> IO (a, b) 
join2 chanA chanB 
  = atomically (do a <- readTChan chanA 
                   b <- readTChan chanB 
                   return (a, b) 
               ) 

asyncJoin2 chan1 chan2 handler 
  = forkIO (asyncJoinLoop2 chan1 chan2 handler) 

asyncJoinLoop2 chan1 chan2 handler 
  = do (v1, v2) <- join2 chan1 chan2 
       forkIO (handler v1 v2) 
       asyncJoinLoop2 chan1 chan2 handler 

taskA :: Int -> Int -> IO () 
taskA v1 v2 
  = putStrLn ("taskA got: " ++ show (v1, v2)) 

taskB :: Int -> Int -> IO () 
taskB v1 v2 
  = putStrLn ("taskB got: " ++ show (v1, v2)) 

main 
  = do chanA <- atomically newTChan 
       chanB <- atomically newTChan 
       chanC <- atomically newTChan 
       atomically (writeTChan chanA 42) 
       atomically (writeTChan chanC 75) 
       atomically (writeTChan chanB 21) 
       atomically (writeTChan chanB 14) 
       asyncJoin2 chanA chanB taskA 
       asyncJoin2 chanB chanC taskB 
       threadDelay 1000 
asyncJoin2 here is different from join2 in two important 
respects. First, the intention is that the join should automatically 
re-issue. This is done by recursively calling asyncJoinLoop2. 
Second, this version concurrently executes the body (handler) 
when the join synchronization occurs (this corresponds to the case 
in Comega when a chord only contains asynchronous methods). 
This example spawns off two threads which compete for values 
on a shared channel. 
When either thread captures values from a join pattern it then 
forks off a handler thread to deal with these values and 
immediately starts to compete for more values from the ports it is 
watching. Here is a sample execution of this program: 
> ./main 
taskA got: (42,21) 
taskB got: (14,75) 
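The listing above relies on threadDelay to give the handlers time to run. As a hedged alternative sketch, the handlers below report through an extra channel (called out here, a name introduced for this illustration) so that the main thread can wait for exactly two firings and then stop the join loops with killThread. The names demoAsync and report are likewise hypothetical.

```haskell
import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad (replicateM)

join2 :: TChan a -> TChan b -> IO (a, b)
join2 chanA chanB
  = atomically (do a <- readTChan chanA
                   b <- readTChan chanB
                   return (a, b))

-- Re-issuing asynchronous join; returns the ThreadId of the join loop.
asyncJoin2 :: TChan a -> TChan b -> (a -> b -> IO ()) -> IO ThreadId
asyncJoin2 chan1 chan2 handler = forkIO loop
  where loop = do (v1, v2) <- join2 chan1 chan2
                  _ <- forkIO (handler v1 v2)
                  loop

demoAsync :: IO [String]
demoAsync
  = do chanA <- atomically newTChan
       chanB <- atomically newTChan
       chanC <- atomically newTChan
       out   <- atomically newTChan      -- handlers report here
       let report name v1 v2
             = atomically (writeTChan out
                 (name ++ " got: " ++ show (v1 :: Int, v2 :: Int)))
       atomically (writeTChan chanA 42)
       atomically (writeTChan chanC 75)
       atomically (writeTChan chanB 21)
       atomically (writeTChan chanB 14)
       t1 <- asyncJoin2 chanA chanB (report "taskA")
       t2 <- asyncJoin2 chanB chanC (report "taskB")
       msgs <- replicateM 2 (atomically (readTChan out))
       mapM_ killThread [t1, t2]          -- stop the join loops
       return msgs

main :: IO ()
main = demoAsync >>= mapM_ putStrLn
```

Which of the two values on chanB each join captures depends on scheduling, so the two reported pairs can differ between runs.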
4.4 HIGHER ORDER JOIN COMBINATORS 
Haskell allows the definition of infix symbols which can help to 
make the join patterns much easier to read. This section presents 
some type classes which in conjunction with infix symbols 
provide a convenient syntax for join patterns. 
A synchronous join pattern can be represented using one infix 
operator to identify channels to be joined and another operator to 
apply the handler. The infix operators are declared to be left 
associative and are given binding strengths. The purpose of the & 
combinator is to compose together the elements of a join pattern 
which identify when the join should fire (in this case it identifies 
channels). The purpose of the synchronous >>> combinator is to 
take a join pattern and execute a handler when it fires. The result 
of the handler expression is the result of the join pattern. We use a 
lambda expression to bind names to the results of the join pattern 
although we could also have used a named function. A sample 
join pattern is shown in the definition of example below. 
import Control.Concurrent 
import Control.Concurrent.STM 

infixl 5 & 
infixl 3 >>> 

(&) :: TChan a -> TChan b -> STM (a, b) 
(&) chan1 chan2 
  = do a <- readTChan chan1 
       b <- readTChan chan2 
       return (a, b) 

(>>>) :: STM a -> (a -> IO b) -> IO b 
(>>>) joinPattern handler 
  = do results <- atomically joinPattern 
       handler results 

example chan1 chan2 
  = chan1 & chan2 >>> \ (a, b) -> putStrLn (show (a, b)) 

main 
  = do chan1 <- atomically newTChan 
       chan2 <- atomically newTChan 
       atomically (writeTChan chan1 14) 
       atomically (writeTChan chan2 "wombat") 
       example chan1 chan2 
This program writes (14,"wombat"). We can define an 
operator for performing replicated asynchronous joins in a similar 
way, as shown below. 
... 
infixl 3 >!> 

(>!>) :: STM a -> (a -> IO ()) -> IO () 
(>!>) joins cont 
  = do forkIO (asyncJoinLoop joins cont) 
       return () -- discard thread ID 

asyncJoinLoop joinPattern handler 
  = do results <- atomically joinPattern 
       forkIO (handler results) 
       asyncJoinLoop joinPattern handler 

example chan1 chan2 
  = chan1 & chan2 >!> \ (a, b) -> putStrLn (show (a, b)) 

main 
  = do chan1 <- atomically newTChan 
       chan2 <- atomically newTChan 
       atomically (writeTChan chan1 14) 
       atomically (writeTChan chan2 "wombat") 
       atomically (writeTChan chan1 45) 
       atomically (writeTChan chan2 "numbat") 
       example chan1 chan2 
       threadDelay 1000 
The continuation associated with the joins on chan1 and chan2 is 
run each time the join pattern fires. A sample output is: 
(14,"wombat") 
(45,"numbat") 
The asynchronous pattern >!> runs indefinitely or until the main 
program ends and brings down all the other threads. One could 
write a variant of this join pattern which gets notified when it 
becomes indefinitely blocked (through an exception). This 
exception could be caught and used to terminate 
asyncJoinLoop. We choose to avoid such asynchronous 
finalizers. 
We can use Haskell's multi-parameter type class mechanism to 
overload the definition of & to allow more than two channels to be 
joined. Here we define a type class called Joinable which 
allows us to overload the definition of &. Three instances are 
given: one for the case where both arguments are transacted 
channels; one for the case where the second argument is an STM 
expression (resulting from another join pattern); and one for the 
case where the left argument is an STM expression. A fourth 
instance for the case when both arguments are STM expressions 
has been omitted but is straightforward to define. 
class Joinable t1 t2 where 
  (&) :: t1 a -> t2 b -> STM (a, b) 

instance Joinable TChan TChan where 
  (&) = join2 

instance Joinable TChan STM where 
  (&) = join2b 

instance Joinable STM TChan where 
  (&) a b = do (x,y) <- join2b b a 
               return (y, x) 

-- join2 here is the STM-level join (the earlier two-channel definition of &) 
join2 :: TChan a -> TChan b -> STM (a, b) 
join2 chan1 chan2 
  = do a <- readTChan chan1 
       b <- readTChan chan2 
       return (a, b) 

join2b :: TChan a -> STM b -> STM (a, b) 
join2b chan1 stm 
  = do a <- readTChan chan1 
       b <- stm 
       return (a, b) 

example chan1 chan2 chan3 
  = chan1 & chan2 & chan3 >>> \ ((a, b), c) -> 
      putStrLn (show [a,b,c]) 

main 
  = do chan1 <- atomically newTChan 
       chan2 <- atomically newTChan 
       chan3 <- atomically newTChan 
       atomically (writeTChan chan1 14) 
       atomically (writeTChan chan2 75) 
       atomically (writeTChan chan3 11) 
       example chan1 chan2 chan3 
One problem with this formulation is that the & operator is not 
associative. The & was defined to be a left-associated infix 
operator which means that different shapes of tuples are returned 
from the join pattern depending on how the join pattern is 
bracketed. For example: 
example1 chan1 chan2 chan3 
  = (chan1 & chan2) & chan3 >>> \ ((a, b), c) -> 
      putStrLn (show [a,b,c]) 
example2 chan1 chan2 chan3 
  = chan1 & (chan2 & chan3) >>> \ (a, (b, c)) -> 
      putStrLn (show [a,b,c]) 
It would be much more desirable to have nested applications of 
the & operator return a flat structure. We can address this problem 
in various ways. One approach might be to use type classes again 
to provide overloaded definitions for >>> which fix-up the return 
type to be a flat tuple. This method is brittle because it requires us 
to type in instance declarations that map every nested tuple 
pattern to a flat tuple and we can not type in all of them. Other 
approaches could exploit Haskell's dynamic types or the template 
facility for program splicing to define a meta-program that re- 
writes nested tuples to flat tuples. We do not go into the details of 
these technicalities here and for clarity of exposition we stick with 
the nested tuples for the remainder of this paper. 
4.5 JOINS ON LISTS OF CHANNELS 
Joining on a list of channels is easily accomplished by mapping 
the channel reading operation on each element of a list. This is 
demonstrated in the one line definition of joinList below. 
joinList :: [TChan a] -> STM [a] 
joinList = mapM readTChan 

example channels chan2 
  = joinList channels & chan2 >>> \ (a, b) -> 
      putStrLn (show (a, b)) 

main 
  = do chan1 <- atomically newTChan 
       chan2 <- atomically newTChan 
       chan3 <- atomically newTChan 
       atomically (writeTChan chan1 14) 
       atomically (writeTChan chan2 75) 
       atomically (writeTChan chan3 11) 
       example [chan1, chan2] chan3 
This program writes out ([14,75],11). One can define a join 
over arrays of ports in a similar way. For greater generality one 
could define a type class to introduce a method for mapping a 
type T (TChan a) to the isomorphic type T a by performing 
readTChan operations on each channel. One could also look into 
ways of overloading & to operate directly over lists and arrays but 
applying the joinList function as shown above seems to work 
well and interacts well as an expression in a join pattern. 
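One way to sketch the more general mapping from T (TChan a) to T a is to reuse the standard Traversable class instead of introducing a new type class; this is a substitution made for this illustration, not the class the text has in mind. The names joinAll and demoJoinAll are hypothetical.

```haskell
import Control.Concurrent.STM

-- Read one value from every channel held in any Traversable container.
joinAll :: Traversable t => t (TChan a) -> STM (t a)
joinAll = traverse readTChan

-- Join over a list of channels and over an optional channel at once.
demoJoinAll :: IO ([Int], Maybe Int)
demoJoinAll
  = do chan1 <- atomically newTChan
       chan2 <- atomically newTChan
       chan3 <- atomically newTChan
       atomically (writeTChan chan1 14)
       atomically (writeTChan chan2 75)
       atomically (writeTChan chan3 11)
       atomically (do xs <- joinAll [chan1, chan2]
                      y  <- joinAll (Just chan3)
                      return (xs, y))

main :: IO ()
main = demoJoinAll >>= print   -- prints ([14,75],Just 11)
```

Since joinAll returns an ordinary STM action, it can appear wherever an STM join pattern is expected, just like joinList.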
4.6 CHOICE 
The biased choice combinator allows the expression of a choice 
between two join patterns. The choice is biased because it will 
always prefer the first join pattern if it can fire. Each alternative is 
represented by a pair which contains a join pattern and the action 
to be executed if the join pattern fires. 
(|+|) :: (STM a, a -> IO c) -> 
         (STM b, b -> IO c) -> 
         IO c 
(|+|) (joina, action1) (joinb, action2) 
  = do io <- atomically 
               (do a <- joina 
                   return (action1 a) 
                `orElse` 
                do b <- joinb 
                   return (action2 b)) 
       io 
Here the orElse combinator is used to help compose alternatives. 
This combinator tries to execute the first join pattern (joina) and 
if it succeeds a value is bound to the variable a and this is used as 
input to the I/O action called action1. If the first join pattern 
cannot fire, the first argument of orElse performs a retry and then 
the second alternative is attempted (using the pattern joinb). 
This will either succeed, in which case the value emitted from the 
joinb pattern is supplied to action2, or it will fail and the whole 
STM expression will perform a retry. 
A fairer choice can be made by using a pseudo-random variable to 
dynamically construct an orElse expression which will either 
bias joina or joinb. Another option is to keep alternating the 
roles of joina and joinb by using a transacted variable to record 
which join pattern should be checked first. 
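The alternating scheme can be sketched as follows, assuming a TVar Bool that records which alternative is to be preferred on the next call. The names fairChoice, turn and demoFair are hypothetical and not part of the library presented in this paper.

```haskell
import Control.Concurrent.STM
import Control.Monad (replicateM)

-- Choice that flips its preference each time it commits. If neither
-- pattern can fire, the whole transaction retries and the flip is
-- discarded along with it.
fairChoice :: TVar Bool -> (STM a, a -> IO c) -> (STM b, b -> IO c) -> IO c
fairChoice turn (joina, action1) (joinb, action2)
  = do io <- atomically
               (do preferFirst <- readTVar turn
                   writeTVar turn (not preferFirst)
                   let left  = fmap action1 joina
                       right = fmap action2 joinb
                   if preferFirst then left `orElse` right
                                  else right `orElse` left)
       io

demoFair :: IO [String]
demoFair
  = do chanA <- atomically newTChan
       chanB <- atomically newTChan
       turn  <- atomically (newTVar True)
       atomically (mapM_ (writeTChan chanA) ["a1", "a2"])
       atomically (mapM_ (writeTChan chanB) ["b1", "b2"])
       replicateM 4 (fairChoice turn (readTChan chanA, return)
                                     (readTChan chanB, return))

main :: IO ()
main = demoFair >>= print   -- prints ["a1","b1","a2","b2"]
```

With both channels populated, the biased combinator would drain chanA first; here the recorded preference makes the two patterns take turns.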
4.7 DYNAMIC JOINS 
Join patterns in Comega occur as declarations which make them a 
very static construct. Often one wants to dynamically construct a 
join pattern depending on some information that is only available 
at run-time. This argues for join patterns occurring as expressions 
or statements rather than as declarations. Since in our formulation 
join patterns are just expressions we get dynamic joins for free. 
Here is a simple example: 
example numSensors chan1 chan2 chan3 
  = if numSensors == 2 then 
      chan1 & chan2 >!> \ (a, b) 
                         -> putStrLn (show (a, b)) 
    else 
      chan1 & chan2 & chan3 >!> \ ((a, b), c) 
                                 -> putStrLn (show (a, b, c)) 
In this example the value of the variable numSensors is used to 
determine which join pattern is executed. A more elaborate 
example would be a join pattern which used the values read from 
the pattern to dynamically construct a new join pattern in the 
handler function. Another example would be a join pattern which 
returns channels which are then used to dynamically construct a 
join pattern in the handler function. 
Statically defined joins enjoy more opportunities for efficient 
compilation and analysis than dynamically constructed joins. 
4.8 CONDITIONAL JOINS 
Sometimes it is desirable to predicate a join pattern to fire only 
when some guard conditions are met or only if the values that 
would be read by the join pattern satisfy certain criteria. 
We can avoid the complexities of implementing conditional join 
patterns through tricky concurrent programming and language 
extension by once again exploiting the STM library interface in 
Haskell. First we define guards that predicate the consumption of 
a value from a channel. 
(?) :: TChan a -> Bool -> STM a 
(?) chan predicate 
  = if predicate then 
      readTChan chan 
    else 
      retry 

example cond chan1 chan2 
  = (chan1 ? cond) & chan2 >>> \ (a, b) -> 
      putStrLn (show (a, b)) 
The guards expressed by ? can only be boolean expressions and 
one could always have written a dynamically constructed join 
pattern instead of a guard. The implementation exploits the retry 
function in the Haskell STM interface to abort this transacted 
channel read if the predicate is not satisfied. 
A more useful kind of conditional join would want to access some 
shared state about the system to help formulate the condition. 
Shared state for STM programs can only be accessed via the STM 
monad so we can introduce another overloaded version of ? 
which takes a condition in the STM monad: 
(?) :: TChan a -> STM Bool -> STM a 
(?) chan predicate 
  = do cond <- predicate 
       if cond 
         then readTChan chan 
         else retry 
Now the predicate can be supplied with transacted variables 
which can be used to predicate the consumption of a value from a 
channel. These conditions can also update shared state. Several 
conditions may attempt such updates concurrently, but the 
STM mechanism will ensure that only consistent updates are 
allowed. 
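A minimal sketch of an STM-valued guard in use, assuming the guard retries whenever its condition evaluates to False. The shared state is a TVar Bool called gate, and both that name and demoGate are introduced only for this illustration.

```haskell
import Control.Concurrent
import Control.Concurrent.STM

-- Guarded read: consume from the channel only when the STM condition holds.
(?) :: TChan a -> STM Bool -> STM a
(?) chan predicate
  = do cond <- predicate
       if cond then readTChan chan else retry

-- The reader cannot consume the value while the gate is closed, so the
-- value is still queued when the main thread inspects the channel.
demoGate :: IO (Bool, Int)
demoGate
  = do chan <- atomically newTChan
       gate <- atomically (newTVar False)
       result <- newEmptyMVar
       _ <- forkIO (atomically (chan ? readTVar gate) >>= putMVar result)
       atomically (writeTChan chan 7)
       stillQueued <- atomically (fmap not (isEmptyTChan chan))
       atomically (writeTVar gate True)   -- opening the gate wakes the reader
       value <- takeMVar result
       return (stillQueued, value)

main :: IO ()
main = demoGate >>= print   -- prints (True,7)
```

The blocked reader is re-run automatically when gate changes, because gate is one of the transactional variables its aborted attempt had read.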
This definition of ? also allows quite powerful conditional 
expressions to be written which can depend on the values that 
would be read from other channels in the join pattern. The 
condition STM predicate can be supplied with the channels in the 
join pattern or other transacted variables to help form the 
predicate. This allows quite dynamic forms of join e.g. sometimes 
performing a join pattern on channels chanl and chan2 and 
sometimes performing a join pattern on channels chanl and 
chan3 depending on the value read from chanl. 
A special case of the STM predicate version of ? is a conditional 
join that tests to see if the value that would be read satisfies some 
predicate. The code below defines a function ?? which takes such 
a predicate function as one of its arguments. The example shows a 
join pattern which will only fire if the value read on chanl is 
greater than 3. 
(??) :: TChan a -> (a -> Bool) -> STM a 
(??) chan predicate 
  = do value <- readTChan chan 
       if predicate value 
         then return value 
         else retry 

example chan1 chan2 
  = (chan1 ?? \x -> x > 3) & chan2 >>> \ (a, b) 
      -> putStrLn (show (a, b)) 
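To see that a failed value test leaves the channel untouched, the sketch below probes a channel through orElse so that the test can be observed without blocking. It assumes the guard retries when the value fails the test; the names probe and demoGuard are hypothetical.

```haskell
import Control.Concurrent.STM

-- Value guard: read the head of the channel, and retry (undoing the
-- tentative read) if it fails the test.
(??) :: TChan a -> (a -> Bool) -> STM a
(??) chan predicate
  = do value <- readTChan chan
       if predicate value then return value else retry

-- Non-blocking probe for a head value greater than 3.
probe :: TChan Int -> IO (Maybe Int)
probe chan = atomically (fmap Just (chan ?? (> 3)) `orElse` return Nothing)

demoGuard :: IO (Maybe Int, Int, Maybe Int)
demoGuard
  = do chan <- atomically newTChan
       atomically (mapM_ (writeTChan chan) [2, 5])
       first  <- probe chan                    -- head is 2: the guard fails
       two    <- atomically (readTChan chan)   -- the 2 is still in the channel
       second <- probe chan                    -- head is now 5: the guard passes
       return (first, two, second)

main :: IO ()
main = demoGuard >>= print   -- prints (Nothing,2,Just 5)
```

Note that this formulation only ever inspects the value at the head of the channel; a later value that would pass the test is not reached until the head is consumed.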
A conditional join pattern could be implemented in Comega by 
returning a value to a port if it does not satisfy some predicate. If 
several threads read from the same port and then return the values 
they read there is a possibility that the port will end up with 
values returned in a different order. Furthermore, other threads 
can make judgments based on the state of the port after the value 
has been read but before it has been returned. The conditional 
formulations that we present here atomically remove values 
from a port when a predicate is satisfied so they do not suffer 
from such problems. 
4.9 NON-BLOCKING VARIANTS 
Non-blocking variants may be made by composing the blocking 
versions of join patterns using orElse with an alternative that 
returns negative status information. This is demonstrated in the 
definition of nonBlockingJoin below which returns a value 
wrapped in a Maybe type which has constructors Just a for a 
positive result and Nothing for a negative result. 
nonBlockingJoin :: STM a -> STM (Maybe a) 
nonBlockingJoin pattern 
  = (do result <- pattern 
        return (Just result)) 
    `orElse` 
    (return Nothing) 

Understanding how exceptions behave in this join pattern scheme 
amounts to understanding how exceptions behave in the Haskell 
STM interface. Exceptions can be thrown and caught as described 
in [13]. Our encoding of join patterns gives a default backward 
error recovery scheme for the implementation of the join pattern 
firing mechanism because if an error occurs in the handler code 
the transaction is restarted and any consumed values are returned 
to ports from which they were read. The handler code however 
does not execute in the STM monad so it may raise an exception. 
This exception will require forward error recovery which may 
involve returning values to channels because this code is executed 
after the transacted consumption of values from channels has 
committed. 
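The two recovery situations can be contrasted in a short sketch. It assumes that errors are signalled with ordinary IOException values; failingJoin, failingHandler and demoRecovery are hypothetical names. An exception thrown inside the transaction discards the read, whereas an exception thrown by the handler after commit leaves the value consumed, so the sketch returns it explicitly with unGetTChan.

```haskell
import Control.Concurrent.STM
import Control.Exception (IOException, throwIO, try)

-- Fails after reading: the exception aborts the transaction and the read is undone.
failingJoin :: TChan Int -> STM Int
failingJoin chan = do
  _ <- readTChan chan
  throwSTM (userError "failure inside the transaction")

-- A handler that fails in IO, after the consuming transaction has committed.
failingHandler :: Int -> IO ()
failingHandler _ = throwIO (userError "failure in the handler")

demoRecovery :: IO (Bool, Bool, Maybe Int)
demoRecovery = do
  chan <- newTChanIO
  atomically $ writeTChan chan 42
  -- Backward recovery: the value is still on the channel after the abort.
  aborted <- try (atomically (failingJoin chan)) :: IO (Either IOException Int)
  survivedAbort <- atomically $ not <$> isEmptyTChan chan
  -- Forward recovery: the read commits, the handler fails, the value is put back.
  value <- atomically $ readTChan chan
  outcome <- try (failingHandler value) :: IO (Either IOException ())
  case outcome of
    Left _ -> atomically $ unGetTChan chan value
    Right () -> return ()
  restored <- atomically $ tryReadTChan chan
  return (either (const True) (const False) aborted, survivedAbort, restored)
```

Between the committed read and the call to unGetTChan other threads can observe the channel without the value, which is exactly the window that backward recovery inside the transaction avoids.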
5. RELATED WORK 
A join pattern library for C# called CCR was recently announced 
[7] although the underlying model is quite different from what is 
presented here. This model exposes 'arbiters' which govern how 
messages are consumed (or returned) to ports. These arbiters are 
the fundamental building blocks which are used to encode a 
variety of communication and synchronization constructs 
including a variant of join patterns. A significant difference is the 
lack of a synchronous join because all handler code for join 
patterns is asynchronously executed on a worker thread. This 
requires the programmer to explicitly code in a continuation 
passing style although the iterator mechanism in C# has been 
exploited by the CCR to effectively get the compiler to make the 
continuation passing transform automatically for the user (in the 
style of CLU [17]). 
One could imagine extending Haskell with JoCaml [11] style join 
patterns, which are a special language feature with special syntax. 
Here is an example of a composite join pattern from the JoCaml 
manual: 
# let def apple! () | pie! ()
    = print_string "apple pie" ;
# or raspberry! () | pie! ()
    = print_string "raspberry pie" ;
Three ports are defined: apple, pie and raspberry. The composite 
join pattern defines a synchronization pattern which contains two 
alternatives: one which is eligible to fire when there are values 
available on the ports apple and pie and the other when there are 
values available on raspberry and pie. When there is only one 
message on pie the system makes an internal choice e.g. 
# spawn {apple () | raspberry () | pie ()}
# ;;
-> raspberry pie
Alternatively, the system could have equally well responded with 
apple pie. Expressing such patterns using the Haskell STM 
encoding of join patterns seems very similar yet this approach 
does not require special syntax or language extensions. However, 
making join patterns concrete in the language does facilitate 
compiler analysis and optimization. 
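A sketch of how the composite pattern might look when written directly against the stm package follows. The ports are modelled as TChan () values and the names pieChoice and demoPie are hypothetical. One difference from JoCaml is visible here: orElse always tries its left alternative first, so the choice in this sketch is left-biased rather than an internal choice made by the system.

```haskell
import Control.Concurrent.STM

-- Two alternatives over three ports; the left alternative is tried first.
pieChoice :: TChan () -> TChan () -> TChan () -> STM String
pieChoice apple raspberry pie =
  (readTChan apple >> readTChan pie >> return "apple pie")
    `orElse`
  (readTChan raspberry >> readTChan pie >> return "raspberry pie")

-- One message on each port: only one alternative can consume the single pie.
demoPie :: IO (String, Bool)
demoPie = do
  apple <- newTChanIO
  raspberry <- newTChanIO
  pie <- newTChanIO
  atomically $ mapM_ (\port -> writeTChan port ()) [apple, raspberry, pie]
  fired <- atomically $ pieChoice apple raspberry pie
  raspberryLeft <- atomically $ not <$> isEmptyTChan raspberry
  return (fired, raspberryLeft)
```

The message on raspberry is left in place after the apple alternative fires, so it remains available to a later join.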
6. CONCLUSIONS AND FUTURE WORK 
The main contribution of this paper is the realization in Haskell 
STM of join combinators which model join patterns that already 
exist in other languages. The embedding of Comega style join 
patterns into Haskell by exploiting a library that gives a small but 
expressive power. Furthermore, the embedding is implemented 
solely as a library without any need to extend the language and 
modify the compiler. The entire source of the embedding is 
compact enough to appear in several forms in this paper along 
with examples. 
Several reasons conspire to aid the embedding of join patterns as 
we have presented them. The very composable nature of STM in 
Haskell means that we can separately define the behavior of 
elements of join patterns and then compose them together with 
powerful higher order combinators like &, >>>, >!> and ?. STM 
actions can be glued together and executed atomically which 
allows a good separation of concerns between what to do about a 
particular channel and what to do about the interaction between 
all the channels. The behavior of the exception mechanism also 
composes in a very pleasant way. 
The type safety that Haskell provides to ensure that no side- 
effecting operations can occur inside an STM operation also 
greatly aids the production of robust programs. The ability to 
define symbolic infix operators and exploit the type class system 
for systematic overloading also helps to produce join patterns that 
are concise. We also benefit from representing join patterns as 
expressions rather than as declarations in Comega. 
The STM mechanism proves to be very effective at allowing us to 
describe conditional join patterns. These would be quite 
complicated to define in terms of lower level concurrency 
primitives. We were able to give very short and clear definitions 
of several types of conditional join patterns. 
The ability to perform dynamic joins over composite data 
structures that contain ports (like lists) and conditional joins 
makes this library more expressive than what is currently 
implemented in Comega. Furthermore, in certain situations the 
optimistic concurrency of an STM based implementation may yield 
advantages over a more pessimistic lock-based implementation of 
a finite state machine for join patterns. Another approach for 
realizing join patterns in a lock free manner could involve 
implementing the state machine at the heart of the join machinery 
in languages like Comega using STM rather than explicit locks. 
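A dynamic join over a list of ports can be sketched as a single STM action whose size is only known at run time. This is an illustration under the assumption that ports are plain TChan values; joinAll and demoDynamic are hypothetical names and are not the list instance of the library.

```haskell
import Control.Concurrent.STM
import Control.Monad (replicateM)

-- Read one value from every channel in the list, atomically and all-or-nothing.
joinAll :: [TChan a] -> STM [a]
joinAll = mapM readTChan

-- Build n channels at run time, put one value on each, then join over all of them.
demoDynamic :: Int -> IO [Int]
demoDynamic n = do
  chans <- replicateM n newTChanIO
  atomically $ mapM_ (\(chan, i) -> writeTChan chan i) (zip chans [1 ..])
  atomically $ joinAll chans
```

If any channel in the list is empty, the whole transaction retries and no value is consumed from the others.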
Even if an STM representation of join patterns is not the first 
choice of an implementer we think that the transformational 
semantics that they provide for join patterns is a useful model for 
the programmer. Many of the join patterns we have shown could 
have been written directly in the STM monad. We think that when 
synchronization is appropriately expressible as a join pattern then 
this is preferable for several reasons, including conveying the 
programmer's intent and giving the compiler an opportunity to 
compile such join patterns using a more specialized mechanism 
than STM. 
An interesting avenue of future work suggested by one of the 
anonymous reviewers is to consider the reverse experiment i.e. 
use an optimistic implementation of join-calculus primitives in 
conjunction with monitors and condition variables to try and 
implement the Haskell STM mechanism. Our intuition is that such an 
approach would be much more complicated to implement. We 
believe the value of the experiment presented in this paper is not 
to do with the design of an efficient join pattern library but rather 
to show that STM may be a viable idiom for capturing various 
domain specific concurrency abstractions. 
Although a Haskell based implementation is not likely to enjoy 
widespread use or adoption we do believe that the model we have 
presented provides a useful workbench for exploring how join 
patterns can be encoded using a library based on higher order 
combinators with a lock free implementation. Higher order 
combinators can be encoded to some extent in conventional 
languages using constructs like delegates in C#. Prototype 
implementations of STM are available for some mainstream 
languages e.g. Join Java [16] and SXM [12] for C#. When 
translating examples from the Haskell STM world into languages 
like C# which rely on heavyweight operating system threads one 
may need to introduce extra machinery like threadpools which are 
not required in Haskell because of its support for a large number 
of lightweight threads. 
REFERENCES 
[1] Agha, G. ACTORS: A model of Concurrent computations in 
Distributed Systems. The MIT Press, Cambridge, Mass. 
1990. 
[2] Discolo, A., Harris, T., Marlow, S., Peyton Jones, S., Singh, 
S. Lock Free Data Structures using STM Haskell. Eighth 
International Symposium on Functional and Logic 
Programming (FLOPS 2006). April 2006 (to appear). 
[3] Appel, A. Compiling with Continuations. Cambridge 
University Press. 1992. 
[4] Benton, N., Cardelli, L., Fournet, C. Modern Concurrency 
Abstractions for C#. ACM Transactions on Programming 
Languages and Systems (TOPLAS), Vol. 26, Issue 5, 2004. 
[5] Birrell, A. D. An Introduction to Programming with Threads. 
Research Report 35, DEC. 1989. 
[6] Chaki, S., Rajamani, S. K., Rehof, J. Types as models: 
Model Checking Message Passing Programs. In Proceedings 
of the 29th Annual ACM SIGPLAN-SIGACT Symposium 
on Principles of Programming Languages. ACM, 2002. 
[7] Chrysanthakopoulos, G., Singh, S. An Asynchronous 
Messaging Library for C#. Synchronization and Concurrency 
in Object-Oriented Languages (SCOOL). October 2005. 
[8] Conchon, S., Le Fessant, F. JoCaml: Mobile agents for 
Objective-Caml. In First International Symposium on Agent 
Systems and Applications (ASA'99)/Third International 
Symposium on Mobile Agents (MA'99). IEEE Computer 
Society, 1999. 
[9] Daumé III, H. Yet Another Haskell Tutorial. 2004. Available 
at http://www.isi.edu/~hdaume/htut/ or via 
http://www.haskell.org/. 
[10] Fournet, C., Gonthier, G. The reflexive chemical abstract 
machine and the join calculus. In Proceedings of the 23rd 
ACM-SIGACT Symposium on Principles of Programming 
Languages. ACM, 1996. 
[11] Fournet, C., Gonthier, G. The join calculus: a language for 
distributed mobile programming. In Proceedings of the 
Applied Semantics Summer School (APPSEM), Caminha, 
Sept. 2000, G. Barthe, P. Dybjer, L. Pinto, J. Saraiva, Eds. 
[12] Guerraoui, R., Herlihy, M., Pochon, S., Polymorphic 
Contention Management. Proceedings of the 19th 
International Symposium on Distributed Computing (DISC 
2005), Cracow, Poland, September 26-29, 2005. 
[13] Harris, T., Marlow, S., Peyton Jones, S., Herlihy, M. Composable 
Memory Transactions. PPoPP 2005. 
[14] Hewitt, C. Viewing control structures as patterns of passing 
messages. Journal of Artificial Intelligence 8, 3, 323-364. 
1977. 
[15] Igarashi, A., Kobayashi, K. Resource Usage Analysis. ACM 
Transactions on Programming Languages and Systems 
(TOPLAS), Volume 27, Issue 2, 2005. 
[16] Itzstein, G. S., Kearney, D. Join Java: An alternative 
concurrency semantics for Java. Tech. Rep. ACRC-01-001, 
University of South Australia, 2001. 
[17] Liskov, B. A History of CLU. ACM SIGPLAN Notices, 
28:3, 1993. 
[18] Lea, D. The java.util.concurrent Synchronizer Framework. 
PODC CSJP Workshop. 2004. 
[19] Ousterhout, J. Why Threads Are A Bad Idea (for most 
purposes). Presentation at USENIX Technical Conference. 
1996. 
[20] Odersky, M. Functional nets. In Proceedings of the European 
Symposium on Programming. Lecture Notes in Computer 
Science, vol. 1782. Springer-Verlag, 2000. 
[21] Peyton Jones, S., Gordon A., Finne S. Concurrent Haskell. In 
23rd ACM Symposium on Principles of Programming 
Languages (POPL'96), pp. 295-308. 
[22] Peyton Jones, S., Wadler, P. Imperative functional 
programming. In 20th ACM Symposium on Principles of 
Programming Languages (POPL'93), pp. 71-84. 