Challenges to Providing Performance Isolation in Transactional Memories by Craig Zilles & David H. Flint
Dept. of Computer Science, University of Illinois at Urbana-Champaign
201 N. Goodwin Ave. Urbana, IL 61801
[zilles, dflint]@cs.uiuc.edu
Abstract
Due to the inevitability of chip multiprocessors and the
difﬁculty of parallel software development, there has been
widespread interest in techniques that facilitate parallel
programming. Recently, there have been a number of pro-
posals regarding hardware support for transactional pro-
gramming models. A key advantage of transactional pro-
gramming models over lock-based synchronization is that
two critical sections may be executed in parallel if they do
not conﬂict, without the programmer needing to implement
ﬁne-grain locking. In effect, transaction hardware provides
an implicit system-wide lock that is speculatively elided.
In this paper, we identify three factors that allow one
process to impact the performance of other concurrently
running, independent processes to a degree not present in
traditional (non-transactional) multiprocessors: (1) forward
progress is defined in terms of user-specified transactions
(rather than instructions), (2) inefficiencies in verifying the
independence of critical sections with large data footprints,
and (3) subsumption of kernel execution in user-mode
transactions. We describe and discuss specific problematic
scenarios in the context of existing transactional hardware
proposals. We believe these scenarios should be considered in
architecting a robust transactional-based system.
1. Introduction
With power becoming the primary implementation con-
straint in microprocessors, there has been a shift in focus
in commercial processors from single-thread performance
(whose power scales exponentially with performance) to
multi-core designs (whose power and performance scale
linearly with the number of cores). The benefits of multi-core
designs, however, rely on the presence of thread-level
parallelism (TLP) in the workload. This TLP is not to be taken
for granted, as parallel software development has all of the
difficulties of sequential software development and the
challenges of correctly synchronizing among threads.
The traditional approach to synchronization involves
identifying critical sections — regions of code that ac-
cess potentially shared data in which race conditions may
produce undesirable results — and ensuring race-free ex-
ecution by using a synchronization variable (e.g., a lock)
to provide mutually exclusive access to the data. Im-
plementing synchronization presents a number of chal-
lenges: 1) correctness requires that all accesses to a given
data item must be protected by the same synchroniza-
tion variable, 2) ensuring sufﬁcient concurrency may
require ﬁne-grain locking, which requires more pro-
grammer effort, and 3) avoiding deadlock requires that
when multiple synchronization variables must be ac-
quired, all critical sections must acquire them in the same
order.
In light of the challenges of parallel programming,
computer architects have sought techniques that help alleviate
the burden of parallelization that they have thrust upon the
programmer. In particular, there have been a number of
proposals regarding hardware support for transactional
memory programming models [1, 4, 6, 11].
With transactional memory, transactions can be used to
atomically execute a critical section. Transactions are
arbitrary code sequences delimited by markers that begin and
end the transaction. When a transaction completes, all loads
and stores within the transaction will appear to have
occurred at the time of the commit. To achieve concurrency in
the presence of potentially conflicting transactions,
transactions must be executed speculatively. When no conflict
occurs, the transaction can be committed, writing specula-
tively buffered stores to architected memory. If a conﬂict is
detected, one or more transactions will need to be aborted.
On an abort, the state of the affected process is rolled back
to the state at the beginning of the transaction, with no
stores having been committed. At this point the transaction
can be restarted, perhaps with provisions to ensure forward
progress.
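The buffer, commit, and abort behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `ToyMemory`, `ToyTransaction`, and `ConflictError` are hypothetical, and conflicts are detected by validating the read set at commit time, which is a simplification and not the coherence-based detection used by the hardware proposals discussed in this paper.

```python
class ConflictError(Exception):
    """Raised when a location read by the transaction changed before commit."""


class ToyMemory:
    """Architected memory plus a per-address version counter (hypothetical model)."""

    def __init__(self):
        self.data = {}
        self.version = {}

    def begin(self):
        return ToyTransaction(self)


class ToyTransaction:
    def __init__(self, memory):
        self.memory = memory
        self.read_versions = {}  # address -> version observed at first read
        self.write_buffer = {}   # address -> speculatively buffered store

    def load(self, address):
        # A transaction sees its own speculative stores first.
        if address in self.write_buffer:
            return self.write_buffer[address]
        self.read_versions.setdefault(address, self.memory.version.get(address, 0))
        return self.memory.data.get(address, 0)

    def store(self, address, value):
        # Stores are buffered; architected memory is untouched until commit.
        self.write_buffer[address] = value

    def abort(self):
        # Roll back: discard all speculative state, committing nothing.
        self.read_versions.clear()
        self.write_buffer.clear()

    def commit(self):
        # Validate the read set, then publish all buffered stores together.
        for address, seen in self.read_versions.items():
            if self.memory.version.get(address, 0) != seen:
                self.abort()
                raise ConflictError(address)
        for address, value in self.write_buffer.items():
            self.memory.data[address] = value
            self.memory.version[address] = self.memory.version.get(address, 0) + 1
```

Two transactions that increment the same address conflict, so the second to commit is aborted and must be restarted; transactions that touch disjoint addresses both commit.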
The key advantage of transactional memory program-
ming models over traditional lock-based synchronization is
that they obviate the need to assign synchronization
variables. Although programmers must still identify the
boundaries of critical sections, transactions shift the burden of
ensuring mutual exclusion from the programmer to hardware.
Furthermore, multiple non-interfering transactions may be
executed concurrently, without the programmer needing to
implement ﬁne-grain locking.
Although there are a number of intriguing advantages
to machines that support transactional programming mod-
els, these machines are very different from existing systems
and a number of questions remain as to how they should be
architected. In this paper, we consider how a transactional
programming model impacts the property of performance
isolation, the ability of a machine to prevent the behavior
of one thread/process from impacting the performance of
other threads/processes [12]. In Section 2, we provide
background on the concept of performance isolation and how it
can be achieved in traditional (non-transactional) machines.
Much of the difficulty in achieving performance isolation
in transactional memory machines revolves around support
for “unbounded” transactions. Although it is generally
anticipated that most transactions will be small, recent pro-
posals explore mechanisms to support “unbounded” trans-
actions — those whose run time is not limited and whose
memory footprint is bounded only by physical or virtual
memory. On the face of it, unbounded transactions seem
compelling, as we want to architect transactions in a way
that will enable portability across hardware
implementations. Furthermore, if we are to specify limitations on
transactions, useful limitations are likely to be in terms of a given
memory footprint and/or number of dynamic instructions,
two metrics that are difﬁcult to reason about at the source
code level.
Nevertheless, the implications of unbounded transactions
are subtle. Transactional memory systems can differ
from non-transactional systems in three important ways: 1)
the granularity of forward progress is determined by the
application, 2) the independence of transactions must be
verified, and 3) kernel data structures can potentially be
“held” by user code. We present these differences by
considering three challenges to achieving performance isolation:
1. How a process can hog physical memory (Section 3),
2. How large, long running transactions can prevent or
inhibit the forward progress of other, unrelated processes
(Section 4), and
3. How user-space transactions can inhibit other pro-
cesses from receiving kernel services (Section 5).
For concreteness, we consider these challenges in
the context of three previously proposed implementa-
tions of transactional memory. The ﬁrst two difﬁculties are
observed in Transactional Memory Coherence and Con-
sistency (TCC) [4] and Limited Transactional Memory
(LTM) [1], which, for efﬁciency, extend the cache coher-
ence mechanism to detect conﬂicts. In this discussion, we
identify the potential that a directory-based coherence pro-
tocol has to isolate transactions from non-communicating
processes. For the last challenge, we consider an Un-
bounded Transactional Memory (UTM) [1] machine that
allows user-mode transactions to use kernel services, to ex-
plore the interaction between unbounded transactions, for-
ward progress guarantees, and allowing kernel services
in user-mode transactions. We conclude with a discussion
(in Section 6) that considers whether unbounded
transactions should be architected at all (“large, but bounded”
transactions are an as yet unconsidered alternative) or whether
such transactions should be allowed, but with weaker
forward progress guarantees.
2. Background: Performance Isolation
Performance isolation describes the degree to which a
machine can prevent the behavior of one thread (or pro-
cess) from impacting the performance of another thread. In
the context of a multiprocessor machine, a good proxy for
performance isolation is whether one thread executing on
a multiprocessor with N cores is guaranteed to perform at
least as well as it would running alone on a machine with
1/Nth as many resources (i.e., 1 core, 1/Nth the physical
memory, 1/Nth the memory bandwidth, etc.).
To achieve performance isolation, the allocation of sys-
tem resources needs to be out of the hands of application
processes. When the machine and/or system software can
control resource allocation, attempts by a process to use a
large amount of resources will cause that process to slow
down, without impacting other processes. If performance
isolation is not achieved, then a system is vulnerable to
“microarchitectural denial-of-service attacks” [3].
Conventional (non-transactional) multiprocessor sys-
tems can be made to have a high degree of performance
isolation. Below, we discuss a number of system re-
sources and how conventional systems can control their
allocation. Generally, these mechanisms can be imple-
mented to have a gradual impact (i.e., the performance
penalty is slight when a process only slightly misbe-
haves).
1. CPU time: CPU time is assigned to processes by the
scheduler in the operating system. A process can be
forced to relinquish a processor by using an interrupt
(timer or interprocessor).
2. I/O bandwidth: As access to I/O devices is generally
under the control of the operating system, the OS can
allocate these resources as it sees ﬁt. If a process tries
to exceed its allocation (by requesting many I/O operations in a
short period of time), its time slice can be shortened to
keep its I/O rate in line with that of other processes.
3. Physical memory: Allocation of physical memory is
under the control of the operating system, which can assign
each process a budget of pages. If the working set
of a process signiﬁcantly exceeds its page budget, the
process will frequently fault and will be descheduled
while an I/O is requested.
4. Memory bandwidth: Allocation of memory band-
width is controlled by the hardware. If there is con-
tention for a link or a bank, a round-robin policy can
be used to ensure that each process is allocated its fair
share.
5. Shared caches: The advent of chip multiprocessors
has led to the sharing of caches between cores. These
shared caches can be partitioned (either dynamically
or statically) to prevent threads from signiﬁcantly im-
pacting each other’s miss rates [7].
In these ways, a process can be throttled back so as to use
only its share of resources. A misbehaving process can be
allowed to continue to make forward progress without
negatively impacting other processes.
3. Granularity of Forward Progress
The ﬁrst challenge to performance isolation in systems
that support transactional programming models is that the
granularity of forward progress is determined by the appli-
cation. Whereas in a conventional machine, an execution
can be stepped forward one instruction at a time, transac-
tions must be atomically committed.
Importantly, the transactional programming model
provides no alternative means of achieving forward progress.
Unlike techniques such as Speculative Lock Elision (SLE) [10],
which achieve a transactional-memory implementation of
a conventionally-synchronized critical section, there is no
non-transactional implementation to fall back on. We cannot
expect the programmer to produce two copies of the
code (one transactional, one conventionally synchronized),
as that negates all of the benefits of the transactional
programming model. As a result, an application determines the
granularity of the steps in which its execution can progress, and the
system can only control when (and if) these steps will be
taken.
To see how this can affect performance isolation,
consider the allocation of physical memory to processes. In a
traditional architecture, a process can be guaranteed to make
forward progress with a physical memory page budget of
only a few pages; some RISC architectures require as
little as 2 pages, not counting the page table. As a result, the
operating system has almost complete flexibility in
deciding how many pages of physical memory should be
allocated to a process.
In contrast, consider the proposed LTM implementation.
Because this machine uses the cache coherence protocol
to detect conﬂicts between transactions, the whole transac-
tion must be resident simultaneously in physical memory.
To support “unbounded” transactions, the operating system
has to be prepared to allocate the whole virtual memory
image of a process into physical memory (provided there is
space).
In LTM machines, unbounded transactions introduce a
coupling between physical and virtual memory allocations
by the operating system.¹ In traditional systems, the
operating system can allocate a large amount of virtual memory
— necessary to make forward progress — to a process be-
cause it is a cheap resource. It can then limit the amount of
physical memory — an expensive resource — it allocates
to the process. With a small allocation of physical memory,
a non-transactional process can make forward progress, al-
beit at a slower rate. In a transactional system, a minimum
physical memory allocation may be required to make for-
ward progress.
In fact, given a chance, an LTM process can require up
to twice as much physical memory as it has been allocated
virtual memory. Transaction stores must be buffered spec-
ulatively until the transaction commits. Until commit, the
system must retain both copies of the data. For small
transactions, LTM holds the speculative copy in the processor’s
cache and the architected version in memory. When the size
of a transaction exceeds the size of the cache, LTM uses
an overﬂow table in main memory (provided by the operat-
ing system) to hold the data that does not ﬁt in the cache.
This overﬂow table can grow to be as large as the applica-
tion’s virtual memory space.
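The factor of two can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name and the page-granularity model are assumptions, it assumes every page in the footprint is written by the transaction, and it ignores hash-table tags and empty slots, which would push the real overflow table above this estimate.

```python
def worst_case_physical_pages(footprint_pages, cache_pages):
    """Rough count of physical pages needed to hold both versions of the data.

    footprint_pages: pages written by the transaction (hypothetical input)
    cache_pages:     pages of speculative state the cache can hold
    """
    architected = footprint_pages  # old values stay resident in memory
    # Speculative values that do not fit in the cache spill to the overflow table.
    spilled = max(0, footprint_pages - cache_pages)
    return architected + spilled
```

For a footprint far larger than the cache, the result approaches twice the footprint; for a transaction that fits in the cache, no extra memory is needed.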
Of course, the operating system is never obligated to
allocate resources so as to let a process make forward
progress; it could kill the process at any time. What is
different in systems with unbounded transactions is that an
application can put the operating system in a position in which
it has to choose whether to provide the
application the necessary resources (bounded only by the virtual
memory size of the application) or to refuse the process
forward progress; there is no compromise.
4. The Burden of Verifying Transaction Inde-
pendence
As noted in Section 1, the fundamental advantage of
the transactional programming model is that it obviates the
need to assign synchronization variables to critical sections.
Programmers are no longer required to identify which criti-
cal sections can conﬂict; this becomes the responsibility of
the hardware.
In relieving the programmer of the burden of identify-
ing which critical sections could conﬂict, we lose the in-
formation about which critical sections deﬁnitely do not
conflict. Without such information, they will all need to
be checked for conflicts, even if they belong to processes
with non-overlapping virtual address spaces. If this
verification becomes a bottleneck, two processes can impact
each other in ways that have no analog in conventional
(non-transactional) machines. Support for unbounded
transactions presents a challenge to ensuring that this verification
does not become a bottleneck.
1 This coupling can be avoided by allowing transaction state to be
paged, as is done in UTM.
In this section, we review how the proposed hardware
transactional memory systems implement transactions in a
cost effective manner. We then demonstrate the challenges
of performance isolation in TCC and LTM in the presence
of large, long running transactions.
4.1. Hardware Support for Transactions
Providing support for transactions in hardware requires
two mechanisms: 1) the ability to buffer stores for commit
at transaction completion time, and 2) the ability to
monitor the addresses of loaded values to identify conflicts.
Proposed implementations provide these mechanisms in a
cost-effective way by augmenting existing structures and
mechanisms. The cache can be used to hold speculative
transaction state by tagging lines that have been dirtied by an
uncommitted transaction. Conflicts can be detected by
tracking which blocks in the cache have been read from and
using the cache coherence protocol to abort a transaction
when the necessary permission for a cache line involved
in the current transaction is lost. In this way, transactions
that ﬁt within the cache can be supported quite efﬁciently.
To support larger transactions, each proposed implementa-
tion falls back on a less efﬁcient mechanism that can sup-
port arbitrarily-sized transactions.
In TCC, transaction state can only be buffered in the
cache and its associated victim buffer. Cache lines that have
been read or modiﬁed by a transaction “may not be ﬂushed
in TCC from the local cache hierarchy in mid-transaction.”
If cache conﬂicts or capacity necessitate ﬂushing to make
forward progress, then TCC must request commit permis-
sion in the middle of the transaction. Once commit per-
mission is obtained the transaction can be completed non-
speculatively: the existing transaction state can be commit-
ted and conﬂicts are prevented by holding commit permis-
sion until the transaction completes execution. Until this
transaction commits, other processors are not allowed to
commit their transactions.
In LTM, when a transaction’s memory footprint exceeds
what the cache can hold, evicted blocks are written to
an overﬂow hash table: a region of uncacheable memory
whose size and location are determined by the operating
system. Overﬂow bits are added to each set of the cache;
an overﬂow bit is set when a cache line of a transaction has
to be evicted from the cache because of capacity or limited
associativity. When the cache receives an intervention
for a line that maps to an overflowed set, then “the overflow
hash table is searched for the requested line.” The time
to perform this search depends on the number of conflicts in
the hash table, which in turn depends on how densely
populated the hash table is. At minimum, servicing such an
intervention requires one round-trip memory access time. While
handling an overflow, the proposed LTM machine NACKs
all incoming cache interventions.

address ← first address to load
B ← size of cache block in bytes
last ← address + 2 × B × N   {last address to load}
BEGIN TRANSACTION
while address ≤ last do
    read MEM[address]
    address ← address + B
end while
while true do
    do nothing   {infinite loop}
end while
END TRANSACTION
Figure 1. A transaction that overflows every set of a
cache with N blocks of size B-bytes each (by reading
2N words of data) then enters an infinite loop.
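To see why the read loop of Figure 1 overflows every set, the access pattern can be replayed against a simple set-associative cache model. The sketch below makes several assumptions that are not in the original proposal: the function name and parameters are hypothetical, blocks map to sets by block number modulo the number of sets, the loop bound is treated as inclusive, and a set is marked as overflowed as soon as a transactional block arrives while the set is already full of transactional blocks.

```python
def overflowed_sets(first_address, block_size, num_blocks, associativity):
    """Replay the read loop of Figure 1 and count sets whose overflow bit is set."""
    num_sets = num_blocks // associativity
    resident = [set() for _ in range(num_sets)]  # transactional blocks per set
    overflow_bit = [False] * num_sets

    last = first_address + 2 * block_size * num_blocks
    address = first_address
    while address <= last:
        block = address // block_size
        index = block % num_sets
        if block not in resident[index]:
            if len(resident[index]) == associativity:
                # The set is full of transactional lines, so one must spill.
                overflow_bit[index] = True
            else:
                resident[index].add(block)
        address += block_size
    return sum(overflow_bit)
```

Because the loop touches roughly twice as many blocks as the cache holds, every set receives more blocks than it has ways, regardless of the associativity chosen in this model.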
4.2. TCC and unbounded transactions
As proposed, TCC provides no performance isolation in
the presence of large, long running transactions; one
process can completely prevent other processes from making
forward progress. Consider the code fragment shown in
Figure 1. This transaction has two phases: 1) it reads enough
data to overﬂow the transaction hardware, and 2) it enters
an inﬁnite loop so the transaction is never committed.
In the ﬁrst phase of the transaction, enough data is read
so that the processor’s transaction support is exceeded. At
this point, execution of the transaction is stalled and the
processor arbitrates for commit permission, so that the
transaction can be completed non-speculatively. Eventually,
commit permission will be granted along with exclusive access
to the memory bus. If the process never ends its transaction
(the second phase of the transaction), the bus lock is never
released. This would prevent all other processes from com-
mitting or even making progress past a cache miss or up-
grade. As proposed, a TCC machine will allow this attack
to lock up a machine indeﬁnitely with no mechanism for it
to be interrupted. In fact, a TCC machine could be dead-
locked by overﬂowing as little as one set of the cache and
the victim buffer.
Clearly, deadlock can be avoided by including a watch-
dog timer and aborting the partially committed transaction
if it holds commit permission too long. This potentially
leaves the process’s memory in an inconsistent state. As a
result, it is likely that the associated process will need to be
killed to prevent this inconsistent state from being
propagated to persistent state.
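The watchdog idea can be sketched as a small arbiter model. Everything here is hypothetical (the class name, the cycle-based limit, and the policy of recording the holder as killed); it is meant only to show how bounding the time commit permission is held restores progress for other processors, at the cost of the holder.

```python
class CommitArbiter:
    """Toy model of a single system-wide commit permission with a watchdog."""

    def __init__(self, watchdog_limit):
        self.watchdog_limit = watchdog_limit  # max cycles permission may be held
        self.holder = None
        self.held_cycles = 0
        self.killed = []  # processes whose partially committed transaction was aborted

    def request(self, process):
        """Grant commit permission if it is free; return whether it was granted."""
        if self.holder is None:
            self.holder = process
            self.held_cycles = 0
            return True
        return False

    def release(self, process):
        if self.holder == process:
            self.holder = None

    def tick(self):
        """Advance one cycle; revoke permission from a holder that exceeds the limit."""
        if self.holder is None:
            return
        self.held_cycles += 1
        if self.held_cycles > self.watchdog_limit:
            # Memory may be inconsistent after a partial commit, so the
            # process is recorded as killed rather than merely restarted.
            self.killed.append(self.holder)
            self.holder = None
```

In this model a process that overflows and then never ends its transaction blocks other requesters only until the limit expires, after which a waiting process can obtain permission.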
4.3. LTM and unbounded transactions
On the LTM machine, the ﬁrst phase of the transaction
will similarly overﬂow the transactional hardware support
of the processor. When such an overﬂow occurs, the LTM
machine marks the overﬂowing set and spills transaction
state to an uncached region of main memory. To commit
a transaction, the LTM processor has to retain (for the du-
ration of the transaction) read and write permission to all
cache blocks that are loaded from and stored to.²
To check for conﬂicts with a transaction, the LTM ma-
chine will need to check every incoming coherence request
to see if it will require the transaction to relinquish permis-
sion to one of its blocks. As the overﬂow transaction state
is stored in a hash table, the hardware does not know a pri-
ori how many hash table entries will need to be checked. At
least one main memory access will be required for each co-
herence request to an overloaded set.
If all sets have overﬂowed, then every incoming coher-
ence request will trigger at least one additional main mem-
ory access. Because other incoming coherence requests are
being NACKed during an overﬂow, coherence requests are
no longer overlapped; the coherence throughput of this node
drops to the inverse of main memory latency. In a system
with a snooping coherence protocol, all coherence requests
are sent to every processor, bringing memory system
performance to a crawl when any processor overflows many sets
of its cache.
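The throughput bound follows directly from serialization: if each request needs at least one memory access and requests cannot overlap, at most one request completes per memory latency. The helper below is a hypothetical illustration, and the 100 ns latency in the usage note is an assumed figure, not a measurement from any of the proposals.

```python
def serialized_coherence_throughput(memory_latency_ns, lookups_per_request=1):
    """Upper bound on coherence requests per second when lookups are serialized.

    lookups_per_request is the number of overflow-table probes each request
    needs (at least one; more when the hash table has collisions).
    """
    seconds_per_request = memory_latency_ns * lookups_per_request * 1e-9
    return 1.0 / seconds_per_request
```

With an assumed 100 ns memory latency and one probe per request, the node can service at most about ten million coherence requests per second, and the bound halves if each request needs two probes.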
Although the authors do not identify this as an impor-
tant consideration, LTM was evaluated in the context of
a directory-based multiprocessor. In a directory-based ma-
chine, the directory maintains a list of potential sharers, and
interventions are only sent to those processors that could
have a cached copy of the line in question. A directory-
based coherence scheme largely prevents a process in one
address space from affecting others, because there will be
few³ cache interventions between the processes. In effect,
the directory is tracking the physical address space overlap
of two processors.
2 It should be noted that this is only one means to ensuring the ap-
pearance of atomic execution. Strictly speaking, the permissions to
the read and write sets only need to be held simultaneously at com-
mit time. If we record the address/value pairs of all loads and stores
within the transaction, then permission to blocks can be lost without
aborting the transaction. If blocks are lost, they must be reacquired at
commit time and any read values must be re-validated and stored val-
ues re-written. The more conservative scheme proposed in LTM can
be easily justiﬁed because it likely performs as well in the common
case and is simpler and has lower storage requirements.
3 A few interventions are possible on a system with physically-
addressed caches, because caches are generally not ﬂushed on con-
text switches. The number of interventions, however, is bounded by
the size of the cache.
It is important to note that directory schemes can ef-
fectively ﬁlter coherence requests only to the degree that
transaction state of individual threads can be distinguished
by the directory. In the context of chip multiprocessors
(one of the major motivations for transactional programming
models), it is common for lower levels of the cache
hierarchy to be shared. If a shared L2 is used to hold
transaction state, then, when one processor’s transaction
overflows the L2 cache, all requests to the overflowed sets of
the cache (both requests from the CMP’s cores and exter-
nal interventions for data on any core of the CMP) will be
penalized. The same problem occurs if transaction state is
limited to the L1 cache and a hierarchical coherence proto-
col that uses a snooping protocol for intra-CMP coherence
is used. This problem occurs also with multithreaded pro-
cessors that share an L1 data cache.
Achieving performance isolation in the presence of
unbounded transactions appears to require limiting transaction
state to unshared caches and using a flat directory-based
coherence protocol that recognizes all of the cores
independently. Only in this way can one processor retain coherence
permission to more memory than it can cache (so that it will
receive interventions for all data that belong to the
transaction), but not receive interventions for data it has not
accessed. Hierarchical coherence schemes, as they have
generally been proposed, fall short of this mark, even when they
use directory protocols for intra-CMP coherence. Typically,
the on-chip directory (e.g., Piranha’s duplicate L1 tags [2])
is sized to maintain directory information only for what the
caches can hold. When a transaction overflows its cache, it
would also overflow its local directory, requiring snooping
the overflowed transaction state for all coherence requests
arriving at this chip that map to overflowed sets.
In one of the above scenarios, when a directory can only
distinguish groups of nodes, impact to the whole machine’s
performance can be avoided by sacriﬁcing one group. Con-
sider a system composed of CMPs in which the intra-CMP
coherence is maintained by a snooping protocol, but a
directory is used for inter-CMP coherence. If one core of a CMP
overﬂows its cache, the coherence trafﬁc of the threads on
the CMP and any threads that share with them could be im-
pacted. To isolate the overﬂowing thread, all other cores on
the CMP could be halted (for the duration of the transac-
tion) and the caches ﬂushed of non-transactional data. This
technique bounds the impact that one thread can have on
the whole machine. Even in the context of a machine with a
full directory, if the cores are multithreaded (SMT or other-
wise), it may be necessary to halt the other threads running
on a core with an overﬂowing transaction.
LTM does bound the length of time a transaction can run
to the length of a time slice, but this provides little help. If a
transaction has not completed at the end of its time slice, it
is aborted and restarted the next time the process is
scheduled. Because it takes little time to overflow a cache, a
process can spend almost all of the time it is scheduled in the
overflowed state. Again, the operating system can choose
not to schedule the process, but there is no means to grant
the process the ability to make forward progress without
impacting the system as a whole.
5. Sharing through the Kernel
One challenge in implementing a transactional
programming model that supports long running transactions is
ensuring forward progress. In order to commit, a long
running transaction needs to retain coherence permission on all
cache lines it touches. When other transactions are
modifying the same data, the long running transaction is at risk
of starvation. To avoid starvation, a number of transactional
memory schemes have proposed recording the time a trans-
action is started the ﬁrst time, and, when a conﬂict is de-
tected, aborting the younger transaction using these times-
tamps [1, 9].
Such attempts to avoid starvation of long transactions
can actually allow an unbounded transaction to starve other
transactions. If conflicts are resolved by aborting or stalling
the younger transaction, then one thread could starve
another thread in the same virtual address space by
performing a transaction that touches every cache block in the
virtual address space. Eventually, this transaction will become
the oldest, and all other threads would be prevented from
executing memory operations. Such a denial of service is not
a practical concern, because, if a malicious thread had
access to the victim’s address space, it could corrupt data, a
much worse threat. That said, if care is not taken, such a
denial of service could occur between two processes from
different virtual address spaces, because all processes share
the same kernel data structures.
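The starvation argument can be checked with a minimal model of age-based conflict resolution. The names below (`Transaction`, `resolve_conflict`, `count_victim_aborts`) are hypothetical, and the model assumes that a transaction keeps the timestamp of its first start across restarts, as the timestamp schemes cited above do.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    owner: str
    first_start_time: int  # timestamp retained across aborts and restarts


def resolve_conflict(a, b):
    """Return (winner, loser); the older transaction always wins."""
    if a.first_start_time <= b.first_start_time:
        return a, b
    return b, a


def count_victim_aborts(attacker, victim, attempts):
    """Count how often the victim loses when it retries against an attacker
    that never commits, so that every retry conflicts again."""
    aborts = 0
    for _ in range(attempts):
        _, loser = resolve_conflict(attacker, victim)
        if loser is victim:
            aborts += 1
    return aborts
```

Once the attacker holds the oldest timestamp, the victim loses every retry, no matter how many times it restarts; retaining the victim's own timestamp does not help because it can never become older than the attacker.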
The proposed Unbounded Transactional Memory
(UTM) [1] supports the nesting of transactions, in which
the outer transaction subsumes the inner transaction. If sys-
tem calls are allowed to be performed within a transac-
tion, then kernel critical sections could be subsumed into
user-mode transactions. Although the transaction can-
not directly modify the kernel data structures, it could
prevent other processes from updating them for arbi-
trary amounts of time.
If transactions can subsume kernel execution, then a
process could start a transaction, perform a number of system
calls attempting to modify key kernel data structures (e.g.,
the kernel memory allocator, the allocator of disk blocks,
message buffers for an Ethernet adapter, etc.), and then
never commit. When another process requires the same data
structure, a conflict would be detected and the “attacking”
process would likely be older, so UTM would stall or
repeatedly abort the transaction of the “victim” thread,
causing it to fail to make forward progress. Although modern
kernels are designed to minimize accesses to system-wide
data structures, these structures are not eliminated, and they
will eventually be modified by a process that repeatedly
performs system calls, due to the limited resources allocated to
the per-processor data structures.
Although this problem could be resolved by not allowing
user-mode transactions to include system calls, this does
restrict the power of transactional programming. First,
programmers would have to request any resources (e.g.,
allocating virtual memory, opening/mmaping files, reading
from sockets) before starting the transaction; this could be
difﬁcult, especially for very long running transactions. Sec-
ond, any side effects from the transaction that go beyond
user memory (e.g., writes to sockets, renaming of ﬁles)
would have to be buffered within the application then per-
formed in earnest after the transaction has completed; this
approach presents two difﬁculties: 1) the kernel has pro-
vided no guarantee that these actions can be completed suc-
cessfully, 2) there is no guarantee on when the actions will
be completed, weakening the atomic nature of transactions.
The problem can be somewhat mitigated by resolving
conﬂicts between transactions by priority rather than (or in
addition to) age, as was suggested by Herlihy and Moss [6].
If higher priority transactions are never aborted by lower
priority ones, then at least priority inversion—in which a
lower priority thread starves a higher priority thread—will
not occur.
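One way to express priority-first resolution is as an ordering on (priority, age) pairs. The function below is a hypothetical sketch: the tuple representation, the convention that a larger number means higher priority, and the use of age only as a tie-breaker are all assumptions made for illustration.

```python
def pick_winner(a, b):
    """Choose which of two conflicting transactions proceeds.

    Each argument is a (priority, first_start_time) tuple. The higher
    priority wins; among equal priorities the older transaction wins.
    """
    def rank(txn):
        priority, first_start_time = txn
        return (-priority, first_start_time)

    return min(a, b, key=rank)
```

Under this ordering an old, low-priority transaction can no longer starve a younger, high-priority one, although transactions of equal priority remain subject to the age-based behavior described earlier.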
6. Discussion
In this section, we question whether unbounded transac-
tions are a reasonable goal. Clearly it is important to not
limit transactions to ﬁt in a small cache with a given asso-
ciativity, but does that necessitate that unbounded transac-
tions should be supported?
In the absence of performance isolation, systems will
not guarantee forward progress for unbounded transactions.
In some of the scenarios previously discussed, truly un-
bounded transactions with forward progress guarantees can
permanently prevent other threads from making forward
progress. It seems inevitable that watchdog timers would
be instituted, causing a trap to the operating system when a
thread was being starved, so that an abusive thread could be
killed.
Note that systems will not be able to distinguish an abu-
sive process from a well-intended one that happens to use
very large transactions. Although we have illustrated the
problematic scenarios in their simplest form, they could be
arbitrarily obfuscated so as to be indistinguishable from a
useful program.
If unbounded transactions will not be guaranteed to be
executed, should they be architected at all? Is it reasonable
to expect a transactional memory system to support transactions
that touch hundreds of megabytes of data and run for
minutes of execution time? Is supporting such transactions
worth restricting the transactional programming model in
other ways?
For example, the recent VTM proposal [11] is not susceptible
to the scenarios described in Section 3 and Section 5,
but only because the machine does not support transactions
that extend past the user space of a single process (no sys-
tem calls or shared memory regions are supported).
If it is impossible to simultaneously support un-
bounded transactions, starvation avoidance, and trans-
actions that communicate with the kernel and/or other
processes, which should be supported? From our per-
spective, unbounded transactions appear to be the least
useful. For the rare, very large, critical sections that pro-
grammers want to implement, traditional lock-based
synchronization is an option. In contrast, starvation avoid-
ance and system calls/interprocess communication will be
hard to support if not provided by the transactional mem-
ory directly.
Alternatively, “unbounded” transactions can be sup-
ported, but without the same quality of forward-progress
guarantee. Among threads of an application or set of co-
operating applications, contention can be resolved through
a software “contention manager” that selectively sched-
ules a subset of the threads to ensure forward progress
at the application level [5, 8]. Forward progress guaran-
tees, however, are harder to realize when the contention
occurs between two independent applications, as can oc-
cur when transactions require kernel services. For such
circumstances, a forward progress guarantee appears un-
realistic, and such transactions can be executed with “best
effort.”
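As a rough illustration of what such a software contention manager might do, the sketch below shrinks the set of schedulable threads when an interval produces aborts but no commits, and grows it again when an interval is free of aborts. The class name, the halving policy, and the choice of which threads remain schedulable are assumptions made for this example and are not taken from [5, 8].

```python
class ContentionManager:
    """Hypothetical application-level contention manager."""

    def __init__(self, threads):
        self.threads = list(threads)
        self.allowed = len(self.threads)  # threads currently schedulable

    def report(self, commits, aborts):
        # Called once per interval with the application's commit/abort counts.
        if commits == 0 and aborts > 0:
            # No forward progress: halve the schedulable set (minimum one).
            self.allowed = max(1, self.allowed // 2)
        elif aborts == 0:
            # No contention observed: cautiously readmit one more thread.
            self.allowed = min(len(self.threads), self.allowed + 1)

    def runnable(self):
        # Assumption: always favors the first threads; fairness is ignored.
        return self.threads[:self.allowed]
```

With a single schedulable thread, conflicts among the application's own threads disappear. Conflicts with an independent application can still abort that thread, which is why only a best-effort guarantee is available in that case.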
From our standpoint, unconstrained transactions appear
to be most useful early in the software development pro-
cess, a time when performance isolation is largely a non-
issue. During development, forward progress can be guar-
anteed at the expense of performance isolation to enable
functional testing. In the testing process, potentially difficult
transactions can be identified for refactoring by the developers
prior to the program’s distribution.
7. Conclusion
This paper shows that transactional memory systems
present new challenges to achieving performance isolation.
In particular, transactional memory systems have three
novel aspects: 1) the granularity of forward progress (i.e.,
the transaction) is controlled by the application, 2) the burden
of verifying the independence of transactions, and 3)
the potential for kernel critical sections to be subsumed by
user transactions. We have demonstrated how each of these
aspects presents challenges to performance isolation in the
context of previous transactional memory proposals.
Although we do not present a complete solution to these
problems, we provide two insights into them. First,
if implemented properly, directory protocols can be used
to isolate the transactional state of one process from others,
preventing an overflowing transaction from impacting
threads that do not share with it. Second, a fundamental difficulty
is encountered when trying to achieve all three of the
following: unbounded transactions, forward progress guarantees,
and the potential for contention between independent
processes (as could occur through the kernel). This
difficulty suggests that large transactions be architected
with only “best effort” forward progress guarantees, with
application-level support for managing contention within an
application and feedback to the developer on large transactions
that potentially conflict through the kernel. Regardless,
performance isolation will need to be considered in
architecting transactional programming models.
8. Acknowledgments
This research was supported in part by NSF CCR-
0311340 REU, NSF CAREER award CCR-03-47260 and
a gift from the Intel Corporation. We thank Milo Mar-
tin, Ravi Rajwar, Pierre Salverda, and the anonymous
reviewers for feedback on previous drafts of this pa-
per.
References
[1] C. S. Ananian, K. Asanović, B. C. Kuszmaul, C. E. Leiserson,
and S. Lie. Unbounded transactional memory. In
Proceedings of the Eleventh IEEE Symposium on High-
Performance Computer Architecture, pages 316–327, Feb.
2005.
[2] L. A. Barroso, K. Gharachorloo, R. McNamara,
A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets,
and B. Verghese. Piranha: A scalable architecture based on
single-chip multiprocessing. In Proceedings of the 27th An-
nual International Symposium on Computer Architecture,
pages 282–293, June 2000.
[3] D. Grunwald and S. Ghiasi. Microarchitectural denial of service:
Insuring microarchitectural fairness. In Proceedings of
the 35th Annual IEEE/ACM International Symposium on Microarchitecture,
pages 409–418, Nov. 2002.
[4] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D.
Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis,
and K. Olukotun. Transactional memory coherence and con-
sistency. In Proceedings of the 31st Annual International
Symposium on Computer Architecture, pages 102–113, June
2004.
[5] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-free
synchronization: Double-ended queues as an example. In
Proceedings of the 23rd IEEE International Conference on
Distributed Computing Systems, May 2003.
[6] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural
support for lock-free data structures. In Proceedings
of the 20th Annual International Symposium on Computer
Architecture, pages 289–300, May 1993.
[7] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and
partitioning in a chip multiprocessor architecture. In Proceedings
of the International Conference on Parallel Architectures
and Compilation Techniques, Sept. 2004.
[8] R. Rajwar. Personal communication, May 2005.
[9] R. Rajwar and J. R. Goodman. Transactional lock-free execution
of lock-based programs. In Proceedings of the Tenth
International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 5–17,
Oct. 2002.
[10] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling
highly concurrent multithreaded execution. In Proceedings
of the 34th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 294–305, Dec. 2001.
[11] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing transactional
memory. In Proceedings of the 32nd Annual International
Symposium on Computer Architecture, June 2005.
[12] B. Verghese, A. Gupta, and M. Rosenblum. Performance iso-
lation: Sharing and isolation in shared-memory multiproces-
sors. In Proceedings of the Eighth International Conference
on Architectural Support for Programming Languages and
Operating Systems, pages 181–192, Oct. 1998.