HeTM: Transactional Memory for Heterogeneous Systems by Castro, Daniel et al.
Daniel Castro
INESC-ID. Instituto Superior Técnico
Universidade de Lisboa
Lisbon, Portugal
daniel.castro@tecnico.ulisboa.pt
Paolo Romano
INESC-ID. Instituto Superior Técnico
Universidade de Lisboa
Lisbon, Portugal
romanop@gsd.inesc-id.pt
Aleksandar Ilic
INESC-ID. Instituto Superior Técnico
Universidade de Lisboa
Lisbon, Portugal
aleksandar.ilic@tecnico.ulisboa.pt
Amin M. Khan
Department of Computer Science
UiT The Arctic University of Norway
Tromsø, Norway
amin.khan@uit.no
Abstract—Modern heterogeneous computing architectures,
which couple multi-core CPUs with discrete many-core GPUs (or
other specialized hardware accelerators), enable unprecedented
peak performance and energy efficiency levels. Unfortunately,
though, developing applications that can take full advantage
of the potential of heterogeneous systems is a notoriously hard
task. This work takes a step towards reducing the complexity of
programming heterogeneous systems by introducing the abstrac-
tion of Heterogeneous Transactional Memory (HeTM). HeTM
provides programmers with the illusion of a single memory
region, shared among the CPUs and the (discrete) GPU(s) of
a heterogeneous system, with support for atomic transactions.
Besides introducing the abstract semantics and programming
model of HeTM, we present the design and evaluation of a
concrete implementation of the proposed abstraction, which we
named Speculative HeTM (SHeTM). SHeTM makes use of a
novel design that leverages speculative techniques and aims
at hiding the inherently large communication latency between
CPUs and discrete GPUs and at minimizing inter-device
synchronization overhead. SHeTM is based on a modular and
extensible design that allows for easily integrating alternative
TM implementations on the CPU and GPU sides, which provides
the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications’
workload and the architectural characteristics of the processing
unit. We demonstrate the efficiency of SHeTM via an
extensive quantitative study based both on synthetic benchmarks
and on a port of a popular object caching system.
Index Terms—transaction, memory, CPU, GPU, heterogeneous,
computing, system
I. INTRODUCTION
Single-core performance of central processing units (CPUs)
has reached a plateau in the last decade. In order to enable
further increases of the processing capacity, while attaining
high energy efficiency, modern computing architectures have
henceforth adopted two key paradigms, namely parallelism
and heterogeneity. As a result, nowadays, heterogeneous
architectures that combine multi-core CPUs with many-core
GPUs (or similar co-processors, e.g., TPUs [32]) have become
the de facto standard in a broad range of domains that include
HPC, servers and mobile devices.
Unfortunately, though, developing applications that take
full advantage of the raw performance potential of modern
massively parallel, heterogeneous architectures is a notoriously
hard task. This has fostered, over the last years, intense research
efforts aimed at developing new abstractions and programming
paradigms for reducing the complexity of software development
for modern heterogeneous platforms.
This work focuses on one key problem that arises when
developing concurrent applications, whose complexity is ex-
acerbated when considering massively parallel heterogeneous
architectures, namely how to regulate access to shared data in
a scalable way.
We tackle this problem by introducing the abstraction of
Heterogeneous Transactional Memory (HeTM, pronounced as
hey-tee-em). The HeTM abstraction stands at the intersection
of two well-known paradigms for concurrent programming, i.e.,
Transactional Memory (TM) [24], [49] and Shared Memory
(SM) [34], [33], by providing the illusion of shared memory
regions that can be seamlessly accessed by the CPUs and the
discrete GPUs of a heterogeneous system and whose concurrent
accesses can be synchronized via atomic transactions.
A large body of research has been devoted over the last
years to investigate efficient implementations of both the TM
and SM abstractions. On the SM front, several implementations
have been proposed by major industrial players in the heterogeneous
computing landscape, e.g., NVIDIA [22] and OpenACC [14]. These
efforts represent strong evidence that the industrial world
perceives the benefits, in terms of ease of programming,
stemming from the adoption of the SM paradigm as compelling.
However, existing SM implementations for heterogeneous
systems provide programmers with low-level synchronization
primitives, such as atomic operations and locks, exposing
programmers to another well-known source of complexity: the
need to design efficient, yet provably correct, techniques to
arXiv:1905.00661v1 [cs.DC] 2 May 2019
regulate concurrent access to shared data.
This is a notoriously hard problem, as designing efficient
fine-grained locking protocols is a complex and error prone
task [41] that can compromise one of the key principles
of modern software development processes, i.e., software
composability [23]. TM addresses exactly this problem: thanks
to the abstraction of atomic transactions, programmers only
need to specify which set of operations/code blocks have to be
executed atomically, delegating to the TM implementation the
problem of how atomicity should be achieved. The literature
on TM has been very prolific over the last decade, leading to
the development of a plethora of solutions in software [49],
[15], hardware [24], [58], [40] and combinations thereof [9].
Existing TM solutions, however, consider homogeneous systems,
in which threads execute either on CPUs [58], [46] or on
(discrete) GPUs [18], [55]. As such, existing TM systems fall
short in harnessing the full potential of heterogeneous systems,
failing to support execution scenarios in which CPUs and
GPUs cooperate by concurrently accessing and manipulating
the same state [25], [61], [39], [38] — which is precisely the
goal pursued by HeTM.
Building an efficient TM for heterogeneous systems, though,
is far from being a trivial task. In fact, in homogeneous
platforms, where the TM abstraction is confined within the
boundaries of a single processing device, e.g., a multi-core
system or a discrete GPU, conflict detection can be imple-
mented via fast communication channels, e.g., the caches of a
multi-core system. This is what allows existing TM designs
to incur limited overhead, even though they trigger conflict
detection multiple times during transaction execution, possibly
as frequently as upon each memory access of a transaction.
The HeTM abstraction, though, spans physically separated
computational devices, which communicate via channels, such
as PCIe [43], that are orders of magnitude slower than the
ones assumed by conventional TM systems for homogeneous
platforms. In these settings, thus, conventional TM approaches
that impose multiple system-wide synchronizations along the
critical path of execution of each transaction would incur
prohibitive overheads that would cripple performance.
In this work, we tackle this challenge by presenting SHeTM
(Speculative HeTM), the first implementation of the proposed
HeTM abstraction. SHeTM leverages a set of novel techniques
that operate in synergy to effectively mask the latency of the
inter-device communication bus:
Hierarchical conflict detection. SHeTM employs a new, hierarchical
approach to conflict detection, which aims at removing
inter-device conflict detection from the critical execution path
of each transaction and at amortizing its cost across batches of
multiple transactions.
More precisely, SHeTM detects intra-device conflicts, i.e.,
conflicts generated between transactions that execute on the
same computational unit (CPU or GPU), by relying on
conventional TM implementations for homogeneous systems —
an approach that we term synchronous, as conflicts are detected
during transaction execution.
Inter-device conflicts, conversely, are checked asyn-
chronously, i.e., conflicts are detected periodically between
batches of transactions that are concurrently executed and
committed, in a speculative fashion, at different computational
units. In absence of conflicts, the updates of each device are
merged and a consistent replica of the same state is installed
at both of them. If inter-device conflicts are revealed, the
transactions speculatively committed are rolled-back and the
state of the devices whose transactions were discarded is re-
aligned to that of the “winning” device.
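The outcome of a synchronization round described above can be illustrated with a minimal sketch. It assumes two in-memory replicas modelled as arrays of 32-bit words, disjoint write-sets when no conflict is reported, and a whole-replica copy as the re-alignment step. The names update_t, apply_updates and end_round are hypothetical and are not part of SHeTM's interface; the conflict flag is assumed to be computed elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One speculative update: word index inside the replica and new value.
 * Hypothetical type, introduced only for this sketch. */
typedef struct {
    size_t  idx;
    int32_t val;
} update_t;

/* Apply a batch of updates that was produced on the other device. */
static void apply_updates(int32_t *replica, size_t words,
                          const update_t *ws, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(ws[i].idx < words);
        replica[ws[i].idx] = ws[i].val;
    }
}

/* End of a synchronization round. 'conflict' is the outcome of the
 * inter-device check (computed elsewhere); 'cpu_wins' selects the device
 * whose speculative batch survives a conflict. Returns true when both
 * batches become final committed. */
static bool end_round(int32_t *cpu, int32_t *gpu, size_t words,
                      const update_t *cpu_ws, size_t n_cpu,
                      const update_t *gpu_ws, size_t n_gpu,
                      bool conflict, bool cpu_wins)
{
    if (!conflict) {
        /* Merge: each replica receives the other device's updates. */
        apply_updates(gpu, words, cpu_ws, n_cpu);
        apply_updates(cpu, words, gpu_ws, n_gpu);
        return true;
    }
    /* Roll back the losing device by re-aligning it to the winner's
     * state (copying the whole replica is a simplification). */
    if (cpu_wins)
        memcpy(gpu, cpu, words * sizeof *gpu);
    else
        memcpy(cpu, gpu, words * sizeof *cpu);
    return false;
}
```

In both branches the two replicas end the round in the same state, which is the invariant the next round starts from.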
The use of speculation and asynchronous inter-device
conflict detection not only amortizes the performance toll
imposed by synchronization over a high-latency interconnection
bus across a large number of transactions, but also
enables the use of embarrassingly parallel conflict detection
schemes that, by operating on large transaction batches, can
be very efficiently executed by modern GPUs.
Non-blocking inter-device synchronization: Although the cost
of inter-device conflict detection can be amortized over a
batch of transactions, the larger the batch of transactions
processed in a synchronization round, the higher the likelihood
of experiencing conflicts across devices. Thus, in conflict prone
workloads, where it is desirable to use smaller transaction
batches, it is crucial to reduce the overhead of the inter-device
synchronization by minimizing the period of time during which
transaction processing is blocked.
To this end, SHeTM introduces an innovative scheme
that ensures that, even while inter-device synchronization is
being performed, either the CPU or the GPU is able to
process transactions. This goal is achieved by combining two
mechanisms: i) overlapping the GPU-based validation of the
transactions’ batch with the processing of transactions on the
CPU-side; ii) letting the GPU start processing the transactions
of the next synchronization round, while the updates produced
by the transactions it executed in the current round are being
copied back to the CPU.
Conflict-aware dispatching & early validation: SHeTM exposes
a programmatic interface that allows applications to control the
assignment of transactions to either CPU or GPU. SHeTM
exploits this information to implement a conflict-aware transaction
dispatching scheme that aims to reduce the likelihood of
inter-device contention (which, as mentioned, leads to rolling
back a batch of speculatively committed transactions at either
device), by dispatching transactions that are likely to contend to
the same device, where conflicts can be detected and resolved
efficiently using the local TM implementation.
Further, in case inter-device contention does arise, SHeTM
employs an early validation scheme that aims to reduce
overhead (i.e., wasted work) by detecting conflicts before the
synchronization phase for the current round is activated.
Modular and extensible design. SHeTM is designed to ease
integration with generic CPU-based and GPU-based TM
implementations. To this end, SHeTM exposes a simple generic
interface, which a TM needs to invoke in order to expose
to SHeTM the read-sets and write-sets of the transactions it
speculatively commits.
The ability of SHeTM to incorporate different TM implemen-
tations is quite relevant in practice, given that the design space
of TM is very wide and a number of studies have shown that
no one-size-fits-all TM implementation exists that can ensure
optimal performance across all possible workloads [56], [10].
This flexibility therefore makes it easy to incorporate additional
TM implementations in SHeTM, and to further increase the
robustness of its performance in a wide spectrum of workloads.
We evaluate SHeTM via an extensive experimental study,
based on synthetic benchmarks — which we use to shape
workloads aimed at quantifying the overheads and gains
deriving from the various mechanisms SHeTM employs —
and a real world application, MemcachedGPU [25] — which
allows us to assess SHeTM’s performance with a realistic
workload as well as to showcase the benefits, in terms of
load balancing and ease of programming, stemming from the
possibility of concurrently accessing common data from
physically separated computational units.
II. RELATED WORK
Existing programming models for heterogeneous systems aim
at providing different abstraction levels to unify the execution
among devices with different architectures, programming
paradigms and memory spaces. These models span from low-
level and user-managed frameworks (such as OpenCL [19])
up to the fully automated run-time systems (e.g., StarPU [1],
OmpSS [13] and Cashmere [26]). Other recent efforts aimed
at simplifying the accelerator programming with high-level
OpenMP-like directives, as highlighted in OpenACC [20] and
OpenMP 4.0 [4]. There are a number of ongoing efforts in
academia and industry aimed at automating data
management and support for unified memory in hybrid accelerated
systems. Examples include compiler-level optimizations
(CGCM [30], Spark-GPU [59] and RSVM [31]), NVIDIA
CUDA Unified Memory [7] or even support at the Linux kernel
level [52]. These solutions share our high-level goal of
simplifying the development of applications for heterogeneous
systems. Yet, none of them tackles the challenges involved
in ensuring the consistency guarantees provided by TM [21],
exposing programmers to the notorious complexity of
lock-based synchronization [23].
As mentioned, the literature in the area of TM has elaborated
a plethora of designs, exploring both hardware and software
implementations. The majority of the existing literature has
focused on investigating TM implementations for CPUs,
although a number of TM systems for GPUs [18], [27],
[57], [50] have been explored of late. In this area, a relevant
related work is the recent APUTM [54], which addressed the
problem of implementing a STM for integrated GPUs. However,
integrated GPUs reside in the same coherent domain as the
CPU, unlike the case of discrete GPUs — which we target in
our work. As such, developing a TM for integrated GPUs is a
much less challenging endeavor, as, in fact, this problem can be
solved by re-using existing designs for CPU TMs. To the best
of our knowledge, our work is the first to present a TM system
for heterogeneous systems that encompass both CPUs and
discrete GPUs. It is also the first work to revisit the definition
of conventional TM consistency semantics, e.g., [21], [28], to
take into account the specific architectural characteristics of
heterogeneous systems.
In a broad sense, HeTM is related to the work on speculative
processing in distributed systems. One example is optimistic
simulation systems [17], [44], in which the state of local
simulation objects is allowed to advance in a speculative fashion,
i.e., skipping synchronization with remote objects and rolling
back to a consistent state if it is detected a posteriori that
a relevant event from a remote object was missed. Another related
area is that of speculative transaction processing
techniques in distributed and replicated databases [45],
[48]. The principle is similar: transactions are allowed
to commit speculatively, and the state of individual database
replicas is automatically rolled back in case of
misspeculation. HeTM builds on the same principles, but introduces
new techniques designed specifically to meet the characteristics of
heterogeneous systems composed of GPU and CPU.
III. DEFINING THE HETM ABSTRACTION
As mentioned, HeTM provides the illusion of a single
transactional shared memory that is concurrently accessed
by a set of physically separated devices, where devices are
equipped with their own local memory and communicate over
an interconnection bus like PCIe.
In the definition of the HeTM abstraction we do not consider,
for the sake of generality, how transactions are generated
and dispatched to the various execution devices. We leave
the definition of these aspects to concrete implementations
of the HeTM abstraction (see Section IV). We will simply
assume that threads, in execution at any computational device
attached to the HeTM platform, can access and manipulate its
state exclusively by means of transactions. To this end, HeTM
exposes a conventional API, through which threads can start a
new transaction, submit read and write operations and request
the commit or abort of the transaction. Extending the proposed
HeTM abstraction to support intra-transaction parallelism [2],
[60] and non-transactional accesses [36] would be possible,
but it is outside of the scope of this work.
The rest of this section focuses on defining the correctness
semantics that should be expected from a HeTM platform,
such as the one that we will present in Section IV, that
exploits speculative techniques to mask the costs of inter-
device synchronization. More in detail, we intend to reason
on the correctness of TM implementations that can commit
transactions in a speculative fashion, i.e., without first checking
for conflicts with transactions executing on remote devices,
and that may therefore have to be later aborted in case an
inter-device contention is eventually detected.
This speculative transaction execution model, in which
transactions are first speculatively committed based only on
local information, and only subsequently final committed
(or simply committed), is desirable in a HeTM platform,
as it allows removing inter-device conflict detection from the
critical path of transactions’ execution. In fact, in such a model,
upon the speculative commit of a transaction T , i) the thread
that requested T ’s commit can be unblocked and process new
transactions, and ii) T ’s updates can be immediately made
visible to other local transactions. On the other hand, this
speculative execution model also enables a broader spectrum of
concurrency anomalies with respect to conventional transaction
execution models that do not contemplate the notion of
speculatively committed transactions.
We start by observing that existing consistency criteria for
classical TM systems, such as Opacity and Virtual World
Consistency [21], [28] (see Section II), are unfit to capture
the dynamics of the speculative transaction execution model
that we advocate to enable efficient implementations of the
HeTM abstraction. Roughly speaking, existing TM consistency
criteria ensure, with various nuances, two key properties:
• P1. The behavior of every committed transaction has to
be justifiable by the same sequential execution containing
only committed transactions, without contradicting real-
time order.
• P2. The behavior of any active transaction, even if it
eventually aborts, has to be justifiable by some (possibly
different) sequential execution containing only committed
transactions.
We argue that property P1, which specifies the correctness
semantics of committed transactions, remains adequate for the
case of a HeTM system. In fact, to preserve the ease of use of
the TM abstraction, speculation should serve solely to enhance
efficiency, but be totally hidden to applications. As such, the
consistency semantics of committed transactions should remain
unaltered, even if speculation is used for efficiency reasons.
Property P2, on the other hand, appears unfit to define the
consistency semantics of HeTM platforms. In fact, the specifica-
tion of P2 prohibits observing the updates of any uncommitted
transaction, thus including the updates of speculatively com-
mitted transactions. Hence, if a transaction T attempted to read
a data item updated by a speculatively committed transaction
T ′, P2 would oblige any HeTM implementation to block T
until the final outcome (commit/abort) of T ′ is determined —
limiting the effectiveness of speculation to mask the costs of
inter-device synchronization.
Note also that allowing active transactions to observe the
effects of any speculatively committed transaction would not be
a viable solution either. In fact, it would allow a transaction to
observe the effects of two conflicting speculatively committed
transactions. This would defeat the motivation at the basis of
P2: avoiding that applications may fail in complex/unpredicted
ways due to observing a state that no sequential execution
could have ever produced.
Overall, we argue that consistency semantics offered by a
HeTM platform should depart from classical TM consistency
criteria by allowing different devices to use different sequential
transaction histories to justify the execution of their local
transactions. Intuitively, these transaction histories should be
composed of: i) a prefix (possibly of different size) of the
sequential execution history containing committed transactions
(which, by P1, must be the same at each device), followed by ii)
a device-dependent sequential history composed of transactions
that speculatively committed at that device.
We capture these semantics via this variant of property P2:
• P2†. The behavior of any active or speculatively committed
transaction T has to be justifiable by some sequential
execution containing i) committed transactions and ii)
speculatively committed transactions that executed on the
same device as T .
Properties P2† and P2 pursue the same high-level goal: guaranteeing
that the state observed by any transaction T could
have been produced in some sequential execution. Unlike
P2, though, P2† allows including in the sequential execution
used to justify T ’s execution not only committed transactions,
but also speculatively committed transactions that executed
on the same device as T . This means that transactions that
execute at different devices must observe a common history
of committed transactions, but may witness the effects of
different speculatively committed transactions, which are still
being checked for inter-device conflicts.
Note that P2† requires that also the behavior of speculatively
committed transactions (and not only that of active transactions)
can be justified by a sequential execution. As active transactions
can only read from committed or speculatively committed
transactions, this implies that the only updates that can ever be
observed are the ones produced by transactions that reflect some
sequential history. Further, a transaction T can observe the
effects of a (speculatively committed or committed) transaction
T ′, only provided that T ′ does not conflict with any other
transaction T ′′, whose effects T has already observed so far.
In fact, if T were to observe the effects of two transactions
that conflict either directly or indirectly, it would be impossible
to include them both in the same sequential execution history
that should be used to justify the execution of T .
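The visibility rule implied by P2† can be made concrete with a small sketch. It encodes only a necessary condition, namely that a transaction may read from committed transactions or from transactions speculatively committed on its own device; it does not check the additional requirement that the observed transactions be mutually non-conflicting. The types tx_status and tx_info and the function may_read_from are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transaction states used only in this sketch. */
typedef enum {
    TX_ACTIVE,
    TX_SPEC_COMMITTED,  /* committed using local information only */
    TX_COMMITTED,       /* final committed after inter-device validation */
    TX_ABORTED
} tx_status;

typedef struct {
    int       device;   /* e.g., 0 = CPU, 1 = GPU */
    tx_status status;
} tx_info;

/* Necessary condition derived from P2-dagger: a reader may observe the
 * updates of 'writer' only if the writer is committed, or speculatively
 * committed on the reader's own device. */
static bool may_read_from(const tx_info *reader, const tx_info *writer)
{
    assert(reader != NULL && writer != NULL);
    if (writer->status == TX_COMMITTED)
        return true;
    if (writer->status == TX_SPEC_COMMITTED)
        return writer->device == reader->device;
    return false;
}
```

Under the original property P2, the second branch would always return false, which is what would force a reader to block until the writer's final outcome is known.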
IV. THE SHETM PLATFORM
This section presents SHeTM (Speculative HeTM), an imple-
mentation of the HeTM abstraction that relies on speculation
to minimize the overheads of inter-device synchronization.
A. Architecture and programming model
SHeTM implements the proposed HeTM abstraction for
heterogeneous platforms composed of one or more cache-coherent
multi-core CPUs and a discrete GPU. SHeTM is
implemented in C and relies on the CUDA API to orchestrate
the execution of the GPU.
SHeTM maintains a full replica of the shared TM region,
which we call STMR, on both the CPU and GPU. At each
device, the execution of transactions is regulated by a local
TM library, which we call a guest library. SHeTM adopts
a modular software architecture that seeks to attain
inter-operability with generic TM implementations for CPU and
GPU. This feature is important, since supporting the integration
of arbitrary guest TM libraries allows adapting the choice of the
TM implementation used on each device to the characteristics
of the application workload and the device. We discuss which
mechanisms SHeTM employs to integrate third-party guest TM
libraries, as well as the assumptions that these libraries need to
satisfy to correctly inter-operate with SHeTM, in Section IV-B.
Programming model. SHeTM offers a conventional TM
interface for demarcating (i.e., beginning, committing, aborting)
transactions and declaring read/write accesses to the STMR.
There are, however, relevant aspects related to the heteroge-
neous nature of the HeTM abstraction that programmers should
take into account when developing transactional applications
for SHeTM and that have influenced the design of SHeTM
programming interfaces.
A first observation is that the STMR’s replicas maintained
by the CPU and GPU may be mapped in different positions
in their address spaces. Thus, the management of pointers in
SHeTM raises issues analogous to the ones that affect other
implementations of the shared memory abstraction (e.g., in
POSIX mmap or SystemV shmem [51]), such as: if pointers
to a position within the STMR are stored in the STMR, they
should be expressed as relative offsets and not as absolute
addresses (and converted back to absolute addresses before
they are de-referenced); although SHeTM does not prevent
storing in the STMR pointers to memory regions external to
the STMR, it is the responsibility of the application developers to
ensure that these pointers are only dereferenced by the device
in whose address space they are defined.
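The offset-based pointer discipline described above can be sketched as follows. The descriptor stmr_replica, the helpers stmr_to_offset and stmr_to_pointer, and the reserved value STMR_NULL_OFF are hypothetical names introduced for this example, not SHeTM's interface; the sketch only assumes that each device knows the local base address and the size of its replica.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one device's STMR replica. */
typedef struct {
    unsigned char *base;  /* where this replica is mapped locally */
    size_t         size;  /* size of the STMR in bytes */
} stmr_replica;

/* Offset value reserved to encode a null reference. */
#define STMR_NULL_OFF UINT64_MAX

/* Convert a local absolute pointer into a replica-independent offset. */
static uint64_t stmr_to_offset(const stmr_replica *r, const void *p)
{
    if (p == NULL)
        return STMR_NULL_OFF;
    const unsigned char *q = (const unsigned char *)p;
    assert(q >= r->base && q < r->base + r->size);
    return (uint64_t)(q - r->base);
}

/* Convert a stored offset back into a pointer valid on this device. */
static void *stmr_to_pointer(const stmr_replica *r, uint64_t off)
{
    if (off == STMR_NULL_OFF)
        return NULL;
    assert(off < r->size);
    return r->base + off;
}

/* A linked-list node stored inside the STMR: the link is an offset,
 * so the node has the same meaning in every replica. */
typedef struct {
    int64_t  value;
    uint64_t next_off;
} stmr_node;
```

A node written by the CPU with next_off = stmr_to_offset(&cpu_replica, next) can then be followed on another device with stmr_to_pointer(&gpu_replica, node->next_off), regardless of where each replica is mapped.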
A second relevant observation is that architectural differences
of CPUs and (discrete) GPUs have a great impact on their
programming models and, as such, HeTM systems should
take these aspects into account to attain high efficiency. One
key issue is that, differently from CPUs, where transactions
are typically executed individually, in GPUs it is desirable to
execute transactions in relatively large batches [50], [7], as this
allows for: i) amortizing the latency of transactions’ activation;
ii) enhancing throughput when transferring to/from the GPU
the inputs/output required/produced by transactions’ execution;
iii) improving resource utilization on modern GPUs.
To reconcile these differences, SHeTM abstracts over the
computational model of CPU and GPU via a thread pool model
in which each device exposes a number of worker threads.
Worker threads are the only entities that can directly access
the STMR, i.e., application threads that need to manipulate
(or access) STMR should do so by submitting transactional
requests to the worker threads (via SHeTM’s API).
SHeTM views each instance of a transaction as an abstract
operation that consumes an input and produces an output.
SHeTM is opaque to the structure of transactions’ inputs and
outputs, requiring only information on their size in order to
correctly transfer transactional requests/responses to/from the
worker threads. In order to support the efficient execution
of transactions on both CPUs and GPUs, SHeTM allows
programmers to associate each transaction with (at least one
of) two implementations: i) a “transactional function”, which
is meant to execute on the CPU and processes exactly one
transaction; ii) a “transactional kernel”, which is meant to
execute on the GPU and processes a batch of transactions of
arbitrary (but statically defined) size.
Developers of transactional kernels have the responsibility
to control which and how many threads to activate, how
many transactions each thread should execute, as well as how
transactional inputs should be consumed. SHeTM, in turn, is
responsible for activating transactional kernels, shipping to the
GPU the corresponding transactions’ inputs and retrieving the
transactions’ result to the host once the kernel ends.
Programmers are not obliged to provide two implementations
for a given transaction. If they do so, though, this provides
SHeTM with the flexibility to select the implementation/device
to use for executing a given transaction instance in a dynamic
fashion, using a work-stealing policy that aims to balance load
on both CPU and GPU.
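To illustrate the pairing of a transactional function with a transactional kernel, the following sketch describes a transaction by its input and output sizes and by up to two implementations. All names (txn_desc, txn_function, txn_kernel, incr_fn, incr_kernel) are hypothetical; the kernel is written as plain C that loops over the batch, standing in for what would be a CUDA kernel on the GPU, and calls into the guest TM library are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CPU implementation: processes exactly one transaction. */
typedef void (*txn_function)(void *stmr, const void *in, void *out);
/* GPU implementation: processes a batch of transactions. */
typedef void (*txn_kernel)(void *stmr, const void *in, void *out,
                           size_t batch);

/* Hypothetical transaction descriptor: only the sizes of inputs and
 * outputs are known to the runtime, not their structure. */
typedef struct {
    size_t       in_size;
    size_t       out_size;
    txn_function cpu_fn;        /* may be NULL */
    txn_kernel   gpu_kernel;    /* may be NULL */
    size_t       kernel_batch;  /* statically defined batch size */
} txn_desc;

/* At least one implementation is required; a kernel needs a batch size. */
static bool txn_desc_valid(const txn_desc *d)
{
    if (d->cpu_fn == NULL && d->gpu_kernel == NULL)
        return false;
    if (d->gpu_kernel != NULL && d->kernel_batch == 0)
        return false;
    return true;
}

/* Example transaction: add 'delta' to word 'index', return new value. */
typedef struct {
    uint32_t index;
    int32_t  delta;
} incr_in;

static void incr_fn(void *stmr, const void *in, void *out)
{
    int32_t *mem = stmr;
    const incr_in *req = in;
    mem[req->index] += req->delta;
    *(int32_t *)out = mem[req->index];
}

/* Stand-in for a GPU kernel: on a real GPU each request would be
 * handled by a GPU thread; here the batch is processed in a loop. */
static void incr_kernel(void *stmr, const void *in, void *out, size_t batch)
{
    const incr_in *reqs = in;
    int32_t *outs = out;
    assert(reqs != NULL && outs != NULL);
    for (size_t i = 0; i < batch; i++)
        incr_fn(stmr, &reqs[i], &outs[i]);
}
```

Registering both implementations is what gives a runtime the freedom to choose the device per request; with a single implementation the device is fixed.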
Transaction scheduling and dispatching. For each registered
transaction, SHeTM allocates a number of request queues. The
number of queues that SHeTM allocates for a given transaction
depends on the number of implementations that were registered
for it. If a single implementation was defined (either for CPU
or for GPU), only a single queue is created, which is used to
store all the requests for that transaction. If implementations
for both the CPU and GPU are provided, instead, SHeTM
allocates three request queues, denoted CPUQ, GPUQ and
SHAREDQ. As their names suggest, the first two queues are
meant to buffer requests which were submitted for execution
on the CPU and GPU, respectively. This indication is passed
to SHeTM via the programming interface used to support the
submission of transactional requests, through which an optional
device-affinity parameter can be specified.
This mechanism allows SHeTM to exploit external knowl-
edge, e.g., provided by programmers or automatic tools (e.g.,
static code analysis [47] or on-line scheduling techniques [11],
[12]), on the conflict patterns between different transaction
instances and mitigate inter-device contention by dispatching
conflict-prone transactions to the same device (where contention
can be detected and managed more efficiently).
If, upon submission of a request, no device-affinity is
indicated (and both CPU and GPU implementations exist for
the corresponding transaction), then the request is routed to
the shared queue, which is accessible by both devices on the
basis of a work-stealing policy.
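The routing rule described above can be summarized in a few lines of C. This is a minimal sketch: the enumerations affinity_t and queue_id, the structure txn_type and the function route_request are hypothetical names, and the single queue created for a transaction with only one implementation is modelled here as that device's queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Optional device-affinity hint given when a request is submitted. */
typedef enum { AFFINITY_NONE, AFFINITY_CPU, AFFINITY_GPU } affinity_t;

/* Destination queue for a request. */
typedef enum { Q_CPU, Q_GPU, Q_SHARED } queue_id;

/* Which implementations were registered for a transaction. */
typedef struct {
    bool has_cpu_impl;
    bool has_gpu_impl;
} txn_type;

/* Select the queue for a new request of type 't'. */
static queue_id route_request(const txn_type *t, affinity_t hint)
{
    assert(t->has_cpu_impl || t->has_gpu_impl);
    /* Single implementation: the only queue belongs to that device. */
    if (!t->has_gpu_impl)
        return Q_CPU;
    if (!t->has_cpu_impl)
        return Q_GPU;
    /* Both implementations: honor the hint, else use the shared queue. */
    if (hint == AFFINITY_CPU)
        return Q_CPU;
    if (hint == AFFINITY_GPU)
        return Q_GPU;
    return Q_SHARED;
}
```

A conflict-aware policy would compute the hint so that requests likely to touch the same data receive the same affinity; how that prediction is made is outside the scope of this sketch.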
Note that the enqueued requests are consumed at different
granularity by the CPU and GPU. CPU worker threads process
requests individually, extracting them from the CPUQ queues
in a round-robin fashion, or, if all CPUQ queues are found
empty, from the shared queue. The processing of requests from
the GPUQ, and the activation of the corresponding transac-
tional kernel is coordinated by a management thread, which we
call GPU-controller, running on the CPU. This thread monitors
the GPUQ queues of the various transactions registered in
SHeTM and activates the corresponding transactional kernel
when any of these queues contains a sufficient number of
requests to feed the kernel.
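The different consumption granularity of the two devices can be sketched as follows, tracking only queue lengths. The structure txn_queues and the functions cpu_pick and gpu_controller_poll are hypothetical; synchronization between threads is omitted, and the sketch assumes that a kernel is fed exactly kernel_batch requests from its GPUQ.

```c
#include <assert.h>
#include <stddef.h>

/* Pending requests for one registered transaction type. */
typedef struct {
    size_t cpuq_len;      /* requests with CPU affinity */
    size_t gpuq_len;      /* requests with GPU affinity */
    size_t shared_len;    /* requests without affinity */
    size_t kernel_batch;  /* requests needed to feed the kernel */
} txn_queues;

/* Result of a CPU worker's poll: type index (-1 if none) and source. */
typedef struct {
    int type;
    int from_shared;
} pick_t;

/* CPU worker: take one request, scanning the CPUQs round-robin from
 * *cursor and falling back to the shared queues if all are empty. */
static pick_t cpu_pick(txn_queues *q, size_t n, size_t *cursor)
{
    assert(q != NULL && cursor != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t t = (*cursor + i) % n;
        if (q[t].cpuq_len > 0) {
            q[t].cpuq_len--;
            *cursor = (t + 1) % n;
            return (pick_t){ (int)t, 0 };
        }
    }
    for (size_t t = 0; t < n; t++) {
        if (q[t].shared_len > 0) {
            q[t].shared_len--;
            return (pick_t){ (int)t, 1 };
        }
    }
    return (pick_t){ -1, 0 };
}

/* GPU-controller: return the index of a type whose GPUQ holds enough
 * requests to feed its kernel (consuming one batch), or -1. */
static int gpu_controller_poll(txn_queues *q, size_t n)
{
    for (size_t t = 0; t < n; t++) {
        if (q[t].kernel_batch > 0 && q[t].gpuq_len >= q[t].kernel_batch) {
            q[t].gpuq_len -= q[t].kernel_batch;
            return (int)t;
        }
    }
    return -1;
}
```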
B. Integration with guest TM libraries
To guarantee the HeTM’s consistency semantics described in
Section III, SHeTM assumes that the underlying TMs ensure
opacity (or, more generically, any TM consistency criterion
that guarantees the properties P1 and P2 defined in Section III).
SHeTM abstracts over the internal logic of the guest TM
libraries and interfaces with them by exposing a simple
callback function that the guest TM should invoke, whenever a
transaction commits. In a nutshell, the callback allows SHeTM
to detect inter-device transaction conflicts and keep track of
the STMR updates produced at each device. The information
communicated, via the callback function, by the guest TM to
SHeTM differs for the CPU and GPU side.
CPU instrumentation. On the CPU side, upon commit
of a transaction, a guest TM library must provide as input
to SHeTM’s callback function an array containing the
<address, value, timestamp> triple of each memory position
updated by that transaction. The specified timestamp must
be usable by HeTM to totally order the updates to that
memory position and is easily provided both by software and
hardware TM implementations. For instance, most software
TM implementations, e.g., TinySTM [15] or NoREC [8],
use a logical timestamp to totally order the commits of all
transactions. With hardware TM implementations, such as
Intel TSX, the processor cycles (e.g., read via the RDTSCP()
instruction) can be used to determine the total order.
Gathering transactions’ write-sets imposes no additional
overhead to a guest STM, as STMs need anyway to track
the write-sets in software. For HTM, SHeTM requires the
software instrumentation of write operations to gather the
transaction’s write-set. However, in many realistic workloads,
writes are largely outnumbered by reads and, as such, the
resulting instrumentation overhead is small.
The HeTM’s callback function appends the write-sets to
thread-local data structures, referred to herein as the CPU
write-set logs, and periodically offloads them to the GPU to
perform inter-device conflict detection.
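As an illustration of the CPU-side instrumentation, the following minimal sketch shows how a commit callback could append write-sets to a thread-local log. The names `WriteEntry`, `cpu_commit_callback` and `drain_log` are hypothetical; the actual SHeTM callback signature is not given in the text.

```python
import threading
from typing import List, NamedTuple


class WriteEntry(NamedTuple):
    address: int    # position inside the STMR updated by the transaction
    value: int      # value written at commit time
    timestamp: int  # commit timestamp supplied by the guest TM


# One log per worker thread, so appends need no synchronization.
_tls = threading.local()


def cpu_commit_callback(write_set: List[WriteEntry]) -> None:
    """Hypothetical callback invoked by the CPU guest TM upon commit."""
    log = getattr(_tls, "log", None)
    if log is None:
        log = _tls.log = []
    log.extend(write_set)


def drain_log() -> List[WriteEntry]:
    """Return the calling thread's log for offloading and start a new one."""
    log = getattr(_tls, "log", [])
    _tls.log = []
    return log
```

Keeping the log thread-local mirrors the design choice stated above: each worker accumulates its own entries and the logs are only handed over when they are offloaded to the GPU.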
GPU instrumentation. On the GPU side, a guest TM library
must communicate to the HeTM’s callback the set of addresses
read and written by the committing transaction. In the CPU case,
the write-sets are accumulated in the per-thread logs until
inter-device synchronization is activated. Conversely, on the GPU
side the read-set and write-set of a committing transaction
are used to update two bitmaps, noted $RS^{GPU}_{BMP}$ and $WS^{GPU}_{BMP}$,
that track the regions of the STMR that GPU transactions
read or wrote, respectively. After those bitmaps are updated,
the transaction’s read-set and write-set can be immediately
discarded. The necessity for this asymmetric instrumentation
logic at the CPU and GPU is further detailed in Section IV-C.
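The bitmap update can be sketched as follows. This is a sequential illustration only: on the GPU the bitmaps are updated concurrently by many threads. The granule size and the function names are assumptions. Marking writes in the read-set bitmap as well follows the approach described later in Section IV-C.

```python
GRANULE_BYTES = 1024  # assumed tracking granularity (bytes per bitmap entry)


def make_bitmap(stmr_bytes: int, granule: int = GRANULE_BYTES) -> bytearray:
    """Allocate one entry per granule of the STMR, all initially clear."""
    return bytearray((stmr_bytes + granule - 1) // granule)


def gpu_commit_callback(rs_bmp: bytearray, ws_bmp: bytearray,
                        read_addrs, write_addrs,
                        granule: int = GRANULE_BYTES) -> None:
    """Hypothetical callback invoked by the GPU guest TM upon commit."""
    for addr in read_addrs:
        rs_bmp[addr // granule] = 1
    for addr in write_addrs:
        ws_bmp[addr // granule] = 1
        # Writes are also recorded as reads, so that the write-set is
        # always contained in the read-set.
        rs_bmp[addr // granule] = 1
```

Once the bitmaps are updated, the per-transaction address lists are no longer needed, which is why the read-set and write-set can be discarded immediately.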
Additional assumptions. SHeTM needs to manipulate the state
of the STMR to merge the updates produced at both devices and
to cancel the effects of speculatively committed transactions
in case inter-device conflicts are detected. These updates are
performed in a non-transactional way, i.e., bypassing the APIs
of the guest TM library — ensuring that no transaction is
executing concurrently, to preserve consistency. This design
is safe under the assumption that any meta-data managed by
the guest TM libraries are maintained externally to the STMR,
e.g., in a disjoint memory region on the local device. This
assumption is met in practice by most TM implementations,
being valid for all existing HTM implementations (which
maintain their metadata in the processor’s caches) and for
all word-based STMs (where TM metadata must necessarily
be stored in a disjoint memory region to avoid interfering with
the application’s memory layout).
Supported libraries. Currently, SHeTM supports three TM
implementations: two on the CPU side – TinySTM [15] and
Intel’s TSX [29], implemented respectively in software and
hardware – and one on the GPU side, namely PR-STM [50].
C. Basic Algorithm
We start by describing a basic variant of the SHeTM’s
algorithm that serves a twofold purpose: i) it allows us to
simplify presentation, by explaining the design of SHeTM
in an incremental fashion; and ii) it exposes several sources
of inefficiency that we address in the following text. At this
stage, we will assume a fixed policy to deal with inter-device
contention that deterministically discards the transactions spec-
ulatively committed by the GPU. We discuss how to relax this
assumption and support policies that discard the transactions
speculatively committed by the CPU in Section IV-E.
SHeTM orchestrates the execution of GPU and CPU in
synchronization rounds, where each round is composed of
three phases: execution, validation, and merge (see Figure 1).
1) Execution Phase: In the execution phase, transactions are
extracted from the input queues and fed to the devices during
a user-tunable period. Transaction processing is executed in
an independent way at both devices, starting from a consistent
snapshot, i.e., an identical replica of the STMR at both
devices, and executing transactions in a speculative fashion:
the execution of transactions is regulated exclusively by the
local TM library, which only detects conflicts between local
transactions, avoiding any inter-device synchronization.
When a transaction requests to commit, the unmodified
commit logic of the local TM library is used to atomically
propagate the transaction’s update to the local STMR replica.
This local commit event coincides with the speculative commit,
in the execution model assumed by the HeTM abstraction.
At this point the TM library invokes the callback functions
exposed by SHeTM, as described in Section IV-B. On the CPU,
the write-set of the transaction is appended to a per-thread
log. On the GPU side, the transaction’s read-set and write-set
are used to update the $RS^{GPU}_{BMP}$ and $WS^{GPU}_{BMP}$ bitmaps. The
bitmaps encode the set of addresses read and written by every
transaction that speculatively committed during the execution
phase, and they are updated concurrently by the GPU threads
that are in charge of executing transactions.
As mentioned, the duration of the execution phase is a
user-tunable parameter that exposes an interesting performance
trade-off, which will be studied in Section V-D.
Longer periods imply less frequent synchronizations, which
means lower overhead in case the synchronization is successful.
However, longer execution periods also mean that a larger
number of transactions are speculatively executed at both
devices, increasing the probability of inter-device contention
and thus the amount of wasted work (aborted transactions).
Figure 1. A basic variant of SHeTM (the left diagram refers to the commit case, WSCPU ∩ RSGPU = ∅, and the right diagram to the abort case, WSCPU ∩ RSGPU ≠ ∅). Each diagram shows the execution, validation, and merge phases.
Figure 2. Illustrating the behavior of SHeTM (the left diagram refers to the commit case, WSCPU ∩ RSGPU = ∅, and the right diagram to the abort case, WSCPU ∩ RSGPU ≠ ∅). Each diagram shows the execution phase with early validation, followed by the validation and merge phases.
2) Validation Phase: The goal of this stage is to determine
whether there was any conflict between the transactions
processed by the CPU and the GPU during the execution
phase. We designed the logic of this phase on the basis of the
following key observations:
• As the local TM libraries are assumed to ensure opacity, the
behavior of the speculatively committed transactions at each
device is already guaranteed to be equivalent to a sequential
execution (although defined over different sets of transactions).
The set S of transactions speculatively committed during the
execution phase at a device D ∈ {CPU,GPU} can thus be
logically subsumed by a single, equivalent transaction, noted
TD, whose read-set and write-set are the unions of the read-sets
and write-sets of the transactions in S. This observation allows
us to reduce the problem of inter-device conflict detection
among a number of speculatively committed transactions to
the problem of detecting conflicts between a pair of logically
equivalent transactions, TCPU and TGPU .
• Detecting conflicts between a pair of transactions can be
reduced to verifying intersections between their read-sets
and write-sets [42], [3]. This computation can be efficiently
parallelized using GPUs, especially if the sets are large, as is
the case for TCPU and TGPU , which subsume a number
of transactions.
The design of SHeTM’s validation scheme, as well
as its instrumentation logic, was engineered, based on this
observation, so as to take full advantage of modern GPUs.
• In many realistic workloads, transactions read a much larger
number of memory positions than they write to [16]. As such,
the read-sets of TCPU and TGPU are likely to be much
larger than their corresponding write-sets. Motivated by this
observation, we designed the validation scheme in a way to
avoid transmitting the read-sets over the inter-connection bus.
Note that there are two possible orders in which TCPU
and TGPU may be serialized, namely TGPU → TCPU or
TCPU → TGPU . For the former order to be valid, none
of the writes generated by TGPU should be “missed” by
TCPU , i.e., WSGPU ∩ RSCPU = ∅ (where WSGPU and
RSCPU denote the write-set and read-set of TGPU and TCPU ,
respectively). The latter order, conversely, requires verifying
whether WSCPU ∩RSGPU = ∅.
Taking into account that we intend to leverage the GPU to
perform validation, SHeTM opts for testing whether the CPU
transactions can be serialized before the GPU transactions,
i.e., TCPU → TGPU . In fact, the computation of WSCPU ∩
RSGPU = ∅ can be performed on the GPU side by shipping
only the write-sets of the CPU transactions — whereas, the
opposite serialization order (TGPU → TCPU ) would require
shipping to the GPU the read-sets of the CPU transactions to
verify whether WSGPU ∩RSCPU = ∅.
• In order to guarantee that the updates of TCPU and TGPU
can be mutually exchanged and applied at each device, yielding
a state equivalent to the one produced by the schedule TCPU →
TGPU , it is theoretically necessary to also exclude the presence
of write-write conflicts, i.e., to verify whether WSCPU ∩WSGPU = ∅.
This check, though, is necessary only if one assumes that
transactions can issue “blind writes”, i.e., writes to a memory
position that the transaction has not read. If blind writes can
be excluded, the write-set of TGPU is guaranteed to be a subset
of RSGPU (WSGPU ⊆ RSGPU ). As such, verifying that
WSCPU ∩RSGPU = ∅ implies that WSCPU ∩WSGPU = ∅.
In transactional systems blind writes are considered quite
rare [3], [53]. Nonetheless, in SHeTM, we take a safe approach,
whose correctness does not hinge on the absence of blind writes,
but still allows for sparing the cost of detecting write-write
conflicts: we guarantee that WSGPU ⊆ RSGPU by tracking
the writes issued by GPU transactions, not only in the write-set
bitmap, but also in the read-set bitmap. Given that writes
are typically outnumbered by reads, the overhead incurred by
tracking the writes in two bitmaps is expected to be low.
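The reason why a single read-write check suffices once GPU writes are also tracked as reads can be written out explicitly. The following is a short derivation using only the sets defined above:

```latex
\[
WS_{GPU} \subseteq RS_{GPU}
\;\Longrightarrow\;
WS_{CPU} \cap WS_{GPU} \;\subseteq\; WS_{CPU} \cap RS_{GPU},
\]
\[
\text{hence}\qquad
WS_{CPU} \cap RS_{GPU} = \emptyset
\;\Longrightarrow\;
WS_{CPU} \cap WS_{GPU} = \emptyset .
\]
```

In other words, any write-write conflict would also show up as a hit in the GPU's read-set bitmap, so no separate write-write test is required.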
Let us put all these pieces together and discuss how the
validation phase operates in a systematic fashion.
The validation phase starts by transferring the write-set logs
gathered by each CPU thread to the GPU. The logs are streamed
in chunks to achieve high throughput and activate validation
kernels that operate at sufficient granularity to achieve high
utilization of GPU resources. A validation kernel on the GPU
takes as input a chunk of a log and operates as follows: For
each tuple < address, value, timestamp > in the input log,
a different thread checks whether the corresponding entry in
the GPU’s read-set bitmap is set — which indicates that some
of the transactions speculatively committed by the GPU during
the execution phase read that address.
If the CPU write is found to have invalidated the read-set
of TGPU the validation phase returns a negative outcome, but
it continues applying the full set of write-set logs sent by the
CPU. This ensures that, at the end of the validation phase, the
GPU’s STMR incorporates all the effects of TCPU . Thus, if
in the merge phase, the state of the GPU’s STMR needs to
be re-aligned to the current state of the CPU’s STMR, it is
sufficient to simply undo the effects of the TGPU .
If the CPU write does not invalidate the read-set of TGPU ,
the corresponding value stored in the CPU write-set log
is applied to the GPU’s STMR. This is performed non-
transactionally, since the GPU is not processing transactions
during the validation phase. However, since the CPU logs are
validated in arbitrary order on the GPU side, in the apply phase
it is necessary to verify that the version currently present in the
GPU’s STMR is not fresher than the one being applied. To
this end, on the GPU, SHeTM maintains a timestamp array,
denoted as TS, which has an entry per word of the STMR
reserved to store the timestamps of the CPU writes applied
during the validation phase. During validation, GPU threads
consult the TS to determine whether the write being validated
reflects a more recent state than the one already present in
the STMR, executing the apply phase only if this is the case.
Note that since concurrent GPU threads may be validating
writes targeting the same address, the atomicity of the test
for freshness and the value application is ensured via a lock
implemented using the first bit of the corresponding TS entry.
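The per-entry logic of a validation kernel can be emulated sequentially as in the sketch below. It is an illustration under stated assumptions: the function name, the word size and the granule size are hypothetical, timestamps are assumed to be positive, and the per-entry lock bit is omitted because the emulation is single-threaded.

```python
def validate_and_apply(log_chunk, rs_bmp, stmr, ts,
                       granule: int = 1024, word: int = 4) -> bool:
    """Sequentially emulate one validation kernel over a chunk of CPU log.

    log_chunk : iterable of (address, value, timestamp) tuples
    rs_bmp    : GPU read-set bitmap, one entry per `granule` bytes
    stmr      : GPU replica of the STMR, one entry per `word` bytes
    ts        : timestamp array, one entry per word of the STMR
    Returns True if no CPU write touches a region read by the GPU.
    """
    ok = True
    for address, value, timestamp in log_chunk:
        if rs_bmp[address // granule]:
            # Conflict detected: remember the negative outcome, but keep
            # applying so that the GPU replica still absorbs all CPU writes.
            ok = False
        idx = address // word
        # Apply only if this write is fresher than what was applied before.
        # On the GPU, this test-and-update is made atomic via a lock bit.
        if timestamp > ts[idx]:
            ts[idx] = timestamp
            stmr[idx] = value
    return ok
```

Because entries may arrive in any order, the timestamp comparison is what makes the final content of the replica independent of the order in which log chunks are processed.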
3) Merge Phase: The merge phase ensures that the replicas
of the STMR at the CPU and at the GPU are consistent, before
starting the execution phase of the next synchronization round.
The way in which the states of the CPU and GPU are realigned
depends on the outcome of the validation phase.
If the validation phase is successful, i.e., no inter-device
conflicts are detected, the GPU’s replica of the STMR already
incorporates the updates of the transactions that speculatively
committed at both devices (recall that, during the validation
phase, the GPU also applies the CPU write-sets into its local
STMR replica). It thus only remains to propagate the GPU’s
updates to the CPU. To this end, the GPU-controller thread fetches
the GPU’s write-set bitmap, which identifies the memory
regions updated by the transactions speculatively committed
by the GPU, and activates the memory transfers to update
in-place the STMR’s replica on the CPU. In order to attain high
throughput in these device to host memory transfers, the write-
set bitmap tracks the updates on the GPU STMR via chunks of
coarse granularity, i.e., 16KB, a value that we experimentally
found to provide robust performance. To further reduce the
number of memory transfers, the GPU-controller groups the
updated chunks that are stored consecutively in STMR and
copies each group using a single device to host transfer.
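Grouping consecutive updated chunks amounts to finding runs of set entries in the write-set bitmap. A minimal sketch follows; the function name is hypothetical and each returned range stands for one device to host transfer.

```python
def coalesce_chunks(ws_bmp, chunk_bytes: int = 16 * 1024):
    """Return (offset, length) byte ranges, one per run of dirty chunks."""
    ranges = []
    run_start = None
    for i, dirty in enumerate(ws_bmp):
        if dirty and run_start is None:
            run_start = i                      # a new run begins here
        elif not dirty and run_start is not None:
            ranges.append((run_start * chunk_bytes,
                           (i - run_start) * chunk_bytes))
            run_start = None                   # the run ended at chunk i-1
    if run_start is not None:                  # run reaching the last chunk
        ranges.append((run_start * chunk_bytes,
                       (len(ws_bmp) - run_start) * chunk_bytes))
    return ranges
```

For example, with 16-byte chunks the bitmap `[0, 1, 1, 0, 1]` yields two transfers, `(16, 32)` and `(64, 16)`, instead of three.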
If the validation phase fails, the state of the GPU’s STMR
is realigned to the state on the CPU side. To this end, the
GPU-controller obtains the GPU’s write-set bitmap. This time,
though, it transfers the CPU’s state over the chunks marked
as updated on the GPU’s write-set bitmap, thus undoing any
side-effect of the execution of transactions on the GPU side.
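Both outcomes of the merge phase copy the same set of chunks and differ only in the direction of the copy. The sketch below illustrates this with host-side byte arrays standing in for the two STMR replicas; the function name and the in-memory copies (in place of real device transfers) are assumptions.

```python
def merge(cpu_stmr: bytearray, gpu_stmr: bytearray, ws_bmp,
          validation_ok: bool, chunk: int = 16 * 1024) -> None:
    """Realign the two STMR replicas over the chunks written by the GPU."""
    for i, dirty in enumerate(ws_bmp):
        if not dirty:
            continue
        lo = i * chunk
        hi = min(lo + chunk, len(cpu_stmr))
        if validation_ok:
            # Commit case: propagate the GPU's updates to the CPU replica.
            cpu_stmr[lo:hi] = gpu_stmr[lo:hi]
        else:
            # Abort case: overwrite the GPU's updates with the CPU state.
            gpu_stmr[lo:hi] = cpu_stmr[lo:hi]
```

In the abort case this is sufficient because the GPU replica already received all CPU writes during validation, so only the chunks touched by GPU transactions can differ.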
D. Optimizations
SHeTM integrates a number of additional mechanisms that
aim at tackling two main sources of inefficiency: the blocking
time (i.e., the period during which transaction processing is
blocked) due to inter-device synchronization, and the overhead
imposed in case of inter-device contention. We present these
techniques in the following text and illustrate them in Figure 2.
Inter-device synchronization. As illustrated in Figure 1, in
the basic algorithm presented in Section IV-C, transaction
processing is blocked throughout the validation and merge
phases both at the CPU and at the GPU. This is clearly
undesirable for efficiency reasons, especially if one considers
that, to reduce the likelihood of inter-device contention, it is
desirable to use relatively short execution phases.
SHeTM tackles this issue by integrating mechanisms aimed
at reducing the blocking time both at the CPU and at the GPU.
On the CPU side, during the validation phase, SHeTM
allows the worker threads to continue processing transactions
concurrently with the streaming of the logs accumulated during
the execution phase. CPU execution blocks only
when very few log chunks are left to be offloaded to the GPU.
In practical settings, the speed at which logs can be transferred
is higher than that at which new logs can be produced by
the worker threads. Thus, this mechanism effectively overlaps
transaction processing at the CPU side with the log transfers
to the GPU, while generating a relatively small amount of
additional logs for the GPU to validate.
On the GPU side, at the end of the merge phase, the basic
algorithm blocks transaction processing while transferring to
the CPU the memory regions updated by the GPU. This is done
to ensure that the state of the GPU’s STMR is not corrupted
due to the execution of transactions while the device to host
transfer is ongoing. SHeTM tackles this problem by employing
a double buffering approach. At the start of the merge phase,
a shadow copy of the state of the GPU’s STMR is created,
via a device to device copy. This way, as soon as the shadow
copy is created, GPU transaction processing can immediately
resume, since the shadow copy is isolated from the updates
of transactions (which operate exclusively on the STMR) and
can be used to feed the device to host transfer.
Inter-device contention. As discussed in Section IV-A,
SHeTM’s API allows exploiting external knowledge on trans-
actions’ conflict patterns (via the device affinity specified at
transactions’ submission time) to control the dispatching of
transactions and reduce inter-device contention. Besides striving
to reduce the likelihood of inter-device contention, SHeTM
incorporates two additional mechanisms that aim at reducing
two sources of overhead when inter-device conflicts do occur:
• Wasted work. In the basic algorithm, conflicts are detected
only at the end of the execution phase. This can waste
large amounts of work at the GPU if a conflict is detected in
the validation phase. We tackle this problem by introducing an
early validation scheme that periodically transfers the CPU’s
logs to the GPU, where they are validated (but not applied)
while transactions are concurrently processed on both devices.
As early validations are concurrent with transaction processing,
it is still necessary to validate all the write-set logs produced
during the execution phase in the validation phase. Yet, by
anticipating the detection of inter-device conflict, as we will
see in Section V, early-validation can provide significant gains
in contention prone workloads by reducing the time the GPU
spends performing computations that are eventually discarded.
• Rollback latency. Realigning the GPU’s state to that of the
CPU, in case of inter-device contention, imposes significant
overhead in the basic algorithm. Every memory region updated
by the GPU has to be copied from the CPU and, during
this transfer, transaction processing is blocked at both devices.
Fortunately, the availability of the shadow copy is of great
help in this case. Recall that the shadow copy reflects a
consistent state of the STMR, as at the beginning of the current
synchronization round. Thus, in order to align the shadow copy
to the current state of the CPU, it suffices to apply to it the
CPU’s write-set logs.
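The rollback through the shadow copy can be illustrated as follows. The sketch assumes the shadow copy is a word array and that log entries are `(address, value, timestamp)` tuples; the function name and the word size are hypothetical.

```python
def rollback_via_shadow(shadow, cpu_logs, word: int = 4):
    """Rebuild the CPU's current state from the round-start shadow copy.

    The shadow copy holds the STMR as it was at the beginning of the
    round. Replaying the CPU write-set logs in timestamp order on top of
    it yields the CPU's current state, with no host to device copy of the
    regions that the GPU had updated.
    """
    restored = list(shadow)
    for address, value, _ in sorted(cpu_logs, key=lambda entry: entry[2]):
        restored[address // word] = value  # later timestamps overwrite earlier
    return restored
```

The cost of this approach scales with the size of the CPU logs rather than with the amount of memory written by the GPU, which is what makes it attractive when the GPU updated many regions.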
E. Additional Conflict Resolution Policies
The solution presented so far assumes a fixed conflict resolution
policy that deterministically aborts the speculatively committed
transactions on the GPU side in case of inter-device conflict.
Note that this property has the advantage of ensuring that the
effects of speculatively committed transactions on the CPU side
can be immediately considered as committed. Thus their results
can be externalized to application-level results without incurring
the latency of inter-device synchronization. This is a desirable
property in practice, as arguably the CPU is the preferred
device to execute latency sensitive transactions (considering
that the processing of GPU transactions is burdened by the
latency of kernel activation and result transfer).
Nonetheless, supporting alternative conflict resolution poli-
cies, i.e., that abort the transactions speculatively committed
by the CPU, may be useful, e.g., to avoid starving the GPU
or to favor the device that committed more transactions.
Extending the scheme described so far to favor the GPU
and discard the effects of the CPU transaction is, from an
algorithmic perspective, quite straightforward. During the
validation phase, applying the CPU write-sets is only done
conditionally to the successful outcome of the entire validation
procedure. If a conflict is detected, the apply phase is skipped
and a negative outcome is returned to the GPU-controller
thread. Using a technique analogous to the one presented in
Section IV-D, a shadow copy of the CPU’s STMR (created at
the beginning of the current execution phase) can be used to
undo the effects of the speculatively committed transactions
(on the CPU side). Finally, the memory regions updated by
the GPU can be applied (in chunks) to the CPU’s STMR.
On the CPU side, a natural and lightweight way to create
a shadow copy of the STMR consists in forking the process
that hosts the worker threads. This approach, widely used
to implement efficient checkpointing schemes [35], [6], [37],
makes it possible to exploit the efficient Copy-On-Write mechanism
of the OS and to avoid synchronous memory copies. The current
prototype of SHeTM mainly supports a contention management
mechanism that avoids starvation of the GPU: if the GPU is
aborted more than a predetermined number of times, it only
allows read-only transactions to execute on the CPU-side in
the subsequent execution phase, postponing the execution of
enqueued update transactions until the next round. It is easy
to observe that the absence of update transactions at the CPU
guarantees the successful validation in the next round.
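The starvation-avoidance policy can be sketched as a small state machine. The text does not say whether the abort counter is cumulative or counts consecutive aborts; the sketch below assumes consecutive aborts, and the class and method names are hypothetical.

```python
class StarvationGuard:
    """Hypothetical contention manager that protects the GPU from starvation."""

    def __init__(self, max_gpu_aborts: int) -> None:
        self.max_gpu_aborts = max_gpu_aborts
        self.consecutive_aborts = 0

    def end_of_round(self, gpu_validation_ok: bool) -> None:
        """Record the validation outcome of the round that just finished."""
        if gpu_validation_ok:
            self.consecutive_aborts = 0   # assumption: a success resets the count
        else:
            self.consecutive_aborts += 1

    def cpu_read_only_next_round(self) -> bool:
        """True if the CPU must postpone update transactions next round."""
        return self.consecutive_aborts > self.max_gpu_aborts
```

When `cpu_read_only_next_round()` returns true, the CPU produces an empty write-set in the next round, so the intersection with the GPU's read-set is necessarily empty and validation succeeds.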
V. EVALUATION
This section presents an experimental study that aims at
answering the following key questions:
• What are the costs imposed by the instrumentation of
guest TM libraries? (Sec. V-A)
• What overheads does SHeTM introduce in workloads
whose scalability is not limited by inter-device contention?
(Sec. V-B)
• How sensitive is the performance of SHeTM to inter-
device contention? (Sec. V-C)
• How large are the gains that SHeTM’s optimizations
enable over simpler designs? (Sec. V-B and Sec. V-C)
• How effective is SHeTM with realistic applications?
(Sec. V-D)
Our evaluation is conducted using an Intel Xeon E5-2648L
v4 CPU equipped with an Nvidia GTX 1080 discrete-GPU.
This CPU supports Intel’s HTM implementation, called TSX.
The operating system is Ubuntu 16.04.3 LTS (kernel 4.4.0-
57). The Nvidia driver’s version is 387.34 and the CUDA
framework’s version is 9.1.
We based our evaluation on a set of synthetic benchmarks
conceived to assess different aspects of SHeTM’s design and
on MemcachedGPU [25].
In all the tests, we use 8 worker threads on the CPU side.
As for the transactional kernels, we tuned their configuration
(number of transactions per kernel activation, active threads
and thread blocks) on the basis of preliminary evaluations to
maximize the throughput of the device. The synthetic workloads
use the same transactional logic on both the CPU and GPU and
operate on a STMR of size 600MB, unless otherwise specified.
Figure 3. Cost of instrumentation of guest TM libraries. (x-axis: Write Transactions (%); y-axis: Throughput (normalized). GPU plot: SHeTM with PR-STM, large and small bitmap, W1 and W2. CPU plot: SHeTM with TSX and with TinySTM, W1 and W2.)
A. Instrumentation Costs
Let us start by assessing the overhead induced by the software
instrumentation that SHeTM requires for its guest TM libraries.
To this end we consider two workloads, noted W1 and W2,
that access the STMR uniformly at random. In W1, read-only
transactions issue 4 reads, whereas update transactions read and
update 4 memory positions. W2 is identical to W1, except that
both transaction types issue 40 reads instead of 4. W1 is designed
to stress the instrumentation of read and write operations. W2
is selected as representative of many realistic workloads, in
which reads outnumber the writes.
In the plots in Figure 3 we vary on the x-axis the percentage
of update transactions from 10% to 90% and report on
the y-axis the throughput normalized w.r.t. the un-instrumented
versions of PR-STM, for the GPU (left plot), and of TinySTM
and TSX, for the CPU (right plot).
In the left plot (GPU), we consider using two different
tracking granularities for the read-set bitmap ($RS^{GPU}_{BMP}$), namely
4 bytes and 1KB. We can see that, independently of the
considered workload, the use of the small granularity bitmap
induces larger overheads, approx. 20%, as its larger size leads
to a lower locality of reference. The use of a coarser granularity,
in contrast, significantly reduces the instrumentation
overhead, to approx. 5%, at the cost, though, of spurious aborts
due to the risk of false positives in the conflict detection scheme.
As a matter of fact, the trade-off between instrumentation
overhead and access tracking granularity is well known in the
TM literature, e.g., [15].
In the right plot (CPU), we observe that the instrumentation
cost is on average around 5% for W2 for both TinySTM and
TSX. In all scenarios, the overhead is below 10% except for the
most write intensive variants of W1, where it remains anyway
below 20% even in presence of 90% of update transactions.
B. Efficiency in absence of inter-device contention
Next, we intend to assess which overheads SHeTM incurs
in workloads whose scalability is not limited by inter-device
contention. Here, we consider two variants of the W1 workload,
generating 100% (W1-100%) and 10% (W1-10%) update
transactions, respectively.
We avoid inter-device contention by partitioning the STMR
in two halves and restricting the CPU and GPU to access different
halves. The results of this study are reported in Figure 4, in which
we vary on the x-axis the duration of the execution phase from
1 msec to 600 msec and report on the y-axis the throughput
of SHeTM and of the following baselines: the basic variant
of SHeTM presented in Section IV-C, noted SHeTM basic;
TSX running solo, noted CPU-only; PR-STM running solo and
copying its STMR to the host, after executing a kernel, using
double buffering (i.e., without blocking), noted GPU-only.
Figure 4. Efficiency in absence of contention. Left plot: 100% update transactions. Right plot: 10% update transactions. (x-axis: Execution Phase (msec); y-axis: Throughput (MTX/s); series: CPU-only, GPU-only, SHeTM basic, SHeTM.)
Figure 5. Break-down of exec. times (100% update transactions). (CPU panel: % of time spent Idle, Non-blocking, Processing; GPU panel: % of time spent in Validation, DtH, Processing; both panels compare SHeTM basic and SHeTM as a function of the Execution Phase (msec).)
The throughput plot on the left, which refers to W1-100%,
shows that as the execution period grows the performance of
SHeTM also increases. This is expected, since the relative amount
of time spent performing the validation and merge phases
shrinks, amortizing their cost over a longer period of useful
processing (see right plot of Figure 4). The peak throughput
of approx. 17 M tx/sec is reached at 200 msecs, and throughput
plateaus beyond that value. SHeTM’s peak throughput is about 55%
higher than the peak throughput of CPU-only and GPU-only
(approx. 11 M tx/sec) and only 23% lower than the throughput
of an idealized system that could total the combined throughput
of both uninstrumented devices.
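The two percentages can be checked with a short calculation. It assumes, as the approximate figures above suggest, that each device alone peaks at about 11 M tx/sec and that the idealized system simply sums the two:

```latex
\[
\frac{17}{11} \approx 1.55
\quad (\text{about } 55\% \text{ above either device running solo}),
\]
\[
1 - \frac{17}{11 + 11} = 1 - \frac{17}{22} \approx 0.23
\quad (\text{about } 23\% \text{ below the idealized sum}).
\]
```

Since the reported throughputs are rounded, these values should be read as approximate.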
By contrasting the performance of SHeTM with that of SHeTM basic
we can clearly appreciate the performance gains enabled by the
optimizations described in Section IV-D, which are particularly
significant with small execution periods (up to +56% higher
throughput when the execution period lasts 1 msec). The bar
plots in Figure 5, which report the breakdown of times spent
by the CPU and GPU in various phases, allow us to derive
additional insights on the sources of these gains. The use of
double buffering on the GPU side to overlap kernel processing
with the device to host transfer in the merge phase is the largest
source of gains and, although the device to device copy has a
relatively larger cost for the smallest execution periods, the
gains it enables largely outweigh the costs it imposes. On the
CPU side, the ability to overlap transaction processing (noted
non-blocking in the figure) with the shipping of logs to the
GPU also has a meaningful impact on reducing the blocking
time, although not as strong as on the GPU side.
Figure 6. Sensitivity to inter-device contention. (x-axis: Probability of Conflict (%); y-axis: Throughput (w.r.t. CPU); series: GPU-only, SHeTM, SHeTM no early val.)
Finally, let us analyze the results reported in the right plot
of Figure 4, which refers to the workload with 10% of update
transactions. In this scenario, which considers a less extreme
(and arguably more realistic) application workload, we can see
that the peak throughput of SHeTM converges to 4 M tx/sec,
which is very close to the peak throughput achieved by an idealized
solution that achieves a performance equal to the combined
performance of the two devices. This is further evidence of the
efficiency of the proposed design.
C. Sensitivity to contention
We now consider the same workload as in the previous study,
but inject with a given probability a conflicting access at random
in the stream of writes generated by the CPU transactions.
We vary on the x-axis the inter-device conflict probability, fix
the duration of the execution phase at 80 msecs and compare, in
Figure 6, the performance of SHeTM with and without the early
validation mechanism. On the y-axis we report the throughput
normalized with respect to TSX (uninstrumented) running solo
and report, as reference, also the throughput achieved using
PR-STM, running solo with double buffering.
The analysis of this plot reveals several insights. The first
observation is that SHeTM consistently outperforms both TSX
and PR-STM for abort rates as high as 80%. Under medium
contention, e.g., 50% probability of conflict, SHeTM continues
to deliver a 40% gain over the fastest individual device (CPU).
Even when operating at the extreme 100% abort rate it incurs
only a modest overhead (approx. 20% if early validation
is disabled). Overall, these results confirm the robustness of
SHeTM’s performance even in adverse scenarios.
Early validation appears to be a powerful mechanism
to mitigate overhead, especially in medium-high contention
scenarios (60% and 80% abort rate). The only exception is
the case of 100% inter-device contention: in such an extreme
scenario (arguably not representative of the desirable operational
region of HeTM or of any other TM system), early
validation fails constantly, triggering the completion of the
current execution phase and device transfer of the CPU logs.
This is logically equivalent to operating with a much shorter
execution phase, which, as seen in Figure 4, tends to induce
longer blocking periods of the CPU.
D. MemcachedGPU
As mentioned, MemcachedGPU extends Memcached, a
popular in-memory object caching system, in order to use GPUs
to serve lookup requests for cached objects (GET operations).
Figure 7. Throughput of HeTM for Memcached with possible conflicts. (Left plot: Throughput (w.r.t. CPU) vs. Concurrent Execution (ms); right plot: Probability of Commit vs. Concurrent Execution (ms); series: GPU-only, SHeTM no-conflicts, SHeTM steal 20%, SHeTM steal 80%, SHeTM steal 100%.)
The original implementation of MemcachedGPU does not
integrate a TM. As such, its developers had to implement an
ad-hoc synchronization mechanism to propagate the effects of
updates to the cache (e.g., via PUT operations) to the contents
maintained in the GPU’s cache. Besides being non-trivial, the
synchronization scheme used in the original MemcachedGPU
system suffers from a notable limitation: PUT operations on the
GPU kernel need to be executed in a single-threaded fashion,
blocking any other concurrent GET operation. Both these
problems can be avoided thanks to the HeTM abstraction, which
we use to transparently keep the cache’s state synchronized
both on the CPU and GPU and to support (via its guest TM
library, PR-STM) the concurrent execution of state-changing
operations on both devices.
In this experiment, we use a cache with 1,000,000 sets, which
corresponds to a size of approx. 480MB. The sets are 8-way
associative, and the size of the key is 16 bytes while that of
the value is 32 bytes. We use LRU as the replacement policy in
case of eviction. The workload is composed of 99.9% GETs,
and the object popularity follows a Zipfian distribution with
parameter α = 0.5, which are typical settings for
evaluating caches [5].
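As a minimal sketch of how a workload with these characteristics could be
generated, the Python snippet below draws keys from a Zipf-like popularity
law with α = 0.5 and marks 99.9% of the operations as GETs. The function
names, the rank-to-key mapping (key 0 is assumed to be the most popular
object), and the fixed seed are illustrative assumptions and are not taken
from the evaluated system.

```python
import random
from itertools import accumulate


def zipf_cumulative_weights(num_keys, alpha):
    """Cumulative weights of a Zipf-like law: P(rank r) is proportional to 1 / r**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_keys + 1)]
    return list(accumulate(weights))


def generate_workload(num_ops, num_keys, alpha=0.5, get_ratio=0.999, seed=0):
    """Return a list of (operation, key) pairs.

    Assumption: key 0 has popularity rank 1, key 1 has rank 2, and so on.
    """
    rng = random.Random(seed)
    cum_weights = zipf_cumulative_weights(num_keys, alpha)
    keys = rng.choices(range(num_keys), cum_weights=cum_weights, k=num_ops)
    return [("GET" if rng.random() < get_ratio else "PUT", key) for key in keys]
```

For instance, generate_workload(100000, 1000) yields a trace dominated by
GETs in which low-numbered keys are requested far more often than
high-numbered ones. A much smaller key space than the one used in the
experiment is chosen here only to keep the example cheap to run.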
We consider 4 different workloads, defined as follows. In
the first workload (no-conflicts), we balance the load (i.e.,
cache operations) in input to the GPU and CPU by using the
last bit of the key accessed by an operation. This guarantees
that the input queues of the CPU and GPU can never contain
operations that access a common key, excluding the possibility
of inter-device contention.
We then emulate load-unbalanced scenarios, in which the
GPU receives progressively fewer input operations (e.g., due to
shifts in keys’ popularity) and starts stealing requests from
the CPU queue with increasing probability (steal 20% and
steal 80%). We also consider the extreme scenario in which
no device affinity is set to mitigate contention, so that both
devices access the same set of keys (steal 100%).
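The four workloads can be emulated along the following lines. This is a
sketch of one plausible reading of the setup, not the dispatcher used in the
experiments: device affinity is decided by the last bit of the key, the CPU
always draws from its own partition, and each GPU operation is drawn from
the CPU partition with probability steal_prob. The function names, the
mapping of bit values to devices, and the uniform key choice are all
assumptions made for illustration.

```python
import random


def owner(key):
    """Device affinity from the key's last bit (which value maps to which device is assumed)."""
    return "GPU" if key & 1 else "CPU"


def build_batches(keys, batch_size, steal_prob, seed=0):
    """Build one batch of keys per device for a given steal probability.

    steal_prob = 0.0 models the no-conflicts workload and
    steal_prob = 1.0 models the steal 100% workload.
    """
    rng = random.Random(seed)
    cpu_keys = [k for k in keys if owner(k) == "CPU"]
    gpu_keys = [k for k in keys if owner(k) == "GPU"]
    cpu_batch = [rng.choice(cpu_keys) for _ in range(batch_size)]
    gpu_batch = [
        rng.choice(cpu_keys if rng.random() < steal_prob else gpu_keys)
        for _ in range(batch_size)
    ]
    return cpu_batch, gpu_batch


def overlapping_keys(cpu_batch, gpu_batch):
    """Keys touched by both devices, i.e., candidates for inter-device conflicts."""
    return set(cpu_batch) & set(gpu_batch)
```

With steal_prob = 0.0 the two batches are disjoint by construction, so no
inter-device conflict is possible. As steal_prob grows, the set returned by
overlapping_keys grows with it, and larger batches make an overlap more
likely for any non-zero steal probability.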
Note that in this case the peak normalized throughput
achievable by an ideal solution, i.e., one that incurs no overhead
and totals the normalized throughput of both CPU-only
and GPU-only, is approx. 1.9.
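The arithmetic behind this bound can be made explicit with a short sketch.
Since throughput is normalized with respect to CPU-only, CPU-only equals 1.0
by construction. The GPU-only value of roughly 0.9 is inferred here from the
stated ideal of approx. 1.9 and is not a separately reported measurement.
The helper names are hypothetical.

```python
# Throughput is normalized w.r.t. CPU-only, so CPU-only is 1.0 by construction.
CPU_ONLY = 1.0
# Approximate ideal peak stated in the text.
IDEAL_PEAK = 1.9


def implied_gpu_only(ideal_peak=IDEAL_PEAK, cpu_only=CPU_ONLY):
    """GPU-only throughput implied by ideal = CPU-only + GPU-only (inferred value)."""
    return ideal_peak - cpu_only


def fraction_of_ideal(throughput, ideal_peak=IDEAL_PEAK):
    """Fraction of the ideal peak reached by a measured normalized throughput."""
    return throughput / ideal_peak
```

A measured normalized throughput can then be compared against the bound with
fraction_of_ideal, where a result of 1.0 would correspond to a solution with
no inter-device synchronization overhead at all.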
The plot shows that SHeTM achieves almost indistinguishable
performance in the no-conflicts and the steal 20%
scenarios, being in both cases less than 20% away from
the ideal solution and 80% better than both GPU-only and
CPU-only. The gains remain significant even when the
GPU steals operations from the CPU queue with 80%
probability (20% to 40% speed-up over CPU-only). Finally, it
is worth highlighting that, even when the contention-avoidance
dispatching mechanisms of SHeTM are not used and contention
is left unmitigated (steal 100%), SHeTM achieves robust
performance, with speed-ups of up to approx. 30% and
performance on par with CPU-only even when, due to the use of
overly large batch sizes, the inter-device conflict probability
(right plot) converges to 1.
VI. CONCLUSIONS AND FUTURE WORK
This work introduced the abstraction of Heterogeneous
Transactional Memory (HeTM). HeTM aims to facilitate
the programming of heterogeneous platforms by abstracting the
difficulties of data sharing across multiple physically separated
units via the illusion of a single transactional memory shared
among CPUs and (discrete) GPU(s).
Besides introducing the abstract semantics and programming
model of HeTM, we presented an efficient, yet modular,
implementation of the proposed abstraction, named Speculative
HeTM (SHeTM). We demonstrated the efficiency of
SHeTM via an extensive quantitative study based both on
synthetic benchmarks and on a porting of a popular object
caching system.
We argue that this work opens a number of novel research
questions related to defining alternative semantics and designs
for the HeTM abstraction. A specific question that we intend to
investigate in the future is how to extend SHeTM to orchestrate
the execution of multiple GPUs.
REFERENCES
[1] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU:
a unified platform for task scheduling on heterogeneous multicore
architectures,” Concurrency and Computation: Practice and Experience,
vol. 23, no. 2, pp. 187–198, 2011.
[2] W. Baek, N. Bronson, C. Kozyrakis, and K. Olukotun, “Implementing
and Evaluating Nested Parallel Transactions in Software Transactional
Memory,” in Proceedings of the Twenty-second Annual ACM Symposium
on Parallelism in Algorithms and Architectures, ser. SPAA ’10. New
York, NY, USA: ACM, 2010, pp. 253–262.
[3] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control
and Recovery in Database Systems. Boston, MA, USA: Addison-Wesley
Longman Publishing Co., Inc., 1986.
[4] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” version
4.0, 2013.
[5] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “Web caching
and Zipf-like distributions: evidence and implications,” in INFOCOM,
vol. 1. New York, NY, USA: IEEE, March 1999, pp. 126–134.
[6] D. Castro, P. Romano, and J. Barreto, “Hardware Transactional Memory
meets Memory Persistency,” in IPDPS. New York, NY, USA: IEEE,
2018, pp. 368–377.
[7] NVIDIA Corporation, “CUDA C Programming Guide,”
https://docs.nvidia.com/cuda/cuda-c-programming-guide/, 2015.
[8] L. Dalessandro, M. F. Spear, and M. L. Scott, “NOrec: streamlining
STM by abolishing ownership records,” in PPoPP, ACM. Bangalore,
India: ACM, 2010, pp. 67–78.
[9] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum,
“Hybrid Transactional Memory,” SIGOPS Oper. Syst. Rev., vol. 40,
no. 5, pp. 336–346, Oct. 2006.
[10] D. Didona, N. Diegues, A.-M. Kermarrec, R. Guerraoui, R. Neves, and
P. Romano, “ProteusTM: Abstraction Meets Performance in Transactional
Memory,” SIGOPS Oper. Syst. Rev., vol. 50, no. 2, pp. 757–771, Mar.
2016.
[11] N. Diegues, P. Romano, and S. Garbatov, “Seer: Probabilistic Scheduling
for Hardware Transactional Memory,” in SPAA, ser. SPAA ’15. New
York, NY, USA: ACM, 2015, pp. 224–233.
[12] A. Dragojević, R. Guerraoui, A. V. Singh, and V. Singh, “Preventing
Versus Curing: Avoiding Conflicts in Transactional Memories,” in
Proceedings of the 28th ACM Symposium on Principles of Distributed
Computing, ser. PODC ’09. New York, NY, USA: ACM, 2009, pp.
7–16.
[13] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell,
and J. Planas, “Ompss: a proposal for programming heterogeneous multi-
core architectures,” Parallel processing letters, vol. 21, no. 02, pp. 173–
193, 2011.
[14] R. Farber, Parallel Programming with OpenACC. Morgan Kaufmann,
2016.
[15] P. Felber, C. Fetzer, P. Marlier, and T. Riegel, “Time-Based Software
Transactional Memory,” IEEE Transactions on Parallel and Distributed
Systems, vol. 21, pp. 1793–1807, 2010.
[16] P. Felber, S. Issa, A. Matveev, and P. Romano, “Hardware Read-write
Lock Elision,” in Proceedings of the Eleventh European Conference on
Computer Systems, ser. EuroSys ’16. New York, NY, USA: ACM, 2016,
pp. 34:1–34:15.
[17] R. M. Fujimoto, “Parallel Discrete Event Simulation,” Commun. ACM,
vol. 33, no. 10, pp. 30–53, Oct. 1990.
[18] W. W. L. Fung, I. Singh, A. Brownsword, and T. Aamodt, “Kilo TM:
Hardware transactional memory for GPU architectures,” IEEE Micro,
vol. 32, no. 3, pp. 7–16, 2012.
[19] K. O. W. Group et al., “The opencl specification,” version, vol. 1, no. 29,
p. 8, 2008.
[20] O. W. Group et al., “OpenACC specification,” version 2.7, 2018.
[21] R. Guerraoui and M. Kapalka, “On the Correctness of Transactional
Memory,” in Proceedings of the 13th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, ser. PPoPP ’08. New
York, NY, USA: ACM, 2008, pp. 175–184.
[22] M. Harris, “Unified Memory in CUDA 6,” Nov. 2013. [Online].
Available: https://devblogs.nvidia.com/unified-memory-in-cuda-6/
[23] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy, “Composable
Memory Transactions,” in Proceedings of the Tenth ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, ser.
PPoPP ’05. New York, NY, USA: ACM, 2005, pp. 48–60.
[24] M. Herlihy and J. E. B. Moss, “Transactional Memory: Architectural
Support for Lock-free Data Structures,” SIGARCH Comput. Archit. News,
vol. 21, no. 2, pp. 289–300, May 1993.
[25] T. H. Hetherington, M. O’Connor, and T. M. Aamodt, “MemcachedGPU:
Scaling-up Scale-out Key-value Stores,” in Proceedings of the Sixth ACM
Symposium on Cloud Computing, ser. SoCC ’15. New York, NY, USA:
ACM, 2015, pp. 43–57.
[26] P. Hijma, C. J. Jacobs, R. V. van Nieuwpoort, and H. E. Bal, “Cashmere:
Heterogeneous many-core computing,” in 2015 IEEE International
Parallel and Distributed Processing Symposium. IEEE, 2015, pp. 135–
145.
[27] A. Holey and A. Zhai, “Lightweight Software Transactions on GPUs,”
in ICPP. New York, NY, USA: IEEE, 2014, pp. 461–470.
[28] D. Imbs and M. Raynal, “Virtual World Consistency: A Condition for
STM Systems (with a Versatile Protocol with Invisible Read Operations),”
Theor. Comput. Sci., vol. 444, pp. 113–127, Jul. 2012.
[29] Intel Corporation, “Desktop 4th Generation Intel Core Processor Family
(Revision 028),” Intel Corporation, Tech. Rep., 2015.
[30] T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and
D. I. August, “Automatic CPU-GPU communication management and
optimization,” in ACM SIGPLAN Notices, vol. 46, no. 6. ACM, 2011,
pp. 142–151.
[31] F. Ji, H. Lin, and X. Ma, “RSVM: a region-based software virtual
memory for GPU,” in Proceedings of the 22nd international conference
on Parallel architectures and compilation techniques. IEEE, 2013, pp.
269–278.
[32] N. Jouppi, “Google supercharges machine learning tasks with TPU
custom chip,” May 2016. [Online]. Available:
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
[33] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel, “TreadMarks:
Distributed Shared Memory on Standard Workstations and Operating
Systems,” in Proceedings of the USENIX Winter 1994 Technical
Conference on USENIX Winter 1994 Technical Conference, ser. WTEC’94.
Berkeley, CA, USA: USENIX Association, 1994, pp. 10–10.
[34] K. Li and P. Hudak, “Memory Coherence in Shared Virtual Memory
Systems,” ACM Trans. Comput. Syst., vol. 7, no. 4, pp. 321–359, Nov.
1989.
[35] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren,
“DudeTM: Building Durable Transactions with Decoupling for Persistent
Memory,” SIGOPS Oper. Syst. Rev., vol. 51, no. 2, pp. 329–343, Apr.
2017.
[36] M. Martin, C. Blundell, and E. Lewis, “Subtleties of Transactional
Memory Atomicity Semantics,” IEEE Comput. Archit. Lett., vol. 5, no. 2,
pp. 17–17, Jul. 2006.
[37] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan,
K. Strauss, and S. Swanson, “Atomic In-place Updates for Non-volatile
Main Memories with Kamino-Tx,” in Proceedings of the Twelfth
European Conference on Computer Systems, ser. EuroSys ’17. New
York, NY, USA: ACM, 2017, pp. 499–512.
[38] S. Mittal and J. S. Vetter, “A Survey of CPU-GPU Heterogeneous
Computing Techniques,” ACM Comput. Surv., vol. 47, no. 4, pp. 69:1–
69:35, Jul. 2015.
[39] S. Momcilovic, A. Ilic, N. Roma, and L. Sousa, “Dynamic Load
Balancing for Real-Time Video Encoding on Heterogeneous CPU+GPU
Systems,” IEEE Transactions on Multimedia, vol. 16, no. 1, pp. 108–121,
Jan 2014.
[40] T. Nakaike, R. Odaira, M. Gaudet, M. M. Michael, and H. Tomari,
“Quantitative Comparison of Hardware Transactional Memory for Blue
Gene/Q, zEnterprise EC12, Intel Core, and POWER8,” in Proceedings
of the 42nd Annual International Symposium on Computer Architecture,
ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 144–157.
[41] V. Pankratius and A.-R. Adl-Tabatabai, “A Study of Transactional
Memory vs. Locks in Practice,” in Proceedings of the Twenty-third
Annual ACM Symposium on Parallelism in Algorithms and Architectures,
ser. SPAA ’11. New York, NY, USA: ACM, 2011, pp. 43–52.
[42] C. H. Papadimitriou, “The Serializability of Concurrent Database
Updates,” J. ACM, vol. 26, no. 4, pp. 631–653, Oct. 1979.
[43] PCI-SIG, “PCI Express (Peripheral Component Interconnect Express),
PCIe Specification,” 2019. [Online]. Available: http://pcisig.com/
[44] A. Pellegrini, R. Vitali, and F. Quaglia, “The ROme OpTimistic Simulator:
Core Internals and Programming Model,” in Proceedings of the 4th
International ICST Conference on Simulation Tools and Techniques, ser.
SIMUTools ’11. Brussels, Belgium: ICST (Institute
for Computer Sciences, Social-Informatics and Telecommunications
Engineering), 2011, pp. 96–98.
[45] S. Peluso, J. Fernandes, P. Romano, F. Quaglia, and L. Rodrigues,
“SPECULA: Speculative Replication of Software Transactional Memory,”
in Proceedings of the 2012 IEEE 31st Symposium on Reliable Distributed
Systems, ser. SRDS ’12. Washington, DC, USA: IEEE Computer Society,
2012, pp. 91–100.
[46] T. Riegel, C. Fetzer, and P. Felber, “Time-based Transactional Memory
with Scalable Time Bases,” in Proceedings of the Nineteenth Annual
ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA
’07. New York, NY, USA: ACM, 2007, pp. 221–228.
[47] ——, “Automatic data partitioning in software transactional memories,”
in Proceedings of the Twentieth Annual Symposium on Parallelism in
Algorithms and Architectures. New York, NY, USA: ACM, 2008, pp.
152–159.
[48] P. Romano, R. Palmieri, F. Quaglia, N. Carvalho, and L. Rodrigues, “On
Speculative Replication of Transactional Systems,” J. Comput. Syst. Sci.,
vol. 80, no. 1, pp. 257–276, Feb. 2014.
[49] N. Shavit and D. Touitou, “Software Transactional Memory,” in
Proceedings of the Fourteenth Annual ACM Symposium on Principles of
Distributed Computing, ser. PODC ’95. New York, NY, USA: ACM,
1995, pp. 204–213.
[50] Q. Shen, C. Sharp, W. Blewitt, G. Ushaw, and G. Morgan, “PR-STM:
Priority Rule Based Software Transactions for the GPU,” in Euro-Par
2015: Parallel Processing. Springer, 2015, pp. 361–372.
[51] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts,
9th ed. Wiley Publishing, 2012.
[52] The Linux kernel development community, “Heterogeneous Memory
Management (HMM),” 2019. [Online]. Available:
https://www.kernel.org/doc/html/latest/vm/hmm.html
[53] J. D. Ullman, Principles of Database Systems, 2nd ed. New York, NY,
USA: W. H. Freeman & Co., 1983.
[54] A. Villegas, A. Navarro, R. Asenjo, and O. Plata, “Lightweight Software
Transactions on GPUs,” Supercomputing, 2018.
[55] A. Villegas and R. Ubal, “Stretching transactional memory,” in
TRANSACT 2016 - 11th ACM SIGPLAN Workshop on Transactional Computing.
ACM, 2015.
[56] Q. Wang, S. Kulkarni, J. Cavazos, and M. Spear, “A Transactional
Memory with Automatic Performance Tuning,” ACM Trans. Archit. Code
Optim., vol. 8, no. 4, pp. 54:1–54:23, Jan. 2012.
[57] Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian, “Software
transactional memory for gpu architectures,” in Proceedings of Annual
IEEE/ACM International Symposium on Code Generation and
Optimization. New York, NY, USA: ACM, 2014, p. 1.
[58] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar, “Performance
Evaluation of Intel® Transactional Synchronization Extensions for High-
Performance Computing,” in Proceedings of the International Conference
on High Performance Computing, Networking, Storage and Analysis, ser.
SC ’13. New York, NY, USA: ACM, 2013, pp. 19:1–19:11.
[59] Y. Yuan, M. F. Salmi, Y. Huai, K. Wang, R. Lee, and X. Zhang, “Spark-
GPU: An accelerated in-memory data processing engine on clusters,” in
2016 IEEE International Conference on Big Data (Big Data). IEEE,
2016, pp. 273–283.
[60] J. Zeng, J. Barreto, S. Haridi, L. Rodrigues, and P. Romano, “The
Future(s) of Transactional Memory,” in 2016 45th International Conference
on Parallel Processing (ICPP), Aug 2016, pp. 442–451.
[61] Z. Zhong, V. Rychkov, and A. Lastovetsky, “Data partitioning on
multicore and multi-GPU platforms using functional performance models,”
IEEE Transactions on Computers, vol. 64, no. 9, pp. 2506–2518, 2015.
