Active Access: A Mechanism for High-Performance Distributed Data-Centric
  Computations by Besta, Maciej & Hoefler, Torsten
arXiv:1910.12897v1 [cs.DC] 28 Oct 2019
Active Access: A Mechanism for High-Performance Distributed
Data-Centric Computations
Maciej Besta
Department of Computer Science
ETH Zurich
maciej.besta@inf.ethz.ch
Torsten Hoefler
Department of Computer Science
ETH Zurich
htor@inf.ethz.ch
ABSTRACT
Remote memory access (RMA) is an emerging high-performance
programming model that uses RDMA hardware directly. Yet, ac-
cessing remote memories cannot invoke activities at the target
which complicates implementation and limits performance of data-
centric algorithms. We propose Active Access (AA), a mechanism
that integrates well-known active messaging (AM) semantics with
RMA to enable high-performance distributed data-centric compu-
tations. AA supports a new programming model where the user
specifies handlers that are triggered when incoming puts and gets
reference designated addresses. AA is based on a set of extensions
to the Input/Output Memory Management Unit (IOMMU), a unit
that provides high-performance hardware support for remapping
I/O accesses to memory. We illustrate that AA outperforms exist-
ing AM and RMA designs, accelerates various codes such as dis-
tributed hashtables or logging schemes, and enables new protocols
such as incremental checkpointing for RMA. We also discuss how
extended IOMMUs can support a virtualized global address space
in a distributed system that offers features known from on-node
memory virtualization. We expect that AA can enhance the design
of HPC operating and runtime systems in large computing centers.
CCS CONCEPTS
• Networks → Network architectures; Programming interfaces;
Network types; • Computer systems organization → Distributed
architectures; Processors and memory architectures; •
Information systems → Distributed storage; Data management
systems; • Computing methodologies → Distributed algorithms;
Distributed programming languages; • Hardware → Networking
hardware; Emerging architectures; Emerging tools and methodologies;
Emerging interfaces; • Theory of computation → Distributed
algorithms; • Software and its engineering → Designing software;
This is an arXiv version of a paper
published at ACM ICS’15 under the same title
1 INTRODUCTION
Scaling on-chip parallelism alone cannot satisfy growing computa-
tional demands of datacenters and HPC centers with tens of thou-
sands of nodes [23]. Remote direct memory access (RDMA) [50], a
technology that completely removes the CPU and the OS from the
messaging path, enhances performance in such systems. RDMA
networking hardware gave rise to a new class of Remote Memory
Access (RMA) programming models that offer a Partitioned Global
Address Space (PGAS) abstraction to the programmer. Languages
such as Unified Parallel C (UPC) [57] or Fortran 2008 [35], and
libraries such as MPI-3 [41] or SHMEM implement the RMA prin-
ciples and enable direct one-sided low-overhead put and get access
to the memories of remote nodes, outperforming designs based on
the Message Passing (MP) model and the associated routines [29].
Active Messages (AMs) [59] are another scheme for improving
performance in distributed environments. An active message in-
vokes a handler at the receiver’s side and thus AMs can be viewed
as lightweight remote procedure calls (RPC). AMs are widely used
in a number of different areas (example libraries include IBM’s
DCMF, IBM’s PAMI, Myrinet Express (MX), GASNet [17], and
AM++ [60]). Unfortunately, AMs are limited to message passing
and cannot be directly used in RMA programming.
In this work we propose Active Access (AA), a mechanism that
enhances RMA with AM semantics. The core idea is that a remote
memory access triggers a user-definable CPU handler at the target.
As we explain in § 1.1, AA eliminates some of the performance
problems specific to RMA and RDMA.
Intercepting and processing puts or gets requires control logic
to identify memory accesses, to decide when and how to run a han-
dler, and to buffer necessary data. To preserve all RDMA benefits
in AA (e.g., OS-bypass, zero-copy, others), we propose a hardware-
based design that extends the input/output memory management
unit (IOMMU), a hardware unit that supports I/O virtualization.
IOMMUs evolved from simple DMA remapping devices to units of-
fering advanced hardware virtualization [9]. Still, AA shows that
many potential benefits of IOMMUs are yet to be explored. For ex-
ample, as we will later show (§ 3.5), moving the notification func-
tionality from the NIC to the IOMMU enables high performance
communication with the CPU for AA. Moreover, AA based on
IOMMUs can generalize the concept of virtual memory and enable
hardware-supported virtualization of networked memories with
enhanced data-centric paging capabilities.
In summary, our key contributions are as follows:
• We propose Active Access (AA), a mechanism that combines ac-
tive messages and RMA to improve the performance of RMA-
based applications and systems.
• We illustrate a detailed hardware design of simple extensions to
IOMMUs to construct AA.
• We show that AA enables a new data-centric programming
model that facilitates developing RMA applications.
• We evaluate AA using microbenchmarks and four large-scale
use-cases (a distributed hashtable, an access counter, a logging
system, and fault-tolerant parallel sort). We show that AA out-
performs other communication schemes.
• We discuss how the IOMMU could enable hardware-based vir-
tualization of remote memories.
1.1 Motivation
Consider a distributed hashtable (DHT): RMA programming improves
its performance by 2-10× in comparison to MP [29]. Yet, hash
collisions impact performance as handling them requires issuing
many expensive remote atomics (see § 4). Figure 1 shows how the
performance varies by a factor of ≈10 with different collision rates.
[Figure 1 plot. x-axis: Processes (0 to 1000); y-axis: Millions of inserts/second (0 to 100); one curve per collision rate: 3%, 7%, 14%, 25%.]
Figure 1: Inserts/s in our
RMA hashtable (§ 1.1)
Figure 2: Comparison of the
IOMMU and the MMU (§ 2.2)
We will show later (§ 4) how AA reduces the number of remote
accesses from six to one. Intuitively, the design of AA, based on
IOMMU remapping logic, intercepts memory requests and passes
them for direct processing to the local CPU. Thus, AA combines
the benefits of AMs and OS-bypass in RMA communications.
2 BACKGROUND
We now briefly outline RMA programming. Then, we discuss the
parts of the IOMMU design (DMA remapping, IOMMU paging)
that we later use to design Active Access.
2.1 RMA Programming Models
RMA is a programming model in which processes communicate by
directly accessing one another’s memories. RMA is typically built
on OS-bypass RDMA hardware to achieve highest performance.
Thus, RMA put (writes to remote memories) and get (reads from remote
memories) have very low latencies, and significantly improve
performance over MP [29]. RDMA is available in virtually all mod-
ern networks (e.g., IBM’s Cell on-chip network, InfiniBand [56],
IBM PERCS, iWARP, and RoCE). In addition, numerous existing
languages and libraries based on RMA such as UPC, Titanium, For-
tran 2008, X10, Chapel, or MPI-3.0 RMA are actively developed and
offer unique features for parallel programming. Consequently, the
number of applications in the RMA model is growing rapidly.
Here, we use source or target to refer to a process that issues or
is targeted by an RMA access. We always use sender and receiver
to refer to processes that exchange messages.
2.2 IOMMUs
IOMMUs are located between peripheral devices and main mem-
ory and can thus intercept any I/O traffic. Like the well-known
memory management units (MMUs), they can be programmed to
translate device addresses to physical host addresses. Figure 2 com-
pares MMUs and IOMMUs. An IOMMU can virtualize the view of
I/O devices and control access rights to memory pages.
All major hardware vendors such as IBM, Intel, AMD, Sun, and
ARM offer IOMMU implementations to support virtualized environments;
Table 1 provides an overview. We conclude that IOMMUs
are a standard part of modern computer architecture and the
recent growth in virtualization for cloud computing ensures that
they will remain important in the future. However, IOMMUs are a
relatively new concept with many unexplored opportunities. Their
ability to intercept any memory access and provide full address
space virtualization can be the basis for many novel mechanisms
for managing global (RDMA) address spaces.
Vendor      IOMMU and its application
AMD         GART [1]: address translation for the use by AGP
            DEV [9]: memory protection
            AMD IOMMU [2]: address translation & memory protection
IBM         Calgary PCI-X bridge [9]: address translation, isolation
            DART [9]: address translation, validity tracking
            IOMMU in Cell processors [9]: address translation, isolation
            IOMMU in POWER5 [5]: hardware enhanced I/O virtualization
            TCE [33]: enhancing I/O virtualization in pSeries 690 servers
Intel       VT-d [34]: memory protection, address translation
ARM         CoreLink SMMU [40]: memory management in System-on-Chip
            (SoC) bus masters, memory protection, address translation
PCI-SIG     IOV & ATS [48]: address translation, memory protection
Sun         IOMMU in SPARC [44]: address translation, memory protection
SolarFlare  IOMMU in SF NICs [49]: address translation, memory protection
Table 1: An overview of existing IOMMUs (§ 2.2).
To be as specific as possible, we selected Intel’s IOMMU technol-
ogy [34] to explain concepts of generic IOMMUs. Other implemen-
tations vary in some details but share the core features (per-page
protection, DMA remapping, etc.).
DMA Remapping DMA remapping is the IOMMU function
that we use extensively to design AA. The IOMMU remapping
logic allows any I/O device to be assigned to its own private sub-
set of host physical memory that is isolated from accesses by other
devices. To achieve this, IOMMUs utilize three types of remapping
structures (all located in main memory): root-entry tables, context-
entry tables, and IOMMU page tables. The first two are used to
map I/O devices to device-specific page tables. To improve the ac-
cess time, the remapping hardware maintains several caches such
as the context-cache (device-to-page-table mappings) and the I/O
Translation Lookaside Buffer (IOTLB) (translations from device ad-
dresses to host physical addresses).
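The two-step device lookup described above can be sketched in C. This is a simplified model, not the Intel implementation: it assumes the root-entry table is indexed by the PCI bus number and each context-entry table by the device/function number of the 16-bit requester ID. The struct and function names are hypothetical, and the context cache and IOTLB are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified remapping structures (all would live in main memory). */
typedef struct {
    bool     present;
    uint64_t page_table_root;  /* address of the device-specific IOMMU page table */
} context_entry;

typedef struct {
    const context_entry *context_table;  /* 256 entries, or NULL if absent */
} root_entry;

/* Map a requester ID (bus:device.function) to its context entry.
   Returns NULL when the device has no mapping, i.e., the access is blocked. */
static const context_entry *resolve_device(const root_entry roots[256],
                                           uint16_t requester_id) {
    assert(roots != NULL);
    unsigned bus   = requester_id >> 8;     /* selects the root entry */
    unsigned devfn = requester_id & 0xFFu;  /* selects the context entry */
    const context_entry *table = roots[bus].context_table;
    if (table == NULL || !table[devfn].present) {
        return NULL;
    }
    return &table[devfn];
}
```

In real hardware a context-cache hit would skip both table reads; the sketch only shows the miss path.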
Page Tables & Page Faults IOMMU page tables allow host physical
memory to be managed hierarchically; they are similar to standard
MMU page tables (still, MMU and IOMMU page tables are
independent). A 4-level table allows 4 KB page granularity on 64-bit
machines (superpages of various sizes are also supported).
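As a worked illustration of the 4-level layout, the following C sketch splits a device address into four table indices and a page offset. It assumes 4 KB pages (12 offset bits) and 9 index bits per level, i.e., a 48-bit address; the macro and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for a 4-level table with 4 KB pages:
   12 offset bits + 4 levels x 9 index bits = 48 address bits. */
#define PAGE_SHIFT 12u
#define LEVEL_BITS 9u
#define LEVELS     4u

/* Index into the table at `level` (1 = leaf table, 4 = top-level table). */
static unsigned table_index(uint64_t dev_addr, unsigned level) {
    assert(level >= 1 && level <= LEVELS);
    unsigned shift = PAGE_SHIFT + LEVEL_BITS * (level - 1u);
    return (unsigned)((dev_addr >> shift) & ((1u << LEVEL_BITS) - 1u));
}

/* Byte offset inside the 4 KB page. */
static unsigned page_offset(uint64_t dev_addr) {
    return (unsigned)(dev_addr & ((1u << PAGE_SHIFT) - 1u));
}
```

A page walk would use `table_index(addr, 4)` first and `table_index(addr, 1)` last to reach the leaf PTE.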
IOMMU page tables implement a page fault mechanism simi-
lar to MMUs. Every page table entry (PTE) contains two protec-
tion bits, W and R, which indicate whether the page is writable and
readable, respectively. Any access that violates the protection con-
ditions is blocked by the hardware and a page fault is generated,
logged, and reported to the OS. The IOMMU logs the fault infor-
mation using special registers and in-memory fault logs. The OS is
notified using Message Signaled Interrupts (MSI). Every page fault is
logged as a fixed-sized fault entry that contains the fault metadata
(the address of the targeted page, etc.); the data being transferred
is discarded. We will later extend this mechanism to log active ac-
cesses and their data and to bypass the OS.
Figure 3: The overview of the IOMMU and the cooperating devices. The proposed extensions are marked with dashed edges and bold-italic
text. Solid circles with numbers ( - ) indicate the specific steps discussed in detail in § 3.1. Dashed circles ( - ) are extensions pointed
out in § 3.1-§ 3.7.
3 THE ACTIVE ACCESS MECHANISM
Active Access combines the benefits of RMA and AMs. AMs en-
hance the message passing model by allowing messages to ac-
tively integrate into the computation on the receiver side. In RMA,
processes communicate by accessing remote memories instead of
sending messages. Thus, an analogous scheme for RMA has to pro-
vide the active semantics for both types of remote operations; puts
and gets become active puts (AP) and active gets (AG), respectively.
Listing 1 shows the interface of AM and AA. An active mes-
sage sent to a process receiver_id carries arguments and
payload that will be used by a handler identified by a pointer
hlr_addr. In AA, the user issues puts and gets at trgt_addr
in the address space of a process trgt_id. No handler address is
specified. Instead, we enable the user to associate an arbitrary page
of data with a selected handler and with a set of additional actions
(discussed in § 3.3 and § 3.4). When a put or a get touches such a
page, it becomes active: first, it may or may not finalize its default
memory effects (depending on the specified actions); second, both
its metadata and data are ultimately processed by the associated
handler. AA is fully transparent to RMA and, as Listing 1 shows, it
entails no changes to the traditional interface.
/***************** interface of AM ****************/
void send_active_message(ptr hlr_addr, void* arguments,
                         void* payload, int receiver_id) { ... }
/***************** interface of AA ****************/
void put(void* trgt_addr, void* data, int trgt_id) {
  /* Attempt to copy data from the local memory into
     the memory location trgt_addr of a process trgt_id. */
}
void get(void* trgt_addr, void* l_addr, int trgt_id) {
  /* Attempt to fetch the data from the memory location
     trgt_addr of a process trgt_id to l_addr. */
}
void assoc_page(void* addr, void* act, int hlr_id) {
  /* Associate a page at addr with actions act and
     with a handler identified by the id hlr_id. */
}
Listing 1: Interface of Active Messages and Active Access (§ 3)
We now show how to extend IOMMUs to implement the above
AA interface and to enable active puts and gets. From now on, we
will focus on designs based on PCI Express (PCIe) [47]. We first
describe the interactions between an RDMA request and current
IOMMUs. The numbers in circles ( - ) refer to the correspond-
ing numbers in Figure 3.
3.1 State-of-the-art IOMMU Processing Path
Consider an RDMA put or a get that is issued by a remote pro-
cess. First, the local NIC receives the RDMA packet . The NIC
attempts to access the main memory with DMA and thus it gen-
erates appropriate PCIe packets (one or more depending on the
type of the PCIe transaction [47]) . Each packet is intercepted
by the IOMMU . First, the IOMMU resolves the mapping from
the device to its page table (using the packet header [34]). Here,
the IOMMU uses the context cache or, in case of a cache miss, it
walks the remapping tables - . Finally, the IOMMU obtains the
location of the specific page-table . The IOMMU then resolves
the mapping from a device address to a physical address using the
IOTLB or, in the event of a miss, the page-table . When it
finds the target PTE, it checks its protection bits W and R . The
next steps depend on the request type. For puts, if W=1, the value
is simply written to the target location. If W=0, the IOMMU raises
a page fault and does not modify the page. For gets, if R=1, the
request returns the accessed value to the NIC. If R=0, the IOMMU
raises a page fault and does not return the value.
Upon a page fault, the IOMMU tries to record the fault informa-
tion (fault entry) in the system-wide fault log (implemented as
an in-memory ring buffer). In case of an overflow (e.g., if the OS
does not process the recorded entries fast enough) the fault entry is
not recorded. If the fault entry is logged , the IOMMU interrupts
the CPU with MSI to run one of the specified handlers .
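The protection check and the fault path described above can be modeled with a small C sketch. It is a software stand-in under stated assumptions: the ring capacity, the struct fields, and the function names are hypothetical, and the MSI is represented by a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAULT_LOG_SLOTS 4u  /* hypothetical capacity of the ring buffer */

typedef struct {
    uint64_t page_addr;  /* address of the targeted page */
    bool     is_write;   /* true for a put, false for a get */
} fault_entry;

typedef struct {
    fault_entry slots[FAULT_LOG_SLOTS];
    size_t head;   /* oldest unprocessed entry (consumed by the OS) */
    size_t count;  /* number of unprocessed entries */
} fault_log;

/* Try to append a fault entry; on overflow the entry is dropped. */
static bool fault_log_record(fault_log *log, fault_entry e) {
    assert(log != NULL);
    if (log->count == FAULT_LOG_SLOTS) {
        return false;  /* overflow: nothing is recorded */
    }
    log->slots[(log->head + log->count) % FAULT_LOG_SLOTS] = e;
    log->count++;
    return true;
}

/* Check the protection bits of the target PTE for one request.
   Returns true if the access proceeds. On a fault, *raise_msi tells
   whether the entry was logged and the CPU should be interrupted. */
static bool handle_access(fault_log *log, bool r, bool w, bool is_write,
                          uint64_t page_addr, bool *raise_msi) {
    bool allowed = is_write ? w : r;
    *raise_msi = false;
    if (!allowed) {
        fault_entry e = { page_addr, is_write };
        *raise_msi = fault_log_record(log, e);
    }
    return allowed;
}
```

Note that the data of a blocked put never appears in `fault_entry`; this is the limitation that the access log is meant to remove.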
We now analyze the extensions that enable active puts/gets
(symbols - refer to the related symbols in Figure 3). Our goal
is to enable the IOMMU to multiplex intercepted accesses among
processes, buffer them in designated memory locations, and pass
them for processing to a CPU.
3.2 Processing the Intercepted Data
In the original IOMMU design the fault log is shared by all the
processes running on the node where the IOMMU resides; a po-
tential performance bottleneck. In addition, the IOMMU does not
enable multiplexing the data coming from the NIC across the pro-
cesses and handlers, limiting performance in multi/manycore en-
vironments. Finally, the data of a blocked RDMA put is lost as the
fault log entry only records the address (see steps - ). To alleviate
these issues, we propose to enhance the design of the IOMMU
and its page tables to enable a data-centric multiplexing mechanism
in which the PTEs themselves guide the incoming requests to
be recorded in the specified logging data structures and processed by
the designated user-space handlers.
(a) An active put (§ 3.3) (b) An active get (§ 3.4)
Figure 4: Active puts and gets. Here, the numbers in circles are independent of the numbering in Figure 3.
We first add a programmable field IOMMU User Domain ID
(IUID) to every IOMMU PTE . This field enables associating
pages with user domains. The OS and the NIC can ensure that
one IUID is associated with at most one local process, similarly
to Protection Domains in RDMA [50]. To add IUID we use bits
52-61 of IOMMU PTEs (ignored by the current IOMMU hardware [34]).
These 10 bits can encode 2^10 = 1024 domains on each node; enough to
fully utilize, e.g., BlueGene/Q (64 hardware threads/node) or In-
tel Xeon Phi (256 hardware threads/chip). Second, our extended
IOMMU logs both the generated fault entry and the carried data
to the access log, a new in-memory circular ring buffer . A pro-
cess can have multiple private IUIDs/access logs located in its ad-
dress space . Third, the IOMMU maintains the access log ta-
ble , a simple internal associative data structure with tuples
(IUID,base,head,tail,size). One entry maps an IUID to
three physical addresses (the base, the head, and the tail pointer)
and the size of the respective access log ring buffer. The access log
table is implemented as content addressable memory (CAM) for
rapid access and it can be programmed in the same way as other
Intel IOMMU structures [34].
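The IUID field and the access log table can be illustrated with the following C sketch. It assumes the IUID occupies PTE bits 52-61 as stated above; the helper names are hypothetical, and a linear search stands in for the CAM lookup that the hardware would perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IUID_SHIFT 52u
#define IUID_MASK  0x3FFull  /* 10 bits, i.e., up to 1024 user domains */

/* Store an IUID in bits 52-61 of a PTE, leaving all other bits intact. */
static uint64_t pte_set_iuid(uint64_t pte, unsigned iuid) {
    assert(iuid <= IUID_MASK);
    pte &= ~(IUID_MASK << IUID_SHIFT);
    return pte | ((uint64_t)iuid << IUID_SHIFT);
}

static unsigned pte_get_iuid(uint64_t pte) {
    return (unsigned)((pte >> IUID_SHIFT) & IUID_MASK);
}

/* One tuple (IUID, base, head, tail, size) of the access log table. */
typedef struct {
    unsigned iuid;
    uint64_t base, head, tail;  /* physical addresses */
    uint64_t size;              /* size of the ring buffer in bytes */
} access_log_entry;

/* Software stand-in for the CAM: find the access log of a given IUID. */
static const access_log_entry *lookup_access_log(const access_log_entry *tab,
                                                 size_t n, unsigned iuid) {
    for (size_t i = 0; i < n; i++) {
        if (tab[i].iuid == iuid) {
            return &tab[i];
        }
    }
    return NULL;
}
```

On an intercepted access, the IOMMU would read the IUID from the matching PTE and use it as the key into this table to locate the target ring buffer.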
3.3 Controlling Active Puts (APs)
Active puts enable redirecting data coming from the NIC and/or re-
lated metadata to a specified access log. Figure 4a illustrates active
puts in more detail. Two additional PTE bits control the logging of
fault entry and data: WL (Write Log) and WLD (Write Log Data) .
If WL=1 then the IOMMU logs the fault entry for the written page
and if WLD=1 then the IOMMU logs both the fault entry and the
data. The flags W, WL and WLD are independent. For example, an ac-
tive put page is marked as W=0, WL=1, WLD=1. The standard way,
in which IOMMUs manage faults triggered by writes, is defined by
the values W=0, WL=1, WLD=0.
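The effect of the three write-related flags on an incoming put can be summarized in a short C sketch. The types and the function name are hypothetical; the logic only restates the rules above (W controls the memory write, WL requests a logged fault entry, WLD requests the entry plus the data).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool w, wl, wld; } put_flags;

typedef struct {
    bool write_memory;  /* the put modifies the target page */
    bool log_entry;     /* the fault entry (metadata) goes to the access log */
    bool log_data;      /* the payload goes to the access log as well */
} put_actions;

/* Derive the IOMMU's actions for a put from the PTE flags. */
static put_actions classify_put(put_flags f) {
    put_actions a;
    a.write_memory = f.w;
    a.log_data     = f.wld;
    a.log_entry    = f.wl || f.wld;  /* WLD implies that the entry is logged too */
    return a;
}
```

For instance, an active put page (W=0, WL=1, WLD=1) yields no memory write but a logged entry with data, while W=0, WL=1, WLD=0 reproduces the standard fault behavior.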
3.4 Controlling Active Gets (AGs)
Active gets enable the IOMMU to log a copy of the remotely ac-
cessed data locally. When a get succeeds and the returned data is
flowing from the main memory to the NIC, it is replicated by the
IOMMU and saved in the access log (see Figure 4b). Similar to ac-
tive puts, two additional PTE bits control the logging behavior of
such accesses: RL (Read Log) and RLD (Read Log Data) . If RL=1
then the IOMMU logs the fault entry for the read page. If RLD=1,
the IOMMU logs the fault entry and the returned data.
The proposed control bit extensions for active puts and active
gets can easily be implemented in practice. For example, bits 7-10
in the Intel IOMMU PTEs are ignored [34]. These bits can be used
to store WL, WLD, RL, and RLD.
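One possible encoding of the four control bits is sketched below in C. The assignment of WL, WLD, RL, and RLD to the individual bits 7-10 is an assumption made for illustration (the text only states that these four bits are available), as is the placement of R and W in bits 0 and 1; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_R   (1ull << 0)   /* readable (assumed position) */
#define PTE_W   (1ull << 1)   /* writable (assumed position) */
#define PTE_WL  (1ull << 7)   /* Write Log */
#define PTE_WLD (1ull << 8)   /* Write Log Data */
#define PTE_RL  (1ull << 9)   /* Read Log */
#define PTE_RLD (1ull << 10)  /* Read Log Data */

static_assert((PTE_WL | PTE_WLD | PTE_RL | PTE_RLD) == 0x780ull,
              "control bits must occupy bits 7-10");

/* Mark a page for active puts: W=0, WL=1, WLD=1. */
static uint64_t pte_mark_active_put(uint64_t pte) {
    return (pte & ~PTE_W) | PTE_WL | PTE_WLD;
}

/* Mark a page for active gets: R=1, RL=1, RLD=1. */
static uint64_t pte_mark_active_get(uint64_t pte) {
    return pte | PTE_R | PTE_RL | PTE_RLD;
}
```

Both helpers leave the address bits and the remaining flags of the PTE untouched.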
3.5 Interactions with the Local CPU
Finally, the IOMMU has to notify the CPU to run a handler to pro-
cess the logs. Here, we discuss interrupts/polling and we propose a
new scheme where the IOMMU directly accesses the CPU, bypass-
ing the main memory.
Interrupts Here, one could use a high-performance MSI
wakeup mechanism analogous to the scheme in InfiniBand [56].
The developer specifies conditions for triggering interrupts (when
the amount of free space in an access log is below a certain thresh-
old, or at pre-determined intervals). If the access log is sufficiently
large and messages are pipelined then the interrupt latency may
not influence the overall performance significantly (cf. § 5.2).
Polling As the IOMMU inserts data directly into a user address
space, processes can monitor the access log head/tail pointers and
begin processing the data when required. Polling can be done ei-
ther directly by the user, or by a runtime system that runs the han-
dlers transparently to the user.
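The polling scheme can be sketched as a user-space consumer loop in C. This is a single-threaded model under several assumptions: the record layout, the slot count, and all names are hypothetical; the IOMMU is assumed to advance tail and the CPU to advance head; and `iommu_append` merely simulates the hardware. A real implementation would additionally need appropriate memory ordering between the IOMMU and the polling thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_SLOTS 8u  /* hypothetical number of records in the ring */

typedef struct { uint64_t addr; uint64_t data; } log_record;

typedef struct {
    log_record slots[LOG_SLOTS];
    size_t head;  /* next record to process (advanced by the CPU) */
    size_t tail;  /* next free slot (advanced by the IOMMU) */
} access_log;

typedef void (*handler_fn)(const log_record *rec, void *ctx);

/* Run the handler on every pending record; returns how many were processed. */
static size_t poll_access_log(access_log *log, handler_fn handler, void *ctx) {
    assert(log != NULL && handler != NULL);
    size_t processed = 0;
    while (log->head != log->tail) {
        handler(&log->slots[log->head], ctx);
        log->head = (log->head + 1u) % LOG_SLOTS;
        processed++;
    }
    return processed;
}

/* Stand-in for the IOMMU appending one intercepted access.
   Returns 0 if the ring is full (one slot is kept free). */
static int iommu_append(access_log *log, log_record rec) {
    size_t next = (log->tail + 1u) % LOG_SLOTS;
    if (next == log->head) {
        return 0;
    }
    log->slots[log->tail] = rec;
    log->tail = next;
    return 1;
}

/* Example handler: accumulate the logged data values. */
static void sum_handler(const log_record *rec, void *ctx) {
    *(uint64_t *)ctx += rec->data;
}
```

A runtime system could call `poll_access_log` periodically on behalf of the user, which corresponds to the transparent variant mentioned above.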
Direct CPU Access This mechanism is motivated by the architectural
trends to place scratchpad memories on processing units, a
common practice in today’s NVIDIA GPUs [46] and several multi-
core architectures [36]. One could add a scratchpad to the CPU ,
connect it directly with the IOMMU , and place the head/tail
pointers in it. A dedicated hyperthread polls the pointers and
runs the handlers if a free entry is available . If the size of the
handler code is small, it can also be placed in the scratchpad, fur-
ther reducing the number of memory accesses .
The IOMMU and the CPU also have to synchronize while pro-
cessing the access log. This can be done with a simple lock for
mutual access. The IOMMU and the CPU could also synchronize
with the pointers from the access log table.
3.6 Consistency Model
We now enhance AA to enable a weak consistency model similar
to MPI-3 RMA [32]. In RMA, a blocking flush synchronizes non-
blocking puts/gets. In AA, we use an active flush (flush(int
target_id)) to enforce the completion of active accesses is-
sued by the calling process and targeted at target_id. One way
to implement active flushes could be to issue an active get tar-
geted at a special designated flushing page in the address space of
target_id. The IOMMU, upon intercepting this get, would wait
until the CPU processes the related access log and then it would
finish the get to notify the source that the accesses are committed.
Extending the IOMMU We add an IOMMU internal
data structure called the flushing buffer to store tuples
(address,IUID,active,requester-ID,tag), where
address is the address of the flushing page, active is a binary
value initially set to false, and requester-ID,tag are
values of two PCIe packet fields with identical names; they are
initially zeroed and we discuss them later in this section. The
flushing buffer is implemented as CAM for rapid access.
Selecting a Flushing Page To enable a selection of a flush-
ing page, the system could reserve a high virtual address for this
purpose. We then add a respective entry (with the selected address
and the related IUID) to the flushing buffer.
Finishing an Active Flush The IOMMU intercepts the
issued get, finds the matching entry in the flushing buffer,
sets active=true, copies the values of the tag and
requester-ID PCIe fields to the matching entry, and dis-
cards the get. Processing of a targeted access log is then initiated
with any scheme from § 3.5, depending on user’s choice.
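The first half of an active flush (intercepting the get at the flushing page) can be modeled in C as follows. The sketch assumes 4 KB pages and stores both PCIe fields in 16-bit integers; the struct layout and the function name are hypothetical, and a linear search again stands in for the CAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLUSH_PAGE_MASK (~0xFFFull)  /* assumes 4 KB pages */

/* One tuple (address, IUID, active, requester-ID, tag) of the flushing buffer. */
typedef struct {
    uint64_t address;       /* page-aligned address of the flushing page */
    unsigned iuid;
    bool     active;        /* initially false */
    uint16_t requester_id;  /* initially zero */
    uint16_t tag;           /* initially zero */
} flush_entry;

/* Called for an intercepted get. If it targets a flushing page, the entry is
   activated, the PCIe fields are saved for the later reply, and the entry is
   returned (the get itself is discarded). Otherwise NULL is returned and the
   get is handled as usual. */
static flush_entry *begin_active_flush(flush_entry *buf, size_t n,
                                       uint64_t get_addr,
                                       uint16_t requester_id, uint16_t tag) {
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (buf[i].address == (get_addr & FLUSH_PAGE_MASK)) {
            buf[i].active = true;
            buf[i].requester_id = requester_id;
            buf[i].tag = tag;
            return &buf[i];
        }
    }
    return NULL;
}
```

The returned entry carries the IUID of the access log whose processing has to be triggered next.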
Alternative Mechanism The proposed consistency mecha-
nism sacrifices one page from the user virtual address space. To
alleviate this, we offer a second scheme similar to the semantics
offered by, e.g., GASNet [17]. Here, AA does not guarantee any
consistency. Instead, it allows the user to develop the necessary
consistency by issuing a reply (implemented as an active put) from
within the handler. This reply informs which elements from the
access log have been processed. To save bandwidth, replies can be
batched. The reply would be targeted at a designated page (with
bits W=0, WL=1, WLD=1) with an IUID pointing to a designated ac-
cess log. The user would poll the log and use the replies to enforce
an arbitrary desired consistency.
Mixing AA/RMA Accesses At times, mixing AA and RMA
puts/gets may be desirable (see § 4.1). The consistency of such
a mixed scheme can be managed with active and traditional
RMA flushes: these two calls are orthogonal. Completing pending
AA/RMA accesses is enforced with AA/RMA flushes, respectively.
3.7 Hardware Implementation Issues
We now describe solutions to several PCIe and RDMA control flow,
ordering, and backward compatibility issues in the proposed exten-
sions. If the reader is not interested in these details then they may
skip this part and proceed directly to Section 4. Numbers and cap-
itals in circles refer to Figure 3.
Logging Data of PCIe Write Requests Every RDMA put is
translated into one or more PCIe write requests flowing from the
NIC to the main memory. The ordering rules for Posted Requests
from the PCIe specification [47] (§ 2.4.1, entry A2a) ensure that the
packets for the same request arrive in order. Consequently, such
packets can simply be appended to the log.
Logging Data of PCIe Read Requests A PCIe read transac-
tion consists of one read request (issued by the NIC) and one or
more read replies (issued by the memory controller). The IOMMU
has to properly match the incoming and outgoing PCIe packets.
For this, we first enable the IOMMU to intercept PCIe packets flow-
ing back to the NIC (standard IOMMUs process only incoming
memory accesses). Second, we add the packet tag buffer (im-
plemented as CAM) to the IOMMU to temporarily maintain infor-
mation about PCIe packets. The IOMMU would add the transac-
tion tags of incoming PCIe read requests that access a page where
RLD=1 to the tag buffer. PCIe read reply packets are then matched
against the buffer and logged if needed. We require the tag buffer
as PCIe read replies only contain seven lower bits of the address
of the accessed memory region (see § 2.2.9 in [47]), preventing the
IOMMU from matching incoming requests with replies. The PCIe
standard also ensures ordering of read replies (see § 2.3.1.1 in [47]).
Order of PCIe Packets from Multiple Devices The final
ordering issue concerns multiple RMA puts or gets concurrently
targeted at the same IOMMU. If several multi-packet accesses orig-
inate from different devices then the IOMMU may observe an
incoming arbitrary interleaving of PCIe packets. To correctly re-
assemble the packets in the access log, we extend the tag buffer so
that it also stores pointers into the access log. Upon intercepting
the first PCIe packet of a new PCIe transaction, the IOMMU in-
serts a tuple (transaction-tag, tail) into the tag buffer.
Then, the IOMMU records the packet in the access log, and adds
the size of the whole PCIe transaction to the tail pointer. Thus, if
some transactions interleave, the IOMMU leaves “holes” in the access
log and fills these holes when the appropriate PCIe packets arrive¹.
The IOMMU removes an entry from the buffer after processing
the last transaction packet. To ensure that the CPU only processes
packets with no holes, the IOMMU increments tail or sets ap-
propriate synchronization variables only when each PCIe packet
of the next transaction is recorded in the respective access log.
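The hole-filling scheme can be illustrated with a small C model. It is a sketch under simplifying assumptions: the log is linear (no wrap-around), every packet call carries the total transaction size (used only for the first packet of a tag), and all names and capacities are hypothetical. The `committed` offset plays the role of the tail value that is visible to the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_BYTES 256u  /* hypothetical log capacity */
#define MAX_TXN   4u    /* hypothetical tag buffer capacity */

typedef struct {
    unsigned tag;       /* PCIe transaction tag */
    size_t start;       /* log offset reserved for this transaction */
    size_t total;       /* size of the whole transaction in bytes */
    size_t received;    /* bytes recorded so far */
} txn;

typedef struct {
    unsigned char log[LOG_BYTES];
    size_t reserved;         /* end of the space reserved so far */
    size_t committed;        /* the CPU may read log[0..committed) */
    txn    pending[MAX_TXN]; /* tag buffer, kept in reservation order */
    size_t npending;
} reassembler;

/* Record one packet. Returns 0 on success, -1 if the packet cannot be
   accepted (where the hardware would apply backpressure instead). */
static int on_packet(reassembler *r, unsigned tag, size_t total,
                     const void *data, size_t len) {
    assert(r != NULL && data != NULL);
    txn *t = NULL;
    for (size_t i = 0; i < r->npending; i++) {
        if (r->pending[i].tag == tag) { t = &r->pending[i]; break; }
    }
    if (t == NULL) {  /* first packet: reserve space, possibly leaving a hole */
        if (r->npending == MAX_TXN || total > LOG_BYTES - r->reserved) return -1;
        t = &r->pending[r->npending++];
        t->tag = tag; t->start = r->reserved; t->total = total; t->received = 0;
        r->reserved += total;
    }
    if (len > t->total - t->received) return -1;
    memcpy(r->log + t->start + t->received, data, len);
    t->received += len;
    /* Publish only a hole-free prefix: retire completed transactions in order. */
    while (r->npending > 0 && r->pending[0].received == r->pending[0].total) {
        r->committed = r->pending[0].start + r->pending[0].total;
        memmove(&r->pending[0], &r->pending[1], (r->npending - 1) * sizeof(txn));
        r->npending--;
    }
    return 0;
}
```

With two interleaved two-packet transactions A and B, `committed` stays at zero until A (the older reservation) is complete, even if B finishes first.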
Control Flow IOMMUs, unlike MMUs, cannot suspend a re-
mote process and thus buffers may overflow if they are not emptied
fast enough. To avoid data loss, we utilize the backpressure mech-
anism of the PCIe transaction layer protocol (TLP) as described in
§ 2.6.1. in [47]. This will eventually propagate through a reliable
network and block the sending process(es). Issues such as head of
line blocking and deadlocks are similar to existing reliable network
technologies and require efficient programming at the application
layer (regular emptying of the queues). Head of line blocking can
also be avoided by dropping packets and retransmission [6].
Support for Legacy Codes Some codes may rely on the de-
fault IOMMU behavior to buffer the metadata in the default fault
log . To cover such cases, we add the E bit to IOMMU PTEs
to determine if the page fault is recorded in the fault log (E=0) or
in one of the access logs (E=1).
¹ PCI Express 3.0 Specification limits the PCIe transaction size to 4 KB. Thus, the
maximum size of an active put or get also amounts to 4 KB. This limitation can be easily
overcome in future PCIe systems.
4 ACTIVE ACCESS PROGRAMMING
We now discuss example RMA-based codes that leverage AA. AA
improves the application performance by reducing the amount
of communication and remote synchronization, enhancing local-
ity [55]. First, it reduces the number of puts, gets, and remote
atomic operations in distributed data structures and other codes
that perform complex remote memory accesses. For example, en-
queueing an element into a remote queue costs at least two remote
accesses (atomically get and increment the tail pointer and put the
element). With AA, this would be a simple put to the list address
and a handler that inserts the element; our DHT example is very
similar. Second, as handlers are executed by local cores, the usual
on-node synchronization schemes are used with no need to issue
expensive remote synchronization calls, e.g., remote locks.
4.1 Designing Distributed Hash Tables
DHTs are basic data structures that are used to construct dis-
tributed key-value stores such as Memcached [26]. In AA, the DHT
is open and each process manages its part called the local volume.
The volume consists of a table of elements and an overflow heap
for elements with hash collisions. Both the table and the heap are
implemented as fixed-size arrays. To avoid costly array traversals,
pointers to most recently inserted items and to the next free cells
are stored along with the remaining data in each local volume.
Due to space constraints we discuss inserts and then we
only briefly outline lookups and deletes. In RMA, inserts are
based on atomics [53] (Compare-and-Swap and Fetch-and-Op, de-
noted as cas and fao), RMA puts (rma_put) and RMA flushes
(rma_flush); see Listing 2. For simplicity we assume that atom-
ics are blocking. The semantics of CAS are as follows: int
cas(elem, compare, target, owner); if compare ==
target then target is changed to elem and its previous
value is returned. For FAO we have int fao(op, value,
target, owner); it applies an atomic operation op to target
using a parameter value, and returns target’s previous value.
In both cas and fao, owner is the id of the process that owns
the targeted address space. The semantics for rma_put and
rma_flush are the same as for AA puts and flushes (cf. § 3, § 3.6).
∅ indicates that the specific array cell is empty. To insert elem, we
first issue a cas (line 9). Upon a collision we acquire a new ele-
ment in the overflow heap (line 10). We then insert elem into the
new position (lines 12-13), update the respective last pointer and
the next pointer of the previous element in the heap (lines 14-17).
Implementation of Inserts with Active Puts We now ac-
celerate inserts with AA. We present the multi-threaded code in
Listing 3. The inserting process calls insert (lines 1-3). The
PTEs of the hash table data are marked with W=0, WL=1, WLD=1;
thus, the metadata and the data from the put is placed in the
access log. The CPU then (after being interrupted or by polling
the memory/scratchpad) executes insert_handler to insert
the elements into the local volume (lines 5-10). Here, we assume
that a thread owns one access log and that the size of the ac-
cess log is divisible by sizeof(int). Elements are inserted with
local_insert, a function similar to insert from Listing 2.
The difference is that each call is local (consequently, we skip the
lv.owner argument).
Synchronization AA handlers are executed by the local CPU;
thus, local_insert requires no synchronization with remote
processes. In our code we use local atomics; however, other simple
local synchronization mechanisms (e.g., locks or hardware transactional
memory) may also be utilized.
Consistency The proposed DHT is loosely consistent. For
implementing any other consistency (e.g., sequential consistency)
one can use either active flushes or enforce the required consis-
tency using replies from within the handler.
Lookups Contrary to inserts, lookups do not generate hash
collisions that entail multiple memory accesses. Thus, we propose
to implement a lookup as a single traditional RMA get, similarly
to the design in FaRM by Dragojevic et al. [22]. For this, we mark
the PTEs associated with the hashtable data as R=1, RL=0, RLD=0.
Here, we assume that DMA is cache coherent (true on, e.g., Intel
x86 [22]) and that RMA gets are aligned. As the DHT elements are
word-size integers, a get is atomic with respect to any concurrent
accesses from Listing 3. Consistency with other lookups and with
inserts can be achieved with RMA and active flushes, respectively.
More complicated schemes that fetch the data from the overflow
heap are possible; the details are outside the scope of the paper.
Deletes A simple protocol built over active puts performs
deletes. We use a designated page P marked as W=0, WL=1, WLD=1.
The delete implementation issues an active put. This put is tar-
geted at P and it contains a key of the element(s) to be deleted. The
IOMMU moves the keys to a designated access log and a specified
handler uses them to remove the elements from the local volume.
4.2 Collecting Statistics on Memory Accesses
Automated and efficient systems for gathering various statistics
are an important research target. Recent work [28] presents an ac-
tive key-value store “Comet”, where automated gathering of statis-
tics is one of the key functionalities. Such systems are usually im-
plemented in the application layer, which significantly limits their
performance. Architectures based on traditional RMA suffer from
issues similar to the ones described in § 4.1.
AA enables hardware-based gathering of statistics. For example,
to count the number of puts or gets to a data structure, one has to
appropriately set the control bits in the PTEs that point to the memory
region where this structure is located: W=1, WL=1, WLD=0 (for
puts), and R=1, RL=1, RLD=0 (for gets). Thus, the IOMMU ignores
the data and logs only metadata that is later processed in a handler
to generate statistics; the processing can be enforced with active
flushes. This mechanism would also improve the performance of
cache eviction schemes in memcached applications.
AA enables gathering separate statistics for each page of data.
Yet, sometimes a finer granularity could be required to count ac-
cesses to elements of smaller sizes. In such cases one could place
the respective elements in separate pages.
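A handler that turns logged metadata into per-page counters could look like the sketch below. The LogEntry layout (target address plus access kind), the 4 KiB page size, and the NPAGES constant are assumptions made for illustration; the actual metadata format of an access log entry is defined by the AA design and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */
#define NPAGES 4        /* hypothetical number of tracked pages */

typedef enum { ACC_PUT, ACC_GET } AccessKind;

/* Hypothetical metadata-only log entry: target address and access kind. */
typedef struct { uint64_t addr; AccessKind kind; } LogEntry;

typedef struct { uint64_t puts[NPAGES]; uint64_t gets[NPAGES]; } PageStats;

/* Update per-page counters from n log entries; entries outside the tracked
   region are skipped. Returns the number of entries that were counted. */
static size_t count_accesses(const LogEntry *log, size_t n,
                             uint64_t region_base, PageStats *st) {
    size_t counted = 0;
    assert(st != NULL);
    for (size_t i = 0; i < n; i++) {
        if (log[i].addr < region_base) continue;
        uint64_t page = (log[i].addr - region_base) >> PAGE_SHIFT;
        if (page >= NPAGES) continue;
        if (log[i].kind == ACC_PUT) st->puts[page]++;
        else st->gets[page]++;
        counted++;
    }
    return counted;
}
```

Because the counters are indexed by page, two elements that share a page are indistinguishable, which is why finer granularity requires placing them in separate pages.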
4.3 Enabling Incremental Checkpointing
Recent predictions about mean time between failures (MTBF) of
large-scale systems indicate failures every few hours [11]. Fault
tolerance can be achieved with various mechanisms. In check-
point/restart [11] all processes synchronize and record their state
to memories or disks. Traditional checkpointing schemes record
1 /* Volume is a structure that contains the fields:
2 owner : the id of the volume owner; vol_size: volume size,
3 elems []: the table + the overflow heap; each cell contains two
subfields: elem (the actual value) and ptr (the pointer to
the next element),
4 next_free_cell: a ptr to the next free cell in the heap,
5 last_ptr []: pointers to the most recent elements */
6
7 void insert(int elem, Volume v) {//put elem into volume v
8 int pos = hash(elem); //get the position of elem in v
9 if(cas(elem,∅,v.elems[pos].elem,v.owner) != ∅) {
10 int free_cell = fao(SUM,1,v.next_free_cell,v.owner);
11 if(free_cell>=v.vol_size) {/*an overflow - resize*/}
12 rma_put(elem,v.elems[free_cell].elem,v.owner);
13 rma_flush(v.owner);
14 int prev_ptr=fao(REPLACE,free_cell,v.last_ptr[pos], v.owner);
15 if(cas(free_cell,∅,v.elems[pos].ptr,v.owner) != ∅) {
16 rma_put(free_cell,v.elems[prev_ptr].ptr,v.owner);
17 rma_flush(v.owner); } } }
Listing 2: Insert in the traditional RMA-based DHT
1 void insert(int elem, Volume v) {
2 put(elem, v.elems[hash(elem)].elem, v.owner);
3 }
4
5 void insert_handler(Access_log log) {
6 while(log.tail != log.head) {
7 local_insert(*log.tail); log.tail += sizeof(int);
8 if(log.tail == log.base + log.size) {
9 log.tail = log.base;
10 } } }
11
12 void local_insert(int elem) {//lv is the local DHT volume
13 int pos = hash(elem); //get the position of elem in lv
14 if(cas(elem,∅,lv.elems[pos].elem) != ∅) {
15 int free_cell = fao(SUM,1,lv.next_free_cell);
16 if(free_cell>=lv.vol_size) {/*an overflow - resize*/}
17 lv.elems[free_cell].elem = elem;
18 int prev_ptr=fao(REPLACE,free_cell,lv.last_ptr[pos]);
19 if(cas(free_cell,∅,lv.elems[pos].ptr) != ∅) {
20 lv.elems[prev_ptr].ptr = free_cell; } } }
Listing 3: Insert in the AA-based DHT
the same amount of data during every checkpoint. However, of-
ten only a small subset of the application state changes between
two consecutive checkpoints [58]. Thus, saving all the data wastes
time, energy, and bandwidth. In incremental checkpointing only the
modified data is recorded. A popular scheme [58] tracks data mod-
ifications at the page granularity and uses the dirty bit (DB) to de-
tect if a given page requires checkpointing. This scheme cannot be
directly applied to RMA as memory accesses performed by remote
processes are not tracked by the MMU paging hierarchy [50].
AA enables incremental checkpointing in RMA codes. Bits W=1,
WL=1, WLD=0 (set in the PTEs of the data that requires checkpointing) enable
tracking the modified pages. To take a checkpoint, all the processes
synchronize and process the access logs to find and record the modified
data. Our incremental checkpointing mechanism for tracking
data modifications is orthogonal to the details of synchronization
and recording; one could use any available scheme [11].
Most often both remote and local accesses modify the memory.
The latter can be tracked by the MMU and any existing method
(e.g., the DB scheme [58]) can be used. While checkpointing, every
process parses both the access log and the MMU page table to track
both types of memory writes.
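The combination of both sources of modifications can be sketched as follows. This is a simplified model under stated assumptions: remote writes are given as offsets extracted from the access log, local writes are given as a per-page flag array standing in for the MMU dirty bits, and pages are 4 KiB. The names mark_remote and checkpoint and the NPAGES constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u   /* assumes 4 KiB pages */
#define NPAGES    4u      /* hypothetical size of the checkpointed region */

/* Mark pages hit by remote puts; offsets are relative to the region start. */
static void mark_remote(const uint64_t *put_offsets, size_t n,
                        uint8_t *remote_dirty) {
    for (size_t i = 0; i < n; i++) {
        uint64_t page = put_offsets[i] / PAGE_SIZE;
        if (page < NPAGES) remote_dirty[page] = 1;
    }
}

/* Copy every page that is dirty in either source into the checkpoint,
   clear both flags, and return the number of pages copied. */
static size_t checkpoint(const uint8_t *mem, uint8_t *ckpt,
                         uint8_t *remote_dirty, uint8_t *local_dirty) {
    size_t copied = 0;
    assert(mem != NULL && ckpt != NULL);
    for (size_t p = 0; p < NPAGES; p++) {
        if (!remote_dirty[p] && !local_dirty[p]) continue;
        memcpy(ckpt + p * PAGE_SIZE, mem + p * PAGE_SIZE, PAGE_SIZE);
        remote_dirty[p] = 0;
        local_dirty[p] = 0;
        copied++;
    }
    return copied;
}
```

Only pages flagged by at least one of the two trackers are copied, so the cost of a checkpoint scales with the number of modified pages rather than with the size of the region.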
4.4 Reducing the Overheads of Logging
Another fault tolerance mechanism for RMA is uncoordinated
checkpointing combined with logging of puts and gets where the
crashed processes repeat their work and replay puts and gets that
modified their state before the failure; these puts and gets are
logged during the application runtime [11]. While logging puts
is simple and does not impact performance, logging gets wastes
network bandwidth because it requires transferring additional
data [11]. We now describe this issue and solve it with AA.
Logging Gets in Traditional RMA A get issued by process A
and targeted at another process B fetches data from the memory of
B and it impacts the state of A. Thus, if A fails and begins recovery,
it has to replay this get. Still, A cannot log this get locally as the
contents of its memory are lost after the crash (see Figure 5, part 1).
Thus, B can log the get [11].
The core problem in RMA is that B knows nothing of gets issued
by A, and cannot actively perform any logging. It means that A has
to wait for the data to be fetched from B and only then can it send
this data back to B. This naive scheme comes from the fundamental
rules of one-sided RMA communication: B is completely oblivious
to any remote accesses to its memory [11] (cf. Fig. 5, part 2).
Figure 5: Logging and replaying issued operations in RMA and AA.
Improving the Performance with Active Gets In AA, the
IOMMU can intercept incoming gets and log the accessed data lo-
cally. First, we set up PTEs in the IOMMU page table that point
to the part of memory that is targeted by gets. We set the control
bits (R=1,RL=1,RLD=1) in these PTEs to make each get touch-
ing this page active. Every such get triggers the IOMMU to copy
the accessed data into the access log, eliminating the need for the
source to send the same data back (see Figure 5, part 3). During the
recovery a crashed process fetches the logs and then uses them to
recover to the state before the failure. We omit further details of
this scheme as this is outside the scope of this paper; example pro-
tocols (e.g., for clearing the logs or replaying puts and gets preserv-
ing the RMA consistency order) can be found in the literature [11].
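The effect of logging gets at the target can be modeled with a short sketch. It is not the protocol from [11]: it assumes a single in-memory log at the target, integer payloads, and a fixed log capacity, and the names GetRecord, GetLog, active_get, and replay_gets are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 64   /* hypothetical capacity of the access log */

/* One logged get: who issued it, which cell it read, what it returned. */
typedef struct { int source; size_t index; int value; } GetRecord;
typedef struct { GetRecord rec[LOG_CAP]; size_t len; } GetLog;

/* Models an active get at the target: return the data to the source and
   copy it into the local log (the step performed by the IOMMU in AA).
   A real design would clear or flush the log before it fills up. */
static int active_get(const int *mem, size_t index, int source, GetLog *log) {
    int value = mem[index];
    assert(log->len < LOG_CAP);
    log->rec[log->len].source = source;
    log->rec[log->len].index = index;
    log->rec[log->len].value = value;
    log->len++;
    return value;
}

/* During recovery, hand back (in issue order) the values that the crashed
   process `source` originally received. Returns the number of values. */
static size_t replay_gets(const GetLog *log, int source, int *out, size_t max) {
    size_t n = 0;
    for (size_t i = 0; i < log->len && n < max; i++)
        if (log->rec[i].source == source) out[n++] = log->rec[i].value;
    return n;
}
```

The replay returns the values as they were at the time of each get, even if the target memory has been overwritten since, which is what the recovering process needs to reach its state before the failure.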
5 EVALUATION
To evaluate AA we first conduct cycle-accurate simulations to
cover the details of the interaction between the NIC, the IOMMU,
the CPU, and the memory system. Second, we perform simplified
large-scale simulations to illustrate how AA impacts the perfor-
mance of large-scale codes.
5.1 Microbenchmarks
We first perform cycle-accurate microbenchmarks that evaluate
the performance of data transfer between two machines connected
with an Ethernet link. We compare system configurations with-
out the IOMMU (no-iommu) and with the extended IOMMU pre-
sented in this paper (e-iommu). We use the gem5 cycle-accurate
[Figure 6 plots. (a) Performance of Netmap: bandwidth [Mb/s] vs. packet
size [B] for no-iommu and e-iommu. (b) Performance of Netperf: bandwidth
[Mb/s] vs. packet size [B] for no-iommu and e-iommu with TCP and UDP.
(c) Performance of the DHT: million inserts/second vs. collision rate for
AA and RMA. (d) Finding the best C: million inserts/second vs. number of
processes running handlers, for 64, 128, 256, and 512 processes.]
Figure 6: (§ 5.1, § 5.2.1) Microbenchmarks (Figures 6a-6c) and finding the optimum configuration for AM-Onload (Figure 6d).
full-system simulator [16] and a standard testbed that allows
for modeling two networked machines with in-order CPUs, Intel
8254x 1GbE NICs with Intel e1000 driver, a full operating system,
TCP/IP stack, and PCIe buses. The utilized OS is Ubuntu 11.04 with
precompiled 3.3.0-rc3 Linux kernel that supports 2047MB mem-
ory. We modify the simulated system by splitting the PCIe bus
and inserting an IOMMU in between the two parts. We model
the IOMMU as a bridge with an attached PTE cache (IOTLB). The
bridge provides buffering and a fixed delay for passing packets; we
set the delay to be 70ns for each additional memory access. We also
use a 5ns delay for simulating IOMMU internal processing. We base
these values on the L1/memory latencies of the simulated system.
We first test data transfer with PktGen [43] (a high-speed packet
generator) enhanced with netmap [51] (a framework for fast packet
I/O). Second, we evaluate the performance of a TCP and a UDP
stream with netperf, a popular benchmark for measuring network
performance. We show the results in Figures 6a-6b. The IOMMU
presence only marginally affects the data transfer bandwidth (the
difference between the no-iommu and e-iommu is 1-5% with
no-iommu, as expected, being marginally faster).
We also simulate a hashtable workload of one process insert-
ing new elements at full link bandwidth into the memory of the
remote machine; see Figure 6c. Here, we compare AA with a tra-
ditional RMA design of the DHT. As the collision rate increases,
the performance of both designs drops due to a higher number of
memory accesses. Still, AA is ≈3 times more performant than RMA.
5.2 Evaluation of Large-Scale Applications
The second performance-related question is how the AA seman-
tics, implemented using the proposed Active Access and the
IOMMU design, impacts the performance of large-scale codes. To
be able to run large-scale benchmarks on a real supercomputer,
we simplify the simulation infrastructure. We simulate one-sided
RMA accesses with MPI point-to-point messages. We replace one-
directional RMA puts with a single message and two-directional
RMA calls (gets and atomics) with a pair of messages exchanged
by the source and target, analogously to packets in hardware. We
then emulate extended IOMMUs by appropriately stalling message
handlers. As the IOMMU performs data replication and redirection
bypassing the CPU, there are four possible sources of such over-
heads: interrupts, memory accesses due to the logging, IOMMU
page table lookups, and accesses to the scratchpads on the CPU.
First, we determine the interrupt and memory access latencies
on our system to be 3µs and ≈70ns, respectively. Second, we simulate
the IOTLB and page table lookups varying several parameters
(PTE size, associativity, eviction policy). Finally, we assume that the
cost of an access to the scratchpad to notify a polling hyperthread is
equal to the cost of an L3 access, which we estimate at ≈15ns.
All the experiments are executed on the CSCS Monte Rosa
Cray XE6 system. Each node contains four 8-core AMD processors
(Opteron 6276 Interlagos 2.3 GHz) and is connected to a 3D-Torus
Gemini network. We use 32 processes/node and the GNU Environ-
ment 4.1.46 for compiling.
We compare the following communication schemes:
AA-Ints, AA-Poll, AA-SP: AA based on the IOMMU communicating
with the CPU using interrupts, polling the main memory, and
accessing the scratchpad, respectively.
RMA: traditional RMA representing RDMA architectures.
AM: an AM scheme in which processes poll at regular intervals
to check for messages. Note that this protocol is equivalent to
traditional message passing.
AM-Exp: an AM variant based on exponential backoff to reduce
polling overhead. If there is no incoming message, we double the
interval after which a process will poll.
AM-Onload: an AM scheme where several cores are only dedi-
cated to running AM handlers and constantly poll on flags that
indicate whether new AMs have to be processed.
AM-Ints: an AM mechanism based on interrupts generated by the
NIC that signal to the CPU that it has to run the handler.
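The polling policy of AM-Exp can be sketched as a small state update. The doubling on an empty poll follows the description above; resetting to a minimum interval when a message is found and capping the interval at a maximum are assumptions added to make the sketch complete, and the Backoff type and backoff_next function are hypothetical names.

```c
#include <assert.h>

typedef struct {
    unsigned min_interval;  /* interval used right after a message arrives */
    unsigned max_interval;  /* cap on the interval (an assumption) */
    unsigned interval;      /* current polling interval */
} Backoff;

/* Compute the interval to wait before the next poll.
   message_found is nonzero if the last poll returned a message. */
static unsigned backoff_next(Backoff *b, int message_found) {
    assert(b->min_interval > 0 && b->min_interval <= b->max_interval);
    if (message_found)
        b->interval = b->min_interval;      /* assumed reset policy */
    else if (b->interval <= b->max_interval / 2)
        b->interval *= 2;                   /* no message: double */
    else
        b->interval = b->max_interval;      /* assumed cap */
    return b->interval;
}
```

Starting from an interval of 1 with a cap of 8, four consecutive empty polls yield intervals 2, 4, 8, and 8; a poll that finds a message brings the interval back to 1.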
5.2.1 Distributed Hashtable. We implement eight hashtable vari-
ants using the above schemes. Processes insert random elements
with random keys (deletes give similar performance and are
skipped). Each DHT volume can contain 2^21 elements. We vary different
parameters to cover a broad spectrum of possible scenarios.
First, we study the scalability by changing the number of inserting
processes P. Second, we evaluate benchmarks with different
numbers of hash collisions (Rcols, the ratio between the number
of hash collisions and the total insert count). Third, we simulate
different applications by varying computation ratios (Rcomp, the
ratio between the time spent on local computation and the total
experiment runtime). We also vary the IOTLB parameters: IOTLB
size, associativity, and the eviction policy. Finally, we test two vari-
ants of AA-Ints and AM-Ints in which an interrupt is issued
[Figure 7 plots. (a) DHT, Rcols ≈5%: million inserts/second vs. processes
for AA-SP, AA-Poll, AM-Ints, AA-Ints, AM-Onload, AM-Exp, AM, and RMA.
(b) DHT, Rcols ≈25%: same axes and schemes. (c) DHT, IOTLB analysis:
million inserts/second vs. IOTLB entries for the IOTLB designs lru_af,
rnd_af, lru_a4, lru_a2, rnd_a4, lru_a1, rnd_a2, and rnd_a1. (d) Access
Counter results: million accesses/second vs. processes.]
Figure 7: (§ 5.2.1, § 5.2.2) The performance of the DHT (Figures 7a, 7b, 7c) and the access counter (Figure 7d). We use 32 processes/node.
every 10 and 100 inserts (performance differences were negligible
(<5%); we only report numbers for the former).
AM-Onload depends on the number of cores (C) per node that
are dedicated to processing AM requests. Thus, for a fair compari-
son, we run AM-Onload for every C between 1 and 31 to find the
most advantageous configuration for every experiment. Figure 6d
shows that C = 11 delivers maximum performance. If C < 11 the
cores become congested and the performance decreases. C > 11
limits performance as receiving cores become underutilized.
Varying P and Rcols Figure 7a shows the results for Rcols = 5%
and Figure 7b for Rcols = 25%. Here, both AA-SP and AA-Poll
outperform all other schemes by a factor of ≈2. As expected, AA-SP is
slightly (≈1%) more performant than AA-Poll. AA-Ints is com-
parable to AM-Ints; both mechanisms suffer from interrupt la-
tency overheads. The reasons for performance differences in the
remaining schemes are as follows: in AM-Exp and AM the comput-
ing processes have to poll on the receive buffer and, upon active
message arrival, extract the payload. In AA this is managed by the
IOMMU and the computing processes only insert the elements into
the local hashtable. AM-Onload devotes fewer processes to com-
pute and thus, even in its best configuration, cannot outperform
AA. RMA issues costly atomics [53] for every insert and 6 more
remote operations for every collision, degrading performance.
Varying Rcomp Increasing Rcomp from 0% to 95% did not sig-
nificantly influence the performance patterns between evaluated
schemes. The only noticeable effect was that the differences be-
tween the results of respective schemes were smaller (which is
expected as increasing Rcomp proportionally reduces the amount of
communication per time unit).
Varying the IOTLB Parameters We now analyze the influ-
ence of various IOTLB parameters on the performance of DHT; the
results are presented in Figure 7c. The name of each plot encodes
the used eviction policy (lru: least recently used, rnd: random)
and associativity (a1: direct-mapped, a2: 2-way, a4: 4-way, af:
fully associative). For plot clarity we only analyze AA-Poll; both
AA-SP and AA-Ints follow similar performance patterns. For
a given associativity, lru is always better than rnd as it entails
fewer IOTLB misses. Increasing associativity and IOTLB size im-
proves the performance, for example, using lru_af instead of
lru_a4 allows for an up to 16% higher insert rate.
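The interplay between associativity and miss count can be reproduced with a minimal IOTLB model. The sketch below covers only the lru policy (rnd is omitted), indexes sets by page number modulo the number of sets, and treats fully associative as the case where the number of ways equals the number of entries. The Iotlb type, its functions, and MAX_ENTRIES are hypothetical and do not correspond to the simulator used in the evaluation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 1024

typedef struct {
    size_t   entries;             /* total number of IOTLB entries */
    size_t   ways;                /* associativity; ways == entries: fully associative */
    uint64_t tag[MAX_ENTRIES];    /* cached page numbers */
    uint64_t stamp[MAX_ENTRIES];  /* time of last use; 0 marks an invalid entry */
    uint64_t clock;               /* logical time, incremented per access */
    uint64_t misses;              /* number of IOTLB misses so far */
} Iotlb;

static void iotlb_init(Iotlb *t, size_t entries, size_t ways) {
    assert(entries > 0 && entries <= MAX_ENTRIES);
    assert(ways > 0 && entries % ways == 0);
    memset(t, 0, sizeof *t);
    t->entries = entries;
    t->ways = ways;
}

/* Look up a page; returns 1 on a hit and 0 on a miss. On a miss the
   least recently used (or an invalid) way of the set is replaced. */
static int iotlb_access(Iotlb *t, uint64_t page) {
    size_t sets = t->entries / t->ways;
    size_t base = (size_t)(page % sets) * t->ways;
    size_t victim = base;
    t->clock++;
    for (size_t w = base; w < base + t->ways; w++) {
        if (t->stamp[w] != 0 && t->tag[w] == page) {
            t->stamp[w] = t->clock;   /* refresh recency on a hit */
            return 1;
        }
        if (t->stamp[w] < t->stamp[victim]) victim = w;
    }
    t->tag[victim] = page;
    t->stamp[victim] = t->clock;
    t->misses++;
    return 0;
}
```

For the page trace 0, 2, 0, 2 and two entries, a direct-mapped configuration maps both pages to the same set and misses on every access, while the fully associative one misses only on the first two accesses.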
5.2.2 Access Counter. We now evaluate a simple tool that counts
accesses to an arbitrary data structure. We compare AA-Poll
(counting done by the IOMMU), RMA (increasing counters with
remote atomics), and two additional designs: an approach based
on the “active key-value” store [28] (A-KV), and a scheme where
counting is done at the source and the final sums are computed
with the Allreduce collective operation [41] (Allreduce). Finally,
we consider no counting (No-Cnt). The number of accesses per
second is presented in Figure 7d. AA-Poll outperforms A-KV
(overheads caused by the application-level design), RMA (issuing
costly atomics), and Allreduce (expensive synchronization).
5.2.3 Performant Logging of Gets. In the next step we evaluate
the performance of active gets by testing the implementation of
the mechanism that logs RMA gets. Here, processes issue remote
gets targeted at random processes. Every get transfers one 8-byte
integer value. In this benchmark we do not compare to AM-Exp,
AM-Onload, and AM-Ints because these schemes were not suit-
able for implementing this type of application. Instead, we com-
pare to No-FT: a variant with no logging (no fault-tolerance over-
head) that constitutes the best-case baseline. We illustrate the scal-
ability of AA in Figure 8a. AA achieves the best performance, close
to No-FT. In all the remaining protocols the data to be logged has
to be transferred back to a remote storage using a put (RMA) or
a send (AM), which incurs significant overheads. Varying the remaining
parameters (Rcomp, IOTLB parameters) yields the same
performance patterns as in the hashtable evaluation.
5.2.4 Fault-Tolerant Performant Sort. To evaluate the perfor-
mance of active gets we also implemented an RMA-based version
of the parallel sort Coral Benchmark [19] that utilizes gets instead
of messages, and made it fault-tolerant. We present the total time
required to communicate the results of sorting 1GB of data be-
tween processes in Figure 8b. Again, AA is close to No-FT (< 1%)
and reduces communication time by ≈50% and ≈80% in compari-
son to AM and RMA, respectively.
6 RELATED WORK AND DISCUSSION
Not all possible use cases for IOMMUs have been studied so
far. Ben-Yehuda et al. [9] discuss IOMMUs for virtualization in
Linux. Other works target efficient IOMMU emulation [4], reduc-
ing IOTLB miss rates [3], isolating Linux device drivers [18], and
mitigating IOMMUs’ overheads [8]. There are also vendors’ white
[Figure 8 plots. (a) Logging gets (§ 5.2.3): logged gets/second vs.
processes. (b) Sort Time (§ 5.2.4): latency [s] vs. processes.]
Figure 8: The performance of the AA-based fault tolerance scheme.
papers and specifications [1, 2, 5, 33, 34, 40, 48]. Our work goes
beyond these studies by proposing a new mechanism and a programming
model that combines AM with RMA and uses IOMMUs
for high-performance distributed data-centric computations.
There are several mechanisms that extend the memory sub-
system to improve the performance of various codes. Active
Pages [45] enable the memory to perform some simple operations
allowing the CPU to operate at peak computational bandwidth. Ac-
tive Memory [24] and in-memory computing [61] add simple com-
pute logic to the memory controller and the memory itself, respec-
tively. AA differs from these schemes as it targets distributed RMA
computations and its implementation only requires minor exten-
sions to the commodity IOMMUs.
Scale-Out NUMA [42] is an architecture, programming model,
and communication protocol that offers low latency of remote
memory accesses. It differs from AA as it does not provide the ac-
tive semantics for both puts and gets and it introduces significant
changes to the memory subsystem.
Active messages were introduced by von Eicken et al. [59].
Atomic Active Messages [12], a variant of AM, accelerate different
graph processing workloads [14] by running handlers atomically
in response to the incoming messages. The execution of the han-
dlers is atomic thanks to hardware transactional memory. While
AM and AAM focus on incoming messages, AA specifically targets
RMA puts and gets, and its design based on IOMMUs is able to pro-
cess single packets. AA could also possibly be used to accelerate
irregular distributed workloads such as graph databases [13], gen-
eral distributed graph processing [14, 30, 38, 39, 54], or deep learn-
ing [7]. Scalable programming for RMA was discussed in differ-
ent works [29, 52]. Some of AA’s functionalities could be achieved
using RMA/AM interfaces such as Portals [6], InfiniBand [56], or
GASNet [17]. However, Portals would introduce additional mem-
ory overheads per NIC because it requires descriptors for every
memory region. These overheads may grow prohibitively for multiple
NICs. In contrast, AA uses a single centralized IOMMU with
existing paging structures, ensuring no additional memory over-
heads. Furthermore, AA offers notifications on gets and it enables
various novel schemes such as incremental checkpointing for RMA
and performant logging of gets.
AA could also be implemented in the NIC [21, 25, 31]. Still, using
IOMMUs provides several advantages. Modern IOMMUs are inte-
grated with the memory controller/CPU and thus can be directly
connected with CPU scratchpads for a high-performance notification
mechanism (see § 3.5). This way, all I/O devices could take advantage
of this functionality (e.g., Ethernet RoCE NICs). Moreover,
we envision other future mechanisms that would enable even fur-
ther integration with the CPU. For example, the IOMMU could be
directly connected to the CPU instruction pipeline to directly feed
the CPU with handler code. Finally, one could implement AA us-
ing reconfigurable architectures [10, 15, 20, 27, 37]. We leave these
directions for future work.
Finally, AA’s potential can be further explored to provide hard-
ware virtualization of remote memories. There are three major
advantages of virtual memory: it enables an OS to swap memory
blocks to disk, it facilitates application development by
providing processes with separate address spaces, and it enables
useful features such as memory protection or dirty bits. Some
schemes (e.g., PGAS languages) emulate a part of these functionali-
ties for networked memories. Extending AA with features specific
to MMU PTEs (e.g., invalid bits) would enable a hardware-based
virtual global address space (V-GAS) with novel enhanced paging
capabilities and data-centric handlers running transparently to any
code accessing the memory; see Fig. 9.
Figure 9: V-GAS together with some example features.
The IOMMUs could also become the basis of V-GAS for Ether-
net. All the described IOMMU extensions are generic and do not
rely on any specific NIC features, leaving the possibility of moving
the V-GAS potential into commodity machines that do not provide
native RDMA support. For example, by utilizing Single Root I/O
Virtualization (SR-IOV), a standard support for hardware virtualization
combined with multiple receive and transmit rings, one can
utilize IOMMUs to safely divert traffic right into userspace.
7 CONCLUSION
RMA is becoming more popular for programming datacenters and
HPC computers. However, its traditional one-sided model of com-
munication may incur performance penalties in several types of
applications such as DHT.
To alleviate this issue we propose the Active Access scheme that
utilizes IOMMUs to provide hardware support for active semantics
in RMA. For example, our AA-based DHT implementation offers a
speedup of two over optimized AMs. The novel AA-based fault tol-
erance protocol enables performant logging of gets and adds neg-
ligible (1-5%) overheads to the application runtime. Furthermore,
AA enables new schemes such as incremental checkpointing in
RMA. Finally, our design bypasses the OS and enables more effec-
tive programming of datacenters and HPC centers.
AA enables a new programming model that combines the bene-
fits of one-sided communication and active messages. AA is data-
centric as it enables triggering handlers when certain data is ac-
cessed. Thus, it could be useful for future data processing and anal-
ysis schemes and protocols.
The proposed AA design, based on IOMMUs, shows the poten-
tial behind currently available off-the-shelf hardware for develop-
ing novel mechanisms. By moving the notification functionality
from the NIC to the IOMMU we adopt the existing IOMMU paging
structures and we eliminate the need for expensive memory descriptors
present in, e.g., Portals, thus reducing memory overheads.
The IOMMU-based design may enable even more performant noti-
fication mechanisms such as direct access from the IOMMU to the
CPU pipeline. Thus, AA may play an important role in designing efficient
codes and OS/runtime systems in large datacenters, HPC computers,
and highly parallel manycore environments which are becoming
commonplace even in commodity off-the-shelf computers.
ACKNOWLEDGEMENTS
We thank the CSCS team for granting access to the Monte Rosa machine
and for their excellent technical support. We thank Greg
Bronevetsky (LLNL) for inspiring comments and Ali Saidi for help
with the gem5 simulator. MB is supported by the 2013 Google Eu-
ropean Doctoral Fellowship in Parallel Computing.
REFERENCES
[1] AMD. Software Optimization Guide for the AMD64 Processors, 2005.
[2] AMD. AMD I/O Virtualization Technology (IOMMU) Spec., 2011.
[3] N. Amit, M. Ben-Yehuda, and B.-A. Yassour. IOMMU: strategies for mitigating
the IOTLB bottleneck. In Proc. of Intl. Conf. on Comp. Arch., ISCA’10, pages 256–
274, 2010.
[4] N. Amit et al. vIOMMU: efficient IOMMU emulation. In USENIX Ann. Tech. Conf.,
USENIXATC’11, pages 6–6, 2011.
[5] W. J. Armstrong et al. Advanced virtualization capabilities of POWER5 systems.
IBM J. Res. Dev., 49(4/5):523–532, 2005.
[6] B. W. Barrett et al. The Portals 4.0 network programming interface, 2012. Sandia
National Laboratories.
[7] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler. A mod-
ular benchmarking infrastructure for high-performance and reproducible deep
learning. arXiv preprint arXiv:1901.10183, 2019.
[8] M. Ben-Yehuda et al. The price of safety: Evaluating IOMMU performance. In
Ottawa Linux Symp. (OLS), pages 9–20, 2007.
[9] M. Ben-Yehuda et al. Utilizing IOMMUs for virtualization in Linux and Xen. In
Proc. of the Linux Symp., 2006.
[10] M. Besta, M. Fischer, T. Ben-Nun, J. De Fine Licht, and T. Hoefler. Substream-
centric maximum matchings on fpga. In Proceedings of the 2019 ACM/SIGDA In-
ternational Symposium on Field-Programmable Gate Arrays, pages 152–161. ACM,
2019.
[11] M. Besta and T. Hoefler. Fault Tolerance for Remote Memory Access Program-
ming Models. In Proc. of the 23rd Intl Symp. on High-perf. Par. and Dist. Comp.,
HPDC ’14, pages 37–48, 2014.
[12] M. Besta and T. Hoefler. Accelerating irregular computations with hardware
transactional memory and active messages. In Proceedings of the 24th Inter-
national Symposium on High-Performance Parallel and Distributed Computing,
pages 161–172. ACM, 2015.
[13] M. Besta, E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski, C. Barthels,
G. Alonso, and T. Hoefler. Demystifying graph databases: Analysis and taxon-
omy of data organization, system designs, and graph queries. arXiv preprint
arXiv:1910.09017, 2019.
[14] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. To push or to
pull: On reducing communication and synchronization in graph computations.
In Proceedings of the 26th International Symposium on High-Performance Parallel
and Distributed Computing, pages 93–104. ACM, 2017.
[15] M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler. Graph process-
ing on fpgas: Taxonomy, survey, challenges. arXiv preprint arXiv:1903.06697,
2019.
[16] N. Binkert et al. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7,
Aug. 2011.
[17] D. Bonachea. GASNet Spec., v1. Tech. Rep. UCB/CSD-02-1207, 2002.
[18] S. Boyd-Wickizer and N. Zeldovich. Tolerating malicious device drivers in Linux.
In USENIX Ann. Tech. Conf., USENIXATC’10, pages 9–9, 2010.
[19] Coral Collaboration. Coral Procurement Benchmarks. In Coral Vendor Meeting,
2013.
[20] J. de Fine Licht, M. Blott, and T. Hoefler. Designing scalable fpga architectures
using high-level synthesis. ACM SIGPLAN Notices, 53(1):403–404, 2018.
[21] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek,
M. Besta, L. Benini, D. Roweth, and T. Hoefler. Network-accelerated non-
contiguous memory transfers. arXiv preprint arXiv:1908.08590, 2019.
[22] A. Dragojević et al. FaRM: fast remote memory. In Proc. of the 11th USENIX
Symp. on Net. Syst. Des. and Impl. (NSDI 14). USENIX, 2014.
[23] H. Esmaeilzadeh et al. Dark silicon and the end of multicore scaling. In Proc. of
Intl. Symp. Comp. Arch., ISCA ’11, pages 365–376, 2011.
[24] Z. Fang et al. Active Memory Operations. In Proc. of the 21st Ann. Intl Conf. on
Supercomp., ICS ’07, pages 232–241, 2007.
[25] D. Firestone, A. Putnam, S. Mundkur, D. Chiou, A. Dabagh, M. Andrewartha,
H. Angepat, V. Bhanu, A. Caulfield, E. Chung, et al. Azure accelerated network-
ing: Smartnics in the public cloud. In 15th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 18), pages 51–66, 2018.
[26] B. Fitzpatrick. Distributed caching with memcached. Linux journal, 2004(124):5,
2004.
[27] S. Gao, A. G. Schmidt, and R. Sass. Hardware implementation of mpi_barrier
on an fpga cluster. In 2009 International Conference on Field Programmable Logic
and Applications, pages 12–17. IEEE, 2009.
[28] R. Geambasu et al. Comet: An Active Distributed Key-value Store. In USENIX
Conf. on Op. Sys. Des. and Impl., OSDI’10, pages 1–13, 2010.
[29] R. Gerstenberger, M. Besta, and T. Hoefler. Enabling Highly-scalable Remote
Memory Access Programming with MPI-3 One Sided. In Proc. of ACM/IEEE
Supercomputing, SC ’13, pages 53:1–53:12, 2013.
[30] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler.
Communication-avoiding parallel minimum cuts and connected components. In ACM SIGPLAN
Notices, volume 53, pages 219–232. ACM, 2018.
[31] T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, and R. Brightwell. sPIN:
High-performance streaming processing in the network. In Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, page 59. ACM, 2017.
[32] T. Hoefler et al. Remote Memory Access Programming in MPI-3. ACM Trans.
Par. Comp. (TOPC), 2015. accepted for publication on Dec. 4th.
[33] IBM. Logical Partition Security in the IBM eServer pSeries 690. 2002.
[34] Intel. Intel Virtualization Technology for Directed I/O (VT-d) Architecture Spec-
ification, September 2013.
[35] ISO Fortran Committee. Fortran 2008 Standard (ISO/IEC 1539-1:2010). 2010.
[36] Y. Kim, D. Broman, J. Cai, and A. Shrivastava. WCET-aware dynamic code
management on scratchpads for software-managed multicores. In IEEE Real-
Time and Emb. Tech. and App. Symp. (RTAS), 2014.
[37] J. d. F. Licht, M. Besta, S. Meierhans, and T. Hoefler. Transformations of
high-level synthesis codes for high-performance computing. arXiv preprint
arXiv:1805.08288, 2018.
[38] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel
graph processing. Parallel Processing Letters, 17(01):5–20, 2007.
[39] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and
G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the
2010 ACM SIGMOD International Conference on Management of data, pages 135–
146. ACM, 2010.
[40] R. Mijat and A. Nightingale. The ARM Architecture Virtualization Extensions
and the importance of System MMU for virtualized solutions and beyond, 2011.
ARM White Paper.
[41] MPI Forum. MPI: A Message-Passing Interface Standard. Ver. 3, 2012.
[42] S. Novakovic et al. Scale-out NUMA. In Intl. Conf. on Arch. Sup. for Prog. Lang.
and Op. Sys., ASPLOS ’14, pages 3–18, 2014.
[43] R. Olsson. PktGen the Linux packet generator. In Proc. of the Linux Symp., Ottawa,
Canada, volume 2, pages 11–24, 2005.
[44] Oracle. UltraSPARC Virtual Machine Spec. 2010.
[45] M. Oskin, F. T. Chong, and T. Sherwood. Active Pages: A Computation Model
for Intelligent Memory. In Proc. of the 25th Ann. Intl Symp. on Comp. Arch., ISCA
’98, pages 192–203, 1998.
[46] D. Patterson. The top 10 innovations in the new NVIDIA Fermi architecture,
and the top 3 next challenges. NVIDIA Whitepaper, 2009.
[47] PCI-SIG. PCI Express Base Spec. Rev. 3.0. 2010.
[48] PCI-SIG. PCI-SIG I/O Virtualization (IOV) Specifications, 2013.
[49] S. Pope and D. Riddoch. Introduction to OpenOnload, 2011. SolarFlare White
Paper.
[50] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia. A remote direct memory
access protocol specification, Oct 2007. RFC 5040.
[51] L. Rizzo. netmap: A novel framework for fast packet I/O. In USENIX Annual
Technical Conference, pages 101–112, 2012.
[52] P. Schmid, M. Besta, and T. Hoefler. High-performance distributed RMA locks.
In Proceedings of the 25th ACM International Symposium on High-Performance
Parallel and Distributed Computing, pages 19–30. ACM, 2016.
[53] H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of atomic operations
on modern architectures. In 2015 International Conference on Parallel Architec-
ture and Compilation (PACT), pages 445–456. IEEE, 2015.
[54] E. Solomonik, M. Besta, F. Vella, and T. Hoefler. Scaling betweenness centrality
using communication-efficient sparse matrix multiplication. In Proceedings of
the International Conference for High Performance Computing, Networking, Stor-
age and Analysis, page 47. ACM, 2017.
[55] A. Tate, A. Kamil, A. Dubey, A. Größlinger, B. Chamberlain, B. Goglin, C. Ed-
wards, C. J. Newburn, D. Padua, D. Unat, et al. Programming abstractions for
data locality. PADAL Workshop 2014, April 28–29, Swiss National Supercom-
puting Center . . . , 2014.
[56] The InfiniBand Trade Association. InfiniBand Architecture Spec. Vol. 1-2, Rel. 1.3.
InfiniBand Trade Association, 2004.
[57] UPC Consortium. UPC language spec., v1.2. Technical report, Lawrence Berke-
ley National Laboratory, 2005. LBNL-59208.
[58] M. Vasavada, F. Mueller, P. H. Hargrove, and E. Roman. Comparing different
approaches for incremental checkpointing: The showdown. In Linux’11: The
13th Annual Linux Symposium, pages 69–79, 2011.
[59] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages:
a mechanism for integrated communication and computation. In Proc. of Intl.
Symp. Comp. Arch., ISCA ’92, pages 256–266, 1992.
[60] J. Willcock et al. AM++: A Generalized Active Message Framework. In Intl. Conf.
on Par. Arch. and Comp. Tech., pages 401–410, 2010.
[61] Q. Zhu et al. Accelerating sparse matrix-matrix multiplication with 3D-stacked
logic-in-memory hardware. In High Perf. Ext. Comp. Conf. (HPEC), pages 1–6.
IEEE, 2013.
