31,939 research outputs found
The Cost of Address Translation
Modern computers are not random access machines (RAMs). They have a memory
hierarchy, multiple cores, and virtual memory. In this paper, we address the
computational cost of address translation in virtual memory. Starting point for
our work is the observation that the analysis of some simple algorithms (random
scan of an array, binary search, heapsort) in either the RAM model or the EM
model (external memory model) does not correctly predict growth rates of actual
running times. We propose the VAT model (virtual address translation) to
account for the cost of address translations and analyze the algorithms
mentioned above and others in the model. The predictions agree with the
measurements. We also analyze the VAT-cost of cache-oblivious algorithms.Comment: A extended abstract of this paper was published in the proceedings of
ALENEX13, New Orleans, US
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure
Glider: A GPU Library Driver for Improved System Security
Legacy device drivers implement both device resource management and
isolation. This results in a large code base with a wide high-level interface
making the driver vulnerable to security attacks. This is particularly
problematic for increasingly popular accelerators like GPUs that have large,
complex drivers. We solve this problem with library drivers, a new driver
architecture. A library driver implements resource management as an untrusted
library in the application process address space, and implements isolation as a
kernel module that is smaller and has a narrower lower-level interface (i.e.,
closer to hardware) than a legacy driver. We articulate a set of device and
platform hardware properties that are required to retrofit a legacy driver into
a library driver. To demonstrate the feasibility and superiority of library
drivers, we present Glider, a library driver implementation for two GPUs of
popular brands, Radeon and Intel. Glider reduces the TCB size and attack
surface by about 35% and 84% respectively for a Radeon HD 6450 GPU and by about
38% and 90% respectively for an Intel Ivy Bridge GPU. Moreover, it incurs no
performance cost. Indeed, Glider outperforms a legacy driver for applications
requiring intensive interactions with the device driver, such as applications
using the OpenGL immediate mode API
The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
Computers continue to diversify with respect to system designs, emerging
memory technologies, and application memory demands. Unfortunately, continually
adapting the conventional virtual memory framework to each possible system
configuration is challenging, and often results in performance loss or requires
non-trivial workarounds. To address these challenges, we propose a new virtual
memory framework, the Virtual Block Interface (VBI). We design VBI based on the
key idea that delegating memory management duties to hardware can reduce the
overheads and software complexity associated with virtual memory. VBI
introduces a set of variable-sized virtual blocks (VBs) to applications. Each
VB is a contiguous region of the globally-visible VBI address space, and an
application can allocate each semantically meaningful unit of information
(e.g., a data structure) in a separate VB. VBI decouples access protection from
memory allocation and address translation. While the OS controls which programs
have access to which VBs, dedicated hardware in the memory controller manages
the physical memory allocation and address translation of the VBs. This
approach enables several architectural optimizations to (1) efficiently and
flexibly cater to different and increasingly diverse system configurations, and
(2) eliminate key inefficiencies of conventional virtual memory. We demonstrate
the benefits of VBI with two important use cases: (1) reducing the overheads of
address translation (for both native execution and virtual machine
environments), as VBI reduces the number of translation requests and associated
memory accesses; and (2) two heterogeneous main memory architectures, where VBI
increases the effectiveness of managing fast memory regions. For both cases,
VBI significanttly improves performance over conventional virtual memory
On Making Emerging Trusted Execution Environments Accessible to Developers
New types of Trusted Execution Environment (TEE) architectures like TrustLite
and Intel Software Guard Extensions (SGX) are emerging. They bring new features
that can lead to innovative security and privacy solutions. But each new TEE
environment comes with its own set of interfaces and programming paradigms,
thus raising the barrier for entry for developers who want to make use of these
TEEs. In this paper, we motivate the need for realizing standard TEE interfaces
on such emerging TEE architectures and show that this exercise is not
straightforward. We report on our on-going work in mapping GlobalPlatform
standard interfaces to TrustLite and SGX.Comment: Author's version of article to appear in 8th Internation Conference
of Trust & Trustworthy Computing, TRUST 2015, Heraklion, Crete, Greece,
August 24-26, 201
Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries
An immutable multi-map is a many-to-many thread-friendly map data structure
with expected fast insert and lookup operations. This data structure is used
for applications processing graphs or many-to-many relations as applied in
static analysis of object-oriented systems. When processing such big data sets
the memory overhead of the data structure encoding itself is a memory usage
bottleneck. Motivated by reuse and type-safety, libraries for Java, Scala and
Clojure typically implement immutable multi-maps by nesting sets as the values
with the keys of a trie map. Like this, based on our measurements the expected
byte overhead for a sparse multi-map per stored entry adds up to around 65B,
which renders it unfeasible to compute with effectively on the JVM.
In this paper we propose a general framework for Hash-Array Mapped Tries on
the JVM which can store type-heterogeneous keys and values: a Heterogeneous
Hash-Array Mapped Trie (HHAMT). Among other applications, this allows for a
highly efficient multi-map encoding by (a) not reserving space for empty value
sets and (b) inlining the values of singleton sets while maintaining a (c)
type-safe API.
We detail the necessary encoding and optimizations to mitigate the overhead
of storing and retrieving heterogeneous data in a hash-trie. Furthermore, we
evaluate HHAMT specifically for the application to multi-maps, comparing them
to state-of-the-art encodings of multi-maps in Java, Scala and Clojure. We
isolate key differences using microbenchmarks and validate the resulting
conclusions on a real world case in static analysis. The new encoding brings
the per key-value storage overhead down to 30B: a 2x improvement. With
additional inlining of primitive values it reaches a 4x improvement
- …