4 research outputs found
GMEM: Generalized Memory Management for Peripheral Devices
This paper presents GMEM, generalized memory management, for peripheral
devices. GMEM provides OS support for centralized memory management of both CPU
and devices. GMEM provides a high-level interface that decouples MMU-specific
functions. Device drivers can thus attach themselves to a process's address
space and let the OS take charge of their memory management. This eliminates
the need for device drivers to "reinvent the wheel" and allows them to benefit
from general memory optimizations integrated by GMEM. Furthermore, GMEM
internally coordinates all attached devices within each virtual address space.
This drastically improves user-level programmability, since programmers can use
a single address space within their program, even when operating across the CPU
and multiple devices. A case study on device drivers demonstrates these
benefits. A GMEM-based IOMMU driver eliminates around seven hundred lines of
code and obtains 54% higher network receive throughput utilizing 32% less CPU
compared to the state-of-the-art. In addition, the GMEM-based driver of a
simulated GPU takes less than 70 lines of code, excluding its MMU functions.Comment: Finished before Weixi left Rice and submitted to ASPLOS'2
Understanding PCIe performance for end host networking
In recent years, spurred on by the development and availability of programmable NICs, end hosts have increasingly become the enforcement point for core network functions such as load balancing, congestion control, and application specific network offloads. However, implementing custom designs on programmable NICs is not easy: many potential bottlenecks can impact performance.
This paper focuses on the performance implication of PCIe, the de-facto I/O interconnect in contemporary servers, when interacting with the host architecture and device drivers. We present a theoretical model for PCIe and pcie-bench, an open-source suite, that allows developers to gain an accurate and deep understanding of the PCIe substrate. Using pcie-bench, we characterize the PCIe subsystem in modern servers. We highlight surprising differences in PCIe implementations, evaluate the undesirable impact of PCIe features such as IOMMUs, and show the practical limits for common network cards operating at 40Gb/s and beyond. Furthermore, through pcie-bench we gained insights which guided software and future hardware
architectures for both commercial and research oriented network cards
and DMA engines
Recommended from our members
Exploitation from Malicious PCI Express Peripherals
The thesis of this dissertation is that, despite widespread belief in the security community, systems are still vulnerable to attacks from malicious peripherals delivered over the PCI Express (PCIe) protocol.
Malicious peripherals can be plugged directly into internal PCIe slots, or connected via an external Thunderbolt connection.
To prove this thesis, we designed and built a new PCIe attack platform.
We discovered that a simple platform was insufficient to carry out complex attacks, so created the first PCIe attack platform that runs a full, conventional OS.
To allows us to conduct attacks against higher-level OS functionality built on PCIe, we made the attack platform emulate in detail the behaviour of an Intel 82574L Network Interface Controller (NIC), by using a device model extracted from the QEMU emulator.
We discovered a number of vulnerabilities in the PCIe protocol itself, and with the way that the defence mechanisms it provides are used by modern OSs.
The principal defence mechanism provided is the Input/Output Memory Management Unit (IOMMU).
The remaps the address space used by peripherals in 4KiB chunks, and can prevent access to areas of address space that a peripheral should not be able to access.
We found that, contrary to belief in the security community, the IOMMUs in modern systems were not designed to protect against attacks from malicious peripherals, but to allow virtual machines direct access to real hardware.
We discovered that use of the IOMMU is patchy even in modern operating systems.
Windows effectively does not use the IOMMU at all; macOS opens windows that are shared by all devices; Linux and FreeBSD map windows into host memory separately for each device, but only if poorly documented boot flags are used.
These OSs make no effort to ensure that only data that should be visible to the devices is in the mapped windows.
We created novel attacks that subverted control flow and read private data against systems running macOS, Linux and FreeBSD with the highest level of relevant protection enabled.
These represent the first use of the relevant exploits in each case.
In the final part of this thesis, we evaluate the suitability of a number of proposed general purpose and specific mitigations against DMA attacks, and make a number of recommendations about future directions in IOMMU software and hardware.EPSRC and ARM iCASE Awar
Recommended from our members
Capability Memory Protection for Embedded Systems
This dissertation explores the use of capability security hardware and software in real-time and latency-sensitive embedded systems, to address existing memory safety and task isolation problems as well as providing new means to design a secure and scalable real-time system.
In addition, this dissertation looks into how practical and high-performance temporal memory safety can be achieved under a capability architecture.
State-of-the-art memory protection schemes for embedded systems typically present limited and inflexible solutions to memory protection and isolation, and fail to scale as embedded devices become more capable and ubiquitous.
I investigate whether a capability architecture is able to provide new angles to address memory safety issues in an embedded scenario.
Previous CHERI capability research focuses on 64-bit architectures in UNIX operating systems, which does not translate to typical 32-bit embedded processors with low-latency and real-time requirements.
I propose and implement the CHERI CC-64 encoding and the CHERI-64 coprocessor to construct a feasible capability-enabled 32-bit CPU.
In addition, I implement a real-time kernel for embedded systems atop CHERI-64.
On this hardware and software platform, I focus on exploring scalable task isolation and fine-grained memory protection enabled by capabilities in a single flat physical address space, which are otherwise difficult or impossible to achieve via state-of-the-art approaches.
Later, I present the evaluation of the hardware implementation and the software run-time overhead and real-time performance.
Even with capability support, CHERI-64 as well as other CHERI processors still expose major attack surfaces through temporal vulnerabilities like use-after-free.
A naive approach that sweeps memory to invalidate stale capabilities is inefficient and incurs significant cycle overhead and DRAM traffic.
To make sweeping revocation feasible, I introduce new architectural mechanisms and micro-architectural optimisations to substantially reduce the cost of memory sweeping and capability revocation.
Another factor of the cost is the frequency of memory sweeping.
I explore tradeoffs of memory allocator designs that use quarantine buffers and shadow space tags to prevent frequent unnecessary sweeping.
The evaluation shows that the optimisations and new allocator designs reduce the cost of capability sweeping revocation by orders of magnitude, making it already practical for most applications to adopt temporal safety under CHERI.CSC Cambridge Scholarshi