6 research outputs found
Memory Management Support for Multi-Programmed Remote Direct Memory Access (RDMA) Systems
Current operating systems offer basic support for network interface controllers (NICs) supporting remote direct memory access (RDMA). Such support typically consists of a device driver responsible for configuring communication channels between the device and user-level processes but not involved in data transfer. Unlike standard NICs, RDMA-capable devices incorporate significant memory resources for address translation purposes. In a multi-programmed operating system (OS) environment, these memory resources must be efficiently shareable by multiple processes. For such sharing to occur in a fair manner, the OS and the device must cooperate to arbitrate access to NIC memory, similar to the way CPUs and OSes cooperate to arbitrate access to translation lookaside buffers (TLBs) or physical memory. A problem with this approach is that today’s RDMA NICs are not integrated into the functions provided by OS memory management systems. As a result, RDMA NIC hardware resources are often monopolized by a single application. In this paper, I propose two practical mechanisms to address this problem: (a) Use of RDMA only in kernel-resident I/O subsystems, transparent to user-level software; (b) An extended registration API and a kernel upcall mechanism delivering NIC TLB entry replacement notifications to user-level libraries. Both options are designed to re-instate the multiprogramming principles that are violated in early commercial RDMA systems
Recommended from our members
Making the Most out of Direct-Access Network Attached Storage
The performance of high-speed network-attached storage applications is often limited by end-system overhead,
caused primarily by memory copying and network protocol processing. In this paper, we examine alternative strategies for reducing overhead in such systems.
We consider optimizations to remote procedure call (RPC)-based data transfer using either remote direct memory access (RDMA) or network interface support
for pre-posting of application receive buffers. We demonstrate that both mechanisms enable file access throughput that saturates a 2Gb/s network link when
performing large I/Os on relatively slow, commodity PCs. However, for multi-client workloads dominated by small I/Os, throughput is limited by the per-I/O overhead of processing RPCs in the server. For such workloads, we propose the use of a new network I/O mechanism, Optimistic RDMA (ORDMA). ORDMA is an alternative to RPC that aims to improve server throughput and response time for small I/Os. We measured performance improvements of up to 32% in server throughput and 36% in response time with use of ORDMA in our prototype.Engineering and Applied Science
MP-LOCKs: Replacing hardware synchronization primitives with message passing
Journal ArticleShared memory programs guarantee the correctness of concurrent accesses to shared data using interprocessor synchronization operations. The most common synchronization operators are locks, which are traditionally implemented in user-level libraries via a mix of shared memory accesses and hardware synchronization primitives like test-and-set. In this paper, we argue that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware. We propose three message passing lock (MP-LOCK) algorithms (centralized, distributed, and reactive) and provide guidelines for implementing them efficiently. MP-LOCKs redice tje design complexity and runtime occupancy of DSM controllers and can exploit software's inherent flexibility to adapt to differing applications lock access patterns. We compared the performance of MP-LOCKs with two common shared memory lock algorithms: test-and-set and MCS locks and found that MP-LOCKs scale better. For machines with 16 to 32 nides, applications using MP-LOCKs ran up to 186% faster than the same applications with shared memory locks. For small systems (up to 8 nodes), MP-LOCK performance lags shared memory lock performance due to the higher software overhead. However, three of the MP-LOCK applications slow down by no more than 18%, while the other two slowed by no more than 180%. Given these results, we conclude that locks based on message passing should be considered as a replacement for hardware locks in future scalable multiprocessors that supports efficient message passing mechanisms. In addition, it is possible to implement efficient software synchronization primitives in clusters of workstations by using the guidelines we proposed
An Implementation Of The Hamlyn Sender-Managed Interface Architecture
Introduction Processors are rapidly getting faster, and message-passing multicomputer interconnections are doing the same, thanks to recent developments in Gb/s links and lowlatency packet switches. But the cost of passing messages between applications also includes the overhead of crossing interfaces between the operating system (OS), a device driver, and the hardware, which can be orders of magnitude more than the cost of moving a message's bits across the wires. Hamlyn is an architecture for processor-interconnection interfaces that addresses this difficulty. It achieves both low latency and high bandwidth, isolates applications from each other's mistakes, and supplies a rich set of message-delivery semantics. It does so by exploiting several techniques: . Sender-based memory management. Senders, not receivers, choose the destination memory address at which messages are deposited. This means that messages are sent only when the sender knows t
Efficient hardware for low latency applications
The design and development of application specific hardware structures has a
high degree of complexity. Logic resources are nowadays often not the limit
anymore, but the development time.
The first part presents a generator which allows defining control and status
structures for hardware designs using an abstract high level language.
A novel method to inform host systems very efficiently about changes in the
register files is presented in the second part. It makes use of a microcode
programmable hardware unit.
In the third part a fully pipelined address translation mechanism for remote
memory access in HPC interconnection networks is presented, which features a new
concept to resolve dependency problems.
The last part of this thesis addresses the problem of sending TCP messages for a
low latency trading application using a hybrid TCP stack implementation that
consists of hardware and software components. Furthermore, a simulation
environment for the TCP stack is presented
Integrated shared-memory and message-passing communication in the Alewife multiprocessor
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 237-246) and index.by John David Kubiatowicz.Ph.D