14 research outputs found

    Network virtual memory

    No full text
    User-mode access, zero-copy transfer, and sender-managed communication have emerged as essential techniques for improving communication performance in workstation and PC clusters. The goal of these techniques is to provide application-level DMA to remote memory. Achieving this goal is difficult, however, because the network interface accesses physical rather than virtual memory. As a result, previous systems have confined source and destination data to pages in pinned physical memory. Unfortunately, this approach increases application complexity and reduces memory-management effectiveness. This thesis describes the design and implementation of NetVM, a network interface that supports user-mode access, zero-copy transfer, and sender-managed communication without pinning source or destination memory. To do this, the network interface maintains a shadow page table, which the host operating system updates whenever it maps or unmaps a page in host memory. The network interface uses this table to briefly lock and translate the virtual address of a page when it accesses that page for DMA transfer. The operating system is prevented from replacing a page during the short interval in which the network interface holds it locked. If a destination page is not resident in memory, the network interface redirects the data to an intermediate system buffer, which the operating system uses to complete the transfer with a single host-to-host memory copy after paging in the required page. A credit-based flow-control scheme prevents the system buffer from overflowing. Application-level DMA transfers only data. To support control transfers, NetVM implements a counter-based notification mechanism that lets applications issue and detect notifications. The sending application increments an event counter by specifying its identifier in an RDMA write operation.
The receiving application detects this event by busy waiting, block waiting, or triggering a user-defined handler whenever the notifying write completes. This range of detection mechanisms allows the application to choose the appropriate tradeoff between signaling latency and processor overhead. NetVM enforces ordered notifications over an out-of-order delivery network by using a sequence window. NetVM also supports efficient mutual-exclusion, wait-queue, and semaphore synchronization. It augments the network interface with low-overhead atomic operation primitives to provide scalable, MCS-lock-inspired high-level synchronization for applications. As a result, these operations complete with lower latency and fewer network transactions than traditional implementations. The NetVM prototype is implemented in firmware for the Myrinet LANai-9.2 and integrated with the FreeBSD 4.6 virtual memory system. NetVM's memory-management overhead is low: it adds less than 5.0% to write latency compared with a static pinning approach, and has a lower pinning cost than a dynamic pinning approach with up to a 94.5% hit rate in the pinned-page cache. Minimum write latency is 5.56 µs and maximum throughput is 155.46 MB/s, which is 97.2% of the link bandwidth. Transferring control through notification adds between 2.96 µs and 17.49 µs to the write operation, depending on the detection mechanism used. Compared with standard low-level atomic operations, NetVM adds at most 18.2% and 12.6% to application latencies for high-level wait-queue and counting-semaphore operations respectively.
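The shadow-page-table mechanism described in the abstract can be sketched in a few lines. This is an illustrative model only, not the actual LANai firmware; all class, function, and field names here are assumptions made for the sketch. It shows the two receive paths: a zero-copy write to a resident page that is locked only for the duration of the transfer, and a redirect to the intermediate system buffer when the destination page is not resident.

```python
# Hypothetical model of NetVM's receive path: the network interface (NI)
# consults a shadow page table, mirrored from the host OS, to lock and
# translate a destination page; non-resident pages are redirected to a
# system buffer for a later single host-to-host copy.

PAGE_SIZE = 4096

class ShadowPageTable:
    def __init__(self):
        self.entries = {}          # virtual page number -> {pfn, locked}

    def map_page(self, vpn, pfn):  # host OS calls this when it maps a page
        self.entries[vpn] = {"pfn": pfn, "locked": False}

    def unmap_page(self, vpn):     # host OS calls this when it unmaps a page
        self.entries.pop(vpn, None)

def deliver(spt, vaddr, data, system_buffer):
    """Return ('direct', pfn) for a zero-copy write, or ('buffered', idx)
    when the destination page is not resident."""
    vpn = vaddr // PAGE_SIZE
    entry = spt.entries.get(vpn)
    if entry is None:
        # Page not resident: stash the data in the system buffer; the OS
        # completes the transfer after paging in the destination page.
        system_buffer.append((vaddr, data))
        return ("buffered", len(system_buffer) - 1)
    entry["locked"] = True         # OS must not replace the page now
    try:
        return ("direct", entry["pfn"])  # NI DMAs into this frame
    finally:
        entry["locked"] = False    # lock held only for the transfer
```

In the real system the lock window is the brief interval of a single DMA, which is why no long-term pinning is needed; the credit-based flow control mentioned in the abstract bounds how many redirected transfers the system buffer must absorb.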

    Integrating virtual memory with user-level network communication

    No full text

    Using Embedded Network Processors to Implement Global Memory Management in a Workstation Cluster

    No full text
    Advances in network technology continue to improve the communication performance of workstation and PC clusters, making high-performance workstation-cluster computing increasingly viable. These hardware advances, however, are taxing traditional host-software network protocols to the breaking point. A modern gigabit network can swamp a host's I/O bus and processor, limiting communication performance and slowing computation unacceptably. Fortunately, the host-programmable network processors used by these networks present a potential solution. Offloading selected host processing to these embedded network processors lowers host overhead and improves latency. This paper examines the use of embedded network processors to improve the performance of workstation-cluster global memory management. We have implemented a revised version of the GMS global memory system that eliminates host overhead by as much as 29% on active nodes and improves page-fault latency by as much as 39%.

    Using Idle Workstations to Implement Predictive Prefetching

    No full text
    The benefits of Markov-based predictive prefetching have been largely overshadowed by the overhead required to produce high-quality predictions. While both theoretical and simulation results for prediction algorithms appear promising, substantial limitations exist in practice. This outcome can be partially attributed to the fact that practical implementations ultimately make compromises in order to reduce overhead. These compromises limit the level of algorithm complexity, the variety of access patterns, and the granularity of trace data the implementation supports. This paper describes the design and implementation of GMS-3P, an operating-system kernel extension that offloads prediction overhead to idle network nodes. GMS-3P builds on the GMS global memory system, which pages to and from remote workstation memory. In GMS-3P, the target node sends an on-line trace of an application's page faults to an idle node that is running a Markov-based prediction algorithm. The prediction node then uses GMS to prefetch pages to the target node from the memory of other workstations in the network. Our preliminary results show that predictive prefetching can reduce remote-memory page-fault time by 60% or more, and that by offloading prediction overhead to an idle node, GMS-3P can further reduce this latency by between 24% and 44%, depending on Markov-model order.
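The core of the prediction the abstract describes can be illustrated with a first-order Markov model over the page-fault trace. This is a minimal sketch under assumed names (`build_model`, `predict_prefetch`), not the GMS-3P implementation, which supports higher-order models: the idle node counts page-to-page fault transitions and prefetches the most likely successors of the current page.

```python
# First-order Markov predictor for page faults: count transitions in an
# on-line fault trace, then rank candidate next pages by frequency.
# Function names and structure are assumptions for illustration.
from collections import defaultdict, Counter

def build_model(fault_trace):
    """Count page -> next-page fault transitions from a trace."""
    model = defaultdict(Counter)
    for cur, nxt in zip(fault_trace, fault_trace[1:]):
        model[cur][nxt] += 1
    return model

def predict_prefetch(model, current_page, k=1):
    """Return the k most likely pages to fault next after current_page."""
    return [page for page, _ in model[current_page].most_common(k)]

# Example: after faulting on page 2, page 3 has followed most often,
# so it is the best prefetch candidate.
trace = [1, 2, 3, 1, 2, 4, 1, 2, 3]
model = build_model(trace)
```

A higher-order model conditions on the last n faults instead of one, which is the "Markov-model order" tradeoff the results refer to: better predictions at the cost of a larger model and more prediction overhead, which GMS-3P pushes onto the idle node.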

    Structuring operating system aspects

    No full text