
    Optimizing Network Virtualization in Xen

    BEST PAPER AWARD
    In this paper, we propose and evaluate three techniques for optimizing network performance in the Xen virtualized environment. Our techniques retain the basic Xen architecture of locating device drivers in a privileged `driver' domain with access to I/O devices, and providing network access to unprivileged `guest' domains through virtualized network interfaces. First, we redefine the virtual network interfaces of guest domains to incorporate high-level network offload features available in most modern network cards. We demonstrate the performance benefits of high-level offload functionality in the virtual interface, even when such functionality is not supported in the underlying physical interface. Second, we optimize the implementation of the data transfer path between guest and driver domains. The optimization avoids expensive data remapping operations on the transmit path, and replaces page remapping with data copying on the receive path. Finally, we provide support for guest operating systems to effectively utilize advanced virtual memory features such as superpages and global page mappings. The overall impact of these optimizations is an improvement in transmit performance of guest domains by a factor of 4.4. The receive performance of the driver domain is improved by 35% and comes within 7% of native Linux performance. The receive performance in guest domains improves by 18%, but still trails native Linux performance by 61%. We analyse the performance improvements in detail, and quantify the contribution of each optimization to the overall performance.
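    To make the first optimization concrete, the sketch below illustrates, with invented names rather than Xen's actual netfront/netback interfaces, how a guest's virtual NIC might advertise high-level offload features such as TCP segmentation offload, with the driver domain granting them and emulating in software whatever the physical NIC cannot do.

        /* Hypothetical illustration of a virtual NIC negotiating high-level
         * offload features with the driver domain; names are invented for
         * this sketch and do not correspond to the actual Xen interfaces. */
        #include <stdint.h>
        #include <stdio.h>

        #define VNIC_FEAT_TX_CSUM  (1u << 0)   /* checksum offload on transmit */
        #define VNIC_FEAT_TSO      (1u << 1)   /* TCP segmentation offload     */
        #define VNIC_FEAT_SG       (1u << 2)   /* scatter-gather I/O           */

        struct vnic_features {
            uint32_t guest_requested;   /* features the guest wants to use        */
            uint32_t phys_supported;    /* features the physical NIC can do       */
            uint32_t granted;           /* features the driver domain will honour */
        };

        /* The driver domain grants every requested feature: anything the
         * physical NIC cannot do is emulated in software in the driver
         * domain, so the guest still benefits from fewer, larger packets
         * crossing the guest/driver-domain boundary. */
        static void negotiate(struct vnic_features *f)
        {
            f->granted = f->guest_requested;
            uint32_t emulated = f->granted & ~f->phys_supported;
            if (emulated)
                printf("driver domain emulates features 0x%x in software\n",
                       emulated);
        }

        int main(void)
        {
            struct vnic_features f = {
                .guest_requested = VNIC_FEAT_TX_CSUM | VNIC_FEAT_TSO,
                .phys_supported  = VNIC_FEAT_TX_CSUM,   /* no TSO in hardware */
            };
            negotiate(&f);
            return 0;
        }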

    Lazy Asynchronous I/O for Event-Driven Servers

    In this paper, we introduce Lazy Asynchronous I/O (LAIO), a new API for performing I/O that is well-suited, but not limited, to the needs of high-performance, event-driven servers. In addition, we describe and evaluate an implementation of LAIO that demonstrably addresses certain critical limitations of the asynchronous and non-blocking I/O support in present Unix-like systems. LAIO is implemented entirely at user level, without modification to the operating system's kernel; it builds on scheduler activations. Using a micro-benchmark, LAIO was shown to be more than 3 times faster than AIO when the data was already available in memory, and its performance was comparable to AIO when actual I/O needed to be performed. An event-driven web server (thttpd) achieved more than a 38% increase in throughput using LAIO. The Flash web server's throughput, originally achieved with kernel modifications, was matched using LAIO without any kernel modification.
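    The sketch below gives a rough feel for the LAIO programming model from a server's point of view; the laio_* names and the handle type are assumptions made for this illustration (stubbed with an ordinary blocking read), not necessarily the paper's exact API. The key property is that a call which can complete immediately returns its result synchronously, and only a call that would block hands back a handle for later completion notification.

        /* Sketch of an event-driven server using a hypothetical LAIO-style
         * interface; laio_read() and laio_handle_t are invented names for
         * illustration and are stubbed here with an ordinary blocking call. */
        #include <stdio.h>
        #include <unistd.h>
        #include <fcntl.h>

        typedef struct { int pending; int fd; } laio_handle_t;

        /* Stub: a real LAIO library would detect that the call is about to
         * block, return immediately with the handle marked pending, and
         * complete the operation in the background (e.g. via scheduler
         * activations) instead of blocking the caller. */
        static ssize_t laio_read(int fd, void *buf, size_t len, laio_handle_t *h)
        {
            h->pending = 0;
            h->fd = fd;
            return read(fd, buf, len);   /* stand-in for the lazy async call */
        }

        int main(void)
        {
            char buf[4096];
            laio_handle_t h;
            int fd = open("/etc/hostname", O_RDONLY);
            if (fd < 0)
                return 1;

            ssize_t n = laio_read(fd, buf, sizeof buf, &h);
            if (n >= 0) {
                /* Fast path: data was already in memory, no continuation needed. */
                printf("read %zd bytes synchronously\n", n);
            } else if (h.pending) {
                /* Slow path: the call would have blocked; register the handle
                 * with the server's event loop and handle completion later. */
            }
            close(fd);
            return 0;
        }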

    Causeway: Support for Controlling and Analyzing the Execution of Web-Accessible Applications

    Causeway provides runtime support for the development of distributed meta-applications. These meta-applications control or analyze the behavior of multi-tier distributed applications such as multi-tier web sites or web services. Examples of meta-applications include multi-tier debugging, fault diagnosis, resource tracking, prioritization, and security enforcement. Efficient online implementation of these meta-applications requires metadata to be passed between the different program components. Examples of metadata corresponding to the above meta-applications are request identifiers, priorities, or security principal identifiers. Causeway provides the infrastructure for injecting, destroying, reading, and writing such metadata. The key functionality in Causeway is forwarding the metadata associated with a request at so-called transfer points, where the execution of that request is passed from one component to another. This is done automatically for system-visible channels, such as pipes or sockets. An API is provided to implement the forwarding of metadata at system-opaque channels such as shared memory. We describe the design and implementation of Causeway, and we evaluate its usability and performance. Causeway's low overhead allows it to be present permanently in production systems. We demonstrate its usability by showing how to implement, in 150 lines of code and without modification to the application, global priority enforcement in a multi-tier dynamic web server.
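    As an illustration of the kind of API the abstract describes, the sketch below (with invented causeway_* names, not Causeway's actual interface) shows metadata such as a request identifier and priority being attached to a work item before it crosses a system-opaque channel like a shared-memory queue, so the consuming component can read it on the other side.

        /* Hypothetical sketch of metadata forwarding across a system-opaque
         * channel (a shared in-memory work queue); the causeway_* names are
         * invented for this illustration, not Causeway's actual API. */
        #include <stdio.h>
        #include <string.h>

        struct metadata { int request_id; int priority; };

        struct work_item {
            char payload[128];
            struct metadata md;        /* metadata travels with the request */
        };

        /* Producer side: inject/write metadata before handing the request on. */
        static void causeway_write_metadata(struct work_item *w, int id, int prio)
        {
            w->md.request_id = id;
            w->md.priority   = prio;
        }

        /* Consumer side: read the forwarded metadata and, for example,
         * schedule the request according to its end-to-end priority. */
        static void handle(const struct work_item *w)
        {
            printf("request %d handled at priority %d: %s\n",
                   w->md.request_id, w->md.priority, w->payload);
        }

        int main(void)
        {
            struct work_item w;
            strcpy(w.payload, "GET /index.html");
            causeway_write_metadata(&w, 42, 7);
            handle(&w);     /* the transfer point preserves the metadata */
            return 0;
        }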

    Software DSM protocols that adapt between single writer and multiple writer

    We present two software DSM protocols that dynamically adapt between a single writer (SW) and a multiple writer (MW) protocol based on the application's sharing patterns. The first protocol (WFS) adapts based on write-write false sharing; the second (WFS+WG) adapts based on a combination of write-write false sharing and write granularity. The adaptation is automatic; no user or compiler information is needed. The choice between SW and MW is made on a per-page basis, as sketched below. We measured the performance of our adaptive protocols on an 8-node SPARC cluster connected by a 155 Mbps ATM network. We used eight applications, covering a broad spectrum in terms of write-write false sharing and write granularity. We compare our adaptive protocols against the MW-only and the SW-only approach. Adaptation to write-write false sharing proves to be the critical performance factor, while adaptation to write granularity plays only a secondary role in our environment and for the applications considered. Each of the two adaptive protocols matches or exceeds the performance of the best of MW and SW in seven out of the eight applications.
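    The following sketch, with invented counters and thresholds rather than the protocols' actual heuristics, illustrates the flavour of the per-page decision: pages written by several nodes at disjoint addresses (write-write false sharing) are kept in multiple-writer mode, while pages with a single writer stay in single-writer mode.

        /* Illustrative sketch (invented statistics and policy) of a per-page
         * choice between single-writer (SW) and multiple-writer (MW) modes,
         * driven by observed write-write false sharing, in the spirit of the
         * WFS adaptation described above. */
        #include <stdio.h>

        enum page_mode { MODE_SW, MODE_MW };

        struct page_stats {
            unsigned writers_this_interval;  /* distinct nodes that wrote the page   */
            unsigned overlapping_writes;     /* writes to the same words (true sharing) */
        };

        /* If several nodes write disjoint parts of the page (false sharing),
         * MW with twins and diffs avoids ping-ponging page ownership; if only
         * one node writes, SW avoids the cost of twinning and diffing. */
        static enum page_mode choose_mode(const struct page_stats *s)
        {
            if (s->writers_this_interval > 1 && s->overlapping_writes == 0)
                return MODE_MW;              /* write-write false sharing detected */
            return MODE_SW;
        }

        int main(void)
        {
            struct page_stats falsely_shared = { .writers_this_interval = 3,
                                                 .overlapping_writes = 0 };
            struct page_stats exclusive      = { .writers_this_interval = 1,
                                                 .overlapping_writes = 0 };
            printf("falsely shared page -> %s\n",
                   choose_mode(&falsely_shared) == MODE_MW ? "MW" : "SW");
            printf("exclusively written page -> %s\n",
                   choose_mode(&exclusive) == MODE_MW ? "MW" : "SW");
            return 0;
        }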

    Evaluating the performance of software distributed shared memory as a target for parallelizing compilers

    In this paper we evaluate the use of software distributed shared memory (DSM) on a message passing machine as the target for a parallelizing compiler. We compare this approach to compiler-generated message passing, hand-coded software DSM, and hand-coded message passing. For this comparison, we use six applications: four that are regular and two that are irregular. Our results are gathered on an 8-node IBM SP/2 using the TreadMarks software DSM system. We use the APR shared-memory (SPF) compiler to generate the shared-memory programs and the APR XHPF compiler to generate message passing programs. The hand-coded message passing programs run with the IBM PVMe optimized message passing library. On the regular programs, both the compiler-generated and the hand-coded message passing outperform the SPF/TreadMarks combination: the compiler-generated message passing by 5.5% to 40%, and the hand-coded message passing by 7.5% to 49%. On the irregular programs, the SPF/TreadMarks combination outperforms the compiler-generated message passing by 38% and 89%, and only slightly underperforms the hand-coded message passing, differing by 4.4% and 16%. We also identify the factors that account for the performance differences, estimate their relative importance, and describe methods to improve the performance.
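    For readers unfamiliar with the two programming models being compared, the sketch below (not taken from the paper; exchange_boundaries is a hypothetical placeholder for PVMe/MPI-style communication) contrasts a shared-memory relaxation loop, where the DSM layer keeps the arrays consistent, with a message-passing counterpart, where each process owns only a slice and exchanges ghost cells explicitly.

        /* Minimal, hypothetical contrast of the two programming models. */
        #include <stdio.h>

        #define N 1024
        static double a[N], b[N];

        /* Shared-memory (DSM) style: each process updates its block of the
         * globally shared arrays; the DSM layer (e.g. TreadMarks) propagates
         * the updates at synchronization points. */
        static void relax_shared(int myid, int nprocs)
        {
            int lo = 1 + (N - 2) * myid / nprocs;
            int hi = 1 + (N - 2) * (myid + 1) / nprocs;
            for (int i = lo; i < hi; i++)
                b[i] = 0.5 * (a[i - 1] + a[i + 1]);
        }

        /* Hypothetical stand-in for PVMe/MPI-style boundary exchange; real
         * code would send and receive the ghost cells local[0], local[n+1]. */
        static void exchange_boundaries(double *local, int n)
        {
            (void)local; (void)n;
        }

        /* Message-passing style: each process stores only its own slice plus
         * ghost cells and must communicate boundary elements explicitly. */
        static void relax_message_passing(double *local, int n)
        {
            exchange_boundaries(local, n);
            for (int i = 1; i <= n; i++)
                local[i] = 0.5 * (local[i - 1] + local[i + 1]);
        }

        int main(void)
        {
            double slice[N / 4 + 2] = {0};
            relax_shared(0, 4);
            relax_message_passing(slice, N / 4);
            printf("b[1] = %f, slice[1] = %f\n", b[1], slice[1]);
            return 0;
        }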

    Software vs. Hardware Shared Memory Implementation: A Case Study

    We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on an SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.