
    Memory system architecture for real-time multitasking systems

    Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 119-120). By Scott Rixner.

    GMEM: Generalized Memory Management for Peripheral Devices

    This paper presents GMEM, generalized memory management, for peripheral devices. GMEM provides OS support for centralized memory management of both the CPU and devices, through a high-level interface that decouples MMU-specific functions. Device drivers can thus attach themselves to a process's address space and let the OS take charge of their memory management. This eliminates the need for device drivers to "reinvent the wheel" and allows them to benefit from general memory optimizations integrated by GMEM. Furthermore, GMEM internally coordinates all attached devices within each virtual address space. This drastically improves user-level programmability, since programmers can use a single address space within their program, even when operating across the CPU and multiple devices. A case study on device drivers demonstrates these benefits. A GMEM-based IOMMU driver eliminates around seven hundred lines of code and obtains 54% higher network receive throughput utilizing 32% less CPU compared to the state of the art. In addition, the GMEM-based driver of a simulated GPU takes less than 70 lines of code, excluding its MMU functions. Comment: Finished before Weixi left Rice and submitted to ASPLOS'2
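
    A hedged sketch of the driver-facing pattern the abstract describes may help: rather than reimplementing page-table management, a driver attaches to a process's address space and supplies only MMU-specific callbacks. All identifiers below (gm_mmu_ops, gm_attach_device, and the stubs) are invented for illustration and are not the actual GMEM interface.

```c
#include <stddef.h>

/* Device-specific MMU hooks that the central memory manager invokes
 * as needed (hypothetical names, modeled on the abstract). */
struct gm_mmu_ops {
    int  (*map)(void *dev, unsigned long va, unsigned long pa, size_t len);
    int  (*unmap)(void *dev, unsigned long va, size_t len);
    void (*tlb_invalidate)(void *dev, unsigned long va, size_t len);
};

/* Assumed core entry point: bind a device's MMU to a process's
 * virtual address space so the OS manages its mappings. */
int gm_attach_device(int pid, void *dev, const struct gm_mmu_ops *ops);

/* Stub callbacks standing in for real page-table programming. */
static int  dev_map(void *dev, unsigned long va, unsigned long pa,
                    size_t len) { return 0; }
static int  dev_unmap(void *dev, unsigned long va, size_t len) { return 0; }
static void dev_inval(void *dev, unsigned long va, size_t len) { }

/* After attaching, the driver no longer tracks mappings itself; the
 * CPU and the device share a single address space. */
static int my_driver_attach(int pid, void *dev)
{
    static const struct gm_mmu_ops ops = {
        .map = dev_map, .unmap = dev_unmap, .tlb_invalidate = dev_inval,
    };
    return gm_attach_device(pid, dev, &ops);
}
```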

    A Bandwidth-efficient Architecture for a Streaming Media Processor

    Media processing applications, such as three-dimensional graphics, video compression, and image processing, currently demand 10-100 billion operations per second of sustained computation. Fortunately, hundreds of arithmetic units can easily fit on a modestly sized 1 cm² chip in modern VLSI. The challenge is to provide these arithmetic units with enough data to enable them to meet the computation demands of media applications. Conventional storage hierarchies, which frequently include caches, are unable to bridge the data bandwidth gap between modern DRAM and tens to hundreds of arithmetic units. A data bandwidth hierarchy, however, can bridge this gap by scaling the provided bandwidth across the levels of the storage hierarchy. The stream programming model enables media processing applications to exploit a data bandwidth hierarchy effectively. Media processing applications can naturally be expressed as a sequence of computation kernels that operate on data streams. This programming …
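
    The stream model the abstract refers to is easy to picture in code: computation is a chain of kernels over data streams, with intermediate results held in fast on-chip storage rather than DRAM. The minimal C sketch below is illustrative only; the two kernels and the on-chip buffer are stand-ins, not the processor's actual ISA or runtime.

```c
#include <stddef.h>

/* Each kernel consumes one input stream and produces one output
 * stream; both kernels here are illustrative stand-ins. */
static void scale_half(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 0.5f;
}

static void clamp01(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] < 0.0f ? 0.0f : (in[i] > 1.0f ? 1.0f : in[i]);
}

/* Chain kernels through 'onchip', which stands in for the fast
 * intermediate storage a data bandwidth hierarchy provides, so the
 * intermediate stream never round-trips through DRAM. */
static void run_pipeline(const float *in, float *out,
                         float *onchip, size_t n)
{
    scale_half(in, onchip, n);  /* kernel 1: DRAM -> on-chip */
    clamp01(onchip, out, n);    /* kernel 2: on-chip -> DRAM */
}
```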

    The Owl Embedded Python Environment: Microcontroller Development for the Modern World

    …research in computer architecture, embedded systems software, and high-performance computing. Outside of graduate school, he has worked as an expert witness and in litigation support for intellectual property…

    Comparing Ethernet and Myrinet for MPI communication

    This paper compares the performance of Myrinet and Ethernet as a communication substrate for MPI libraries. MPI library implementations for Myrinet utilize user-level communication protocols to provide low-latency and high-bandwidth MPI messaging. In contrast, MPI library implementations for Ethernet utilize the operating system network protocol stack, leading to higher message latency and lower message bandwidth. However, on the NAS benchmarks, GM messaging over Myrinet only achieves 5% higher application performance than TCP messaging over Ethernet. Furthermore, efficient TCP messaging implementations improve communication latency tolerance, which closes the performance gap between Myrinet and Ethernet to about 0.5% on the NAS benchmarks. This shows that commodity networking, if used efficiently, can be a viable alternative to specialized networking for high-performance message passing.
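
    The latency-tolerance point generalizes: overlapping communication with independent computation hides much of TCP's extra message latency. The sketch below shows the standard MPI idiom for such overlap using nonblocking operations; it is a generic example, not code from the paper.

```c
#include <mpi.h>

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int peer, double *work, int work_n)
{
    MPI_Request reqs[2];

    /* Post the receive and send first so transfers start early. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent computation proceeds while messages are in flight,
     * hiding network latency regardless of the underlying substrate. */
    for (int i = 0; i < work_n; i++)
        work[i] = work[i] * 2.0 + 1.0;

    /* Block only once the overlapping work is exhausted. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```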

    RiceNIC: A reconfigurable network interface for experimental research and education

    The evaluation of new network server architectures is usually performed experimentally using either a simulator or a hardware prototype. Accurate simulation of the hardware-software interface within the network subsystem is challenging due to the interactions of multiple asynchronous systems. Small timing inaccuracies in such a system can perturb the hardware and software state, yielding potentially misleading results. Hardware prototypes show more promise because they are real-world implementations, not simplifications. Existing Ethernet network interface cards (NICs) are unsuitable for prototyping as they lack the capability and/or flexibility for advanced networking research. RiceNIC is an open network interface prototyping platform for public use. This reconfigurable and programmable Gigabit Ethernet NIC is designed to address the dilemma of how to accurately evaluate new ideas in network server architecture, and is built for use in experimental research and education. The flexibility and capability of RiceNIC have proven invaluable in recent research efforts.

    Exploiting Task-Level Concurrency in a Programmable Network Interface

    Programmable network interfaces provide the potential to extend the functionality of network services but lead to instruction processing overheads when compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32-107% for real network services. This parallelization results in performance within 10-20% of a modern ASIC-based network interface for real network services.
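
    The partitioning idea can be sketched in a few lines of C: split the event handlers by packet-processing stage so each processor's state stays local. The event types, ownership policy, and dispatch loop below are illustrative assumptions, not the actual firmware.

```c
/* Event types for the stages of packet processing; illustrative only. */
enum event_type { EV_TX_DMA_DONE, EV_RX_DMA_DONE, EV_MAC_TX, EV_MAC_RX };

struct event {
    enum event_type type;
    void *data;
};

/* Static partition: processor 0 owns the transmit path and processor 1
 * the receive path, so per-packet state stays in local on-chip storage
 * and is never shared between the two processors. */
static int owner_cpu(enum event_type t)
{
    return (t == EV_TX_DMA_DONE || t == EV_MAC_TX) ? 0 : 1;
}

/* Per-processor dispatch loop; next_event() and handle() are assumed
 * to exist in the surrounding firmware. */
void firmware_loop(int cpu, struct event *(*next_event)(int),
                   void (*handle)(struct event *))
{
    for (;;) {
        struct event *ev = next_event(cpu);
        if (ev && owner_cpu(ev->type) == cpu)
            handle(ev);
    }
}
```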

    Performance Characterization of the FreeBSD Network Stack

    This paper analyzes the behavior of high-performance web servers along three axes: packet rate, number of connections, and communication latency. Modern, high-performance servers spend a significant fraction of their time executing the network stack of the operating system---over 80% of the time for a web server. These servers must handle increasing packet rates, increasing numbers of connections, and the long round trip times of the Internet. Low-overhead, non-statistical profiling shows that a large number of connections and long latencies significantly degrade instruction throughput of the operating system network stack. This degradation results from a dramatic increase in L2 cache capacity misses, because the working set size of connection data structures grows in proportion to the number of connections and their reuse decreases as communication latency increases. For instance, L2 cache misses increase the number of cycles spent executing the TCP layer of the network stack by over 300%, from 1312 cycles per packet to 5364. The obvious solutions of increasing the L2 cache size or using prefetching to reduce the number of misses are surprisingly ineffective.
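
    As a quick check of the TCP-layer figure quoted above: (5364 − 1312) / 1312 ≈ 3.09, i.e. an increase of just over 300% in cycles per packet, consistent with the stated result.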

    Increasing Web Server Throughput with Network Interface Data Caching

    This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently requested content on a programmable network interface. The operating system on the host CPU determines which data to store in the cache and for which packets it should use data from the cache. To facilitate data reuse across multiple packets and connections, the cache only stores application-level response content (such as HTTP data), with application-level and networking headers generated by the host CPU. Network interface data caching can reduce PCI traffic by up to 57% on a prototype implementation of a uniprocessor web server. This traffic reduction results in up to 31% performance improvement, leading to a peak server throughput of 1571 Mb/s.
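
    A hedged sketch of the host-side logic the abstract outlines: when a response payload is already resident on the NIC, the OS sends only host-generated headers plus a cache reference across the bus, and DMAs the payload only on a miss. The descriptor layout and the nic_cache_lookup helper below are hypothetical, invented for illustration; the paper's actual interface may differ.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct nic_tx_desc {
    const void *hdr;          /* headers always generated by the host CPU */
    size_t      hdr_len;
    bool        payload_cached;
    uint32_t    cache_slot;   /* valid when payload_cached is true */
    const void *payload;      /* DMA'd over PCI only on a cache miss */
    size_t      payload_len;
};

/* Assumed lookup into the OS's map of NIC-resident response content. */
bool nic_cache_lookup(const void *payload, size_t len, uint32_t *slot);

void build_tx_desc(struct nic_tx_desc *d, const void *hdr, size_t hdr_len,
                   const void *payload, size_t payload_len)
{
    d->hdr = hdr;
    d->hdr_len = hdr_len;
    d->payload = payload;
    d->payload_len = payload_len;
    /* Reuse NIC-resident data when possible to cut PCI traffic. */
    d->payload_cached = nic_cache_lookup(payload, payload_len,
                                         &d->cache_slot);
}
```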