91 research outputs found
Memory system architecture for real-time multitasking systems
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 119-120). By Scott Rixner. M.Eng.
GMEM: Generalized Memory Management for Peripheral Devices
This paper presents GMEM, generalized memory management, for peripheral
devices. GMEM provides OS support for centralized memory management of both CPU
and devices. GMEM provides a high-level interface that decouples MMU-specific
functions. Device drivers can thus attach themselves to a process's address
space and let the OS take charge of their memory management. This eliminates
the need for device drivers to "reinvent the wheel" and allows them to benefit
from general memory optimizations integrated by GMEM. Furthermore, GMEM
internally coordinates all attached devices within each virtual address space.
This drastically improves user-level programmability, since programmers can use
a single address space within their program, even when operating across the CPU
and multiple devices. A case study on device drivers demonstrates these
benefits. A GMEM-based IOMMU driver eliminates around seven hundred lines of
code and obtains 54% higher network receive throughput utilizing 32% less CPU
compared to the state-of-the-art. In addition, the GMEM-based driver of a
simulated GPU takes less than 70 lines of code, excluding its MMU functions.
Comment: Finished before Weixi left Rice and submitted to ASPLOS'2
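The coordination described above can be illustrated with a small sketch: a central manager owns one virtual address space, and each attached device exposes only its MMU-specific hooks, receiving mapping updates from the manager instead of keeping its own bookkeeping. All class and method names here are illustrative, not the paper's actual API.

```python
class DeviceMMU:
    """A device driver exposes only its MMU-specific operations."""
    def __init__(self, name):
        self.name = name
        self.page_table = {}               # va -> pa, device-local shadow

    def map(self, va, pa):
        self.page_table[va] = pa

    def unmap(self, va):
        self.page_table.pop(va, None)


class GMem:
    """Centralized manager: one address space, fanned out to all devices."""
    def __init__(self):
        self.devices = []
        self.mappings = {}                 # single va -> pa view

    def attach(self, dev):
        self.devices.append(dev)
        for va, pa in self.mappings.items():   # sync existing mappings
            dev.map(va, pa)

    def map_page(self, va, pa):
        self.mappings[va] = pa
        for dev in self.devices:           # one update reaches every MMU
            dev.map(va, pa)

    def unmap_page(self, va):
        self.mappings.pop(va, None)
        for dev in self.devices:
            dev.unmap(va)
```

Under this model, a program uses a single set of addresses across the CPU, an IOMMU, and a GPU: attaching a device after mappings exist still leaves it with a consistent view.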
A Bandwidth-efficient Architecture for a Streaming Media Processor
Media processing applications, such as three-dimensional graphics, video compression, and image processing, currently demand 10-100 billion operations per second of sustained computation. Fortunately, hundreds of arithmetic units can easily fit on a modestly sized 1 cm² chip in modern VLSI. The challenge is to provide these arithmetic units with enough data to enable them to meet the computation demands of media applications. Conventional storage hierarchies, which frequently include caches, are unable to bridge the data bandwidth gap between modern DRAM and tens to hundreds of arithmetic units. A data bandwidth hierarchy, however, can bridge this gap by scaling the provided bandwidth across the levels of the storage hierarchy. The stream programming model enables media processing applications to exploit a data bandwidth hierarchy effectively. Media processing applications can naturally be expressed as a sequence of computation kernels that operate on data streams. This programming …
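The stream programming model mentioned above can be sketched in a few lines: computation is a chain of kernels, each consuming an input stream and producing an output stream, so intermediates flow kernel-to-kernel rather than round-tripping through DRAM. The kernels below are invented for illustration and are not from the paper.

```python
def kernel(fn):
    """Wrap a per-element function as a stream kernel."""
    def run(stream):
        return [fn(x) for x in stream]     # one pass over the stream
    return run

# Hypothetical image-processing-style kernels.
scale = kernel(lambda px: px * 2)
clamp = kernel(lambda px: min(px, 255))

def run_pipeline(stream, kernels):
    # Each kernel's output stream directly feeds the next kernel,
    # which is what lets a bandwidth hierarchy keep intermediates
    # close to the arithmetic units.
    for k in kernels:
        stream = k(stream)
    return stream
```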
The Owl Embedded Python Environment: Microcontroller Development for the Modern World
The author's research spans computer architecture, embedded systems software, and high-performance computing. Outside of graduate school, he has worked as an expert witness and in litigation support for intellectual property cases.
Comparing Ethernet and Myrinet for MPI communication
This paper compares the performance of Myrinet and Ethernet as a communication substrate for MPI libraries. MPI library implementations for Myrinet utilize user-level communication protocols to provide low-latency and high-bandwidth MPI messaging. In contrast, MPI library implementations for Ethernet utilize the operating system network protocol stack, leading to higher message latency and lower message bandwidth. However, on the NAS benchmarks, GM messaging over Myrinet only achieves 5% higher application performance than TCP messaging over Ethernet. Furthermore, efficient TCP messaging implementations improve communication latency tolerance, which closes the performance gap between Myrinet and Ethernet to about 0.5% on the NAS benchmarks. This shows that commodity networking, if used efficiently, can be a viable alternative to specialized networking for high-performance message passing.
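The latency-versus-bandwidth trade-off behind this comparison can be captured with a standard back-of-the-envelope message-cost model: transfer time = per-message latency + size / bandwidth. The latency and bandwidth figures below are illustrative placeholders, not measurements from the paper.

```python
def msg_time_us(size_bytes, latency_us, bandwidth_mbps):
    """Time to move one message, in microseconds.

    1 Mb/s == 1 bit/us, so (bits / Mb/s) yields microseconds directly.
    """
    return latency_us + (size_bytes * 8) / bandwidth_mbps

# Assumed ballpark figures: user-level messaging (low per-message
# latency) vs. a kernel TCP stack (higher per-message latency).
gm_small  = msg_time_us(64,      10, 2000)
tcp_small = msg_time_us(64,      50, 1000)
gm_large  = msg_time_us(1 << 20, 10, 2000)
tcp_large = msg_time_us(1 << 20, 50, 1000)
```

For small messages the fixed latency dominates, so the user-level protocol wins by a wide margin; for large messages the gap shrinks toward the raw bandwidth ratio. This is the regime in which latency tolerance lets TCP messaging close most of the application-level gap.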
RiceNIC: A reconfigurable network interface for experimental research and education
The evaluation of new network server architectures is usually performed experimentally using either a simulator or a hardware prototype. Accurate simulation of the hardware-software interface within the network subsystem is challenging due to the interactions of multiple asynchronous systems. Small timing inaccuracies in such a system can perturb the hardware and software state yielding potentially misleading results. Hardware prototypes show more promise because they are real-world implementations, not simplifications. Existing Ethernet network interface cards (NICs) are unsuitable for prototyping as they lack the capability and/or flexibility for advanced networking research. RiceNIC is an open network interface prototyping platform for public use. This reconfigurable and programmable Gigabit Ethernet NIC is designed to address the dilemma of how to accurately evaluate new ideas in network server architecture, and is built for use in experimental research and education. The flexibility and capability of RiceNIC has proven invaluable in recent research efforts. Copyright 2007 ACM
Exploiting Task-Level Concurrency in a Programmable Network Interface
Conference Paper. Programmable network interfaces provide the potential to extend the functionality of network services but lead to instruction processing overheads when compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32-107% for real network services. This parallelization results in performance within 10-20% of a modern ASIC-based network interface for real network services. National Science Foundation.
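The partitioning step described above can be sketched as a load-balancing assignment: each event handler has an estimated cost, and handlers are placed on the two processors so that loads stay even. The handler names and costs below are invented for illustration; the paper's actual partition also accounts for shared state, which this greedy sketch ignores.

```python
def partition(handlers):
    """Greedy longest-processing-time assignment onto two processors."""
    cpus = {0: [], 1: []}
    load = {0: 0, 1: 0}
    # Place the most expensive handlers first, always onto the
    # currently lighter processor.
    for name, cost in sorted(handlers.items(), key=lambda kv: -kv[1]):
        target = 0 if load[0] <= load[1] else 1
        cpus[target].append(name)
        load[target] += cost
    return cpus, load

# Hypothetical per-packet event handlers with relative costs.
handlers = {
    "fetch_tx_descriptor": 3,
    "dma_tx_payload": 5,
    "send_frame": 4,
    "recv_frame": 4,
    "dma_rx_payload": 5,
    "post_rx_status": 3,
}
```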
Performance Characterization of the FreeBSD Network Stack
This paper analyzes the behavior of high-performance web servers along three axes: packet rate, number of connections, and communication latency. Modern, high-performance servers spend a significant fraction of time executing the network stack of the operating system---over 80% of the time for a web server. These servers must handle increasing packet rates, increasing numbers of connections, and the long round trip times of the Internet. Low overhead, non-statistical profiling shows that a large number of connections and long latencies degrade instruction throughput of the operating system network stack significantly. This degradation results from a dramatic increase in L2 cache capacity misses because the working set size of connection data structures grows in proportion to the number of connections and their reuse decreases as communication latency increases. For instance, L2 cache misses increase the number of cycles spent executing the TCP layer of the network stack by over 300% from 1312 cycles per packet to 5364. The obvious solutions of increasing the L2 cache size or using prefetching to reduce the number of misses are surprisingly ineffective.
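The capacity-miss effect described here can be reproduced with a toy model: connection control blocks are touched in rotation, and once their combined footprint exceeds the cache, an LRU cache misses on every access. All sizes below are illustrative, not the paper's measured configuration.

```python
from collections import OrderedDict

def miss_rate(num_conns, block_bytes, cache_bytes, accesses):
    """Fraction of accesses that miss an LRU cache of connection blocks."""
    capacity = cache_bytes // block_bytes    # blocks the cache can hold
    cache = OrderedDict()
    misses = 0
    for i in range(accesses):
        conn = i % num_conns                 # round-robin reuse pattern
        if conn in cache:
            cache.move_to_end(conn)          # refresh LRU position
        else:
            misses += 1
            cache[conn] = True
            if len(cache) > capacity:
                cache.popitem(last=False)    # evict least-recently-used
    return misses / accesses
```

With a small connection count the working set fits and nearly every access hits; once the connection count exceeds cache capacity, cyclic reuse against LRU degrades to a miss on every access, which is the behavior driving the 300%+ increase in TCP-layer cycles.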
Increasing Web Server Throughput with Network Interface Data Caching
Conference Paper. This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently-requested content on a programmable network interface. The operating system on the host CPU determines which data to store in the cache and for which packets it should use data from the cache. To facilitate data reuse across multiple packets and connections, the cache only stores application-level response content (such as HTTP data), with application-level and networking headers generated by the host CPU. Network interface data caching can reduce PCI traffic by up to 57% on a prototype implementation of a uniprocessor web server. This traffic reduction results in up to 31% performance improvement, leading to a peak server throughput of 1571 Mb/s. National Science Foundation.
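A minimal sketch of the mechanism above: the host decides which response payloads live in the NIC's cache, and for a cache hit it sends only the headers plus a small cache reference across the local interconnect instead of the full payload. Names, sizes, and the 8-byte reference are assumptions for illustration, not the paper's implementation.

```python
class NicCache:
    """Content cache resident on the network interface."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = {}                   # key -> cached payload

    def insert(self, key, payload):
        if self.used + len(payload) > self.capacity:
            return False                   # host falls back to full transfer
        self.blocks[key] = payload
        self.used += len(payload)
        return True

def send_response(nic, key, payload, header):
    """Return the number of bytes crossing the local interconnect."""
    if key in nic.blocks:
        return len(header) + 8             # headers + small cache reference
    nic.insert(key, payload)               # host-chosen: cache for reuse
    return len(header) + len(payload)      # first send pays full cost
```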