
    Fast Byte Copying: A Re-Evaluation of the Opportunities for Optimization

    High-performance byte copying is important for many operating systems because it is the principal method used for transferring data between kernel and user protection domains. For example, byte copying is commonly used for transferring data from kernel buffers to user buffers during file system read and IPC recv calls, and to kernel buffers from user buffers during write and send calls. Because of its impact on overall system performance, commercial operating systems tend to employ many specialized byte copy routines, each one optimized for a different circumstance. This paper revisits the opportunities for optimizing byte copy performance by discussing a series of experiments run under HP-UX 9.03 on a range of Hewlett-Packard PA-RISC processors. First, we compare the performance improvements that result from several existing byte copy optimizations. Then we show that byte copy performance is dominated by cache effects that arise when source and target addresses overlap. Finally, we discuss the opportunities and difficulties associated with choosing appropriate source and target addresses to optimize byte copy performance.
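    The abstract's central observation is that copy throughput depends on whether the source and target regions map onto the same cache lines. As a hedged illustration only (not the paper's actual benchmark), the C sketch below times a copy loop at two source/target distances: one that collides in a direct-mapped cache of an assumed size and one that does not. The cache size, copy size, and iteration count are assumptions chosen for illustration.

        /* Hypothetical sketch: time a byte copy at two source/target distances to
         * expose cache conflict effects on a direct-mapped (or low-associativity)
         * cache. Cache size, copy size, and iteration count are assumptions. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>

        #define ASSUMED_CACHE_BYTES (256 * 1024)  /* assumed data cache size */
        #define COPY_BYTES          (64 * 1024)   /* bytes moved per copy */
        #define ITERATIONS          10000

        int main(void)
        {
            /* One arena so source and target can be placed at controlled offsets. */
            char *arena = malloc(4 * ASSUMED_CACHE_BYTES);
            if (!arena)
                return 1;
            memset(arena, 0xAB, 4 * ASSUMED_CACHE_BYTES);

            /* An offset equal to the cache size makes source and target share cache
             * indices; half the cache size keeps their indices disjoint. */
            size_t offsets[] = { ASSUMED_CACHE_BYTES, ASSUMED_CACHE_BYTES / 2 };

            for (int i = 0; i < 2; i++) {
                char *src = arena;
                char *dst = arena + offsets[i];

                clock_t start = clock();
                for (int it = 0; it < ITERATIONS; it++)
                    memcpy(dst, src, COPY_BYTES);
                double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

                printf("offset %zu bytes: %.1f MB/s\n", offsets[i],
                       (double)COPY_BYTES * ITERATIONS / (1024.0 * 1024.0) / secs);
            }
            free(arena);
            return 0;
        }

    On a machine whose data cache behaves like the assumed one, the colliding offset should show noticeably lower throughput, which is the kind of effect the paper attributes to overlapping source and target addresses.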

    Reducing consistency traffic and cache misses in the avalanche multiprocessor

    Journal Article: For a parallel architecture to scale effectively, communication latency between processors must be avoided. We have found that the source of a large number of avoidable cache misses is the use of hardwired write-invalidate coherency protocols, which often exhibit high cache miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer.
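    The claimed benefit hinges on a simple traffic trade-off: write-invalidate forces a sharer to reload a line after every remote write, while write-update refreshes the sharer's copy at the cost of an update message per write. The toy C model below is a rough illustration only (not the Avalanche simulator; the producer/consumer pattern and single-line model are assumptions) and counts misses and coherence messages for one shared line under each policy.

        /* Hypothetical toy model: count misses and coherence messages for a single
         * shared cache line under idealized write-invalidate vs. write-update,
         * for a repeated producer (cpu 0) / consumer (cpu 1) sharing pattern. */
        #include <stdio.h>
        #include <stdbool.h>

        enum protocol { WRITE_INVALIDATE, WRITE_UPDATE };

        struct stats { long misses; long coherence_msgs; };

        static struct stats run(enum protocol p, long rounds)
        {
            struct stats s = { 0, 0 };
            bool valid[2] = { false, false };   /* does each cache hold the line? */

            for (long r = 0; r < rounds; r++) {
                /* cpu 0 writes the shared line */
                if (!valid[0]) { s.misses++; valid[0] = true; }
                if (valid[1]) {
                    s.coherence_msgs++;             /* invalidate or update message */
                    if (p == WRITE_INVALIDATE)
                        valid[1] = false;           /* consumer loses its copy ... */
                    /* ... under write-update the copy is refreshed in place */
                }
                /* cpu 1 reads the shared line */
                if (!valid[1]) { s.misses++; valid[1] = true; }
            }
            return s;
        }

        int main(void)
        {
            struct stats inv = run(WRITE_INVALIDATE, 1000000);
            struct stats upd = run(WRITE_UPDATE, 1000000);
            printf("write-invalidate: %ld misses, %ld coherence messages\n",
                   inv.misses, inv.coherence_msgs);
            printf("write-update:     %ld misses, %ld coherence messages\n",
                   upd.misses, upd.coherence_msgs);
            return 0;
        }

    In this idealized pattern the invalidate policy pays one miss per round to reload the invalidated line, while the update policy misses only on the first round; the paper's contribution is making a write-update (and multiple-writer, release-consistent) option available in hardware alongside other protocols.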

    Analysis of Avalanche's shared memory architecture

    Technical Report: In this paper, we describe the design of the Avalanche multiprocessor's shared memory subsystem, evaluate its performance, and discuss problems associated with using commodity workstations and network interconnects as the building blocks of a scalable shared memory multiprocessor. Compared to other scalable shared memory architectures, Avalanche has a number of novel features, including its support for the Simple COMA memory architecture and its support for multiple coherency protocols (migratory, delayed write update, and (soon) write invalidate). We describe the performance implications of Avalanche's architecture and the impact of various novel low-level design options, and we report a number of interesting phenomena we encountered while developing a scalable multiprocessor built on the HP PA-RISC platform.

    A study of workstation computational performance for real-time flight simulation

    With recent advances in microprocessor technology, some have suggested that modern workstations provide enough computational power to properly operate a real-time simulation. This paper presents the results of a computational benchmark, based on actual real-time flight simulation code used at Langley Research Center, which was executed on various workstation-class machines. The benchmark was executed on different machines from several companies including: CONVEX Computer Corporation, Cray Research, Digital Equipment Corporation, Hewlett-Packard, Intel, International Business Machines, Silicon Graphics, and Sun Microsystems. The machines are compared by their execution speed, computational accuracy, and porting effort. The results of this study show that the raw computational power needed for real-time simulation is now offered by workstations.

    Message passing support in the Avalanche Widget

    Journal Article: Minimizing communication latency in message passing multiprocessing systems is critical. An emerging problem in these systems is the latency cost of percolating a message through the memory hierarchy (at both the sending and receiving nodes) and the additional cost of managing consistency within that hierarchy. This paper considers three important aspects of these costs: cache coherence, message copying, and cache miss rates. It then shows, via a simulation study, how a design called the Widget can be used with existing commercial workstation technology to significantly reduce these costs and support efficient message passing in the Avalanche multiprocessing system.