    Context Switching with Multiple Register Windows: A RISC Performance Study

    Although previous studies have shown that a large file of overlapping register windows can greatly reduce procedure call/return overhead, the effects of register windows in a multiprogramming environment are poorly understood. This paper investigates the performance of multiprogrammed reduced instruction set computers (RISCs) as a function of window management strategy. Using an analytic model that reflects context switch and procedure call overheads, we analyze the performance of simple, linearly self-recursive programs. For more complex programs, we present the results of a simulation study. These studies show that a simple strategy that saves all windows prior to a context switch, but restores only a single window following a context switch, performs near-optimally.

    The Susceptibility of Programs to Context Switching

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / MIP-8809478; NCR Corp.; AMD Corp. 29K Advanced Processor Development Division; National Aeronautics and Space Administration / NASA NAG 1-613; Office of Naval Research / N00014-88-K-0656; Hewlett-Packard Co

    Profile-Guided Automatic Inline Expansion for C Programs

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / MIP-8809478; NCR; AMD 29K Advanced Processor Development Division; National Aeronautics and Space Administration / NASA NAG 1-61

    Efficient Instruction Sequencing with Inline Target Insertion

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / MIP-8809478; NCR; National Aeronautics and Space Administration / NASA NAG 1-613; Office of Naval Research / N00014-88-K-065

    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and to significantly reduce memory access latency.
    We first present a case study at the application level that restructures memory-intensive programs by utilizing program-specific knowledge. This study identifies the problem of bit-reversals, a set of data reordering operations used extensively in scientific computing programs such as FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems.
    The access latency to the DRAM core has become increasingly long relative to CPU speed, making memory accesses an execution bottleneck. To reduce the frequency of DRAM core accesses and thereby shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications.
    Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM remains the most cost-effective technology for main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and can be applied directly in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.
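
    The bit-reversal reordering mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's code: the function names are hypothetical, and `padded_index` is only an assumed, simplified padding rule that shows the general idea of breaking up power-of-two strides.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Return i with its lowest `bits` bits reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversal_permutation(data: list) -> list:
    """Move the element at index i to index bit_reverse(i)."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [None] * n
    for i, x in enumerate(data):
        out[bit_reverse(i, bits)] = x
    return out


def padded_index(i: int, line_elems: int, pad: int) -> int:
    """Hypothetical padding rule: leave `pad` unused slots after
    every `line_elems` elements, so strides stop being powers of two."""
    return i + (i // line_elems) * pad


if __name__ == "__main__":
    print(bit_reversal_permutation(list(range(8))))
    # Consecutive source indices land n/2 apart in the destination,
    # a power-of-two stride of the kind that tends to collide in a cache.
    print(bit_reverse(1, 3) - bit_reverse(0, 3))
```

    In this sketch, consecutive source elements are written half the array apart, which is the sort of power-of-two stride the abstract associates with cache conflicts.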

    A Characterization of Processor Performance in the VAX-11/780

    This paper reports the results of a study of VAX-11/780 processor performance using a novel hardware monitoring technique. A micro-PC histogram monitor was built for these measurements. It keeps a count of the number of microcode cycles executed at each microcode location. Measurement experiments were performed on live timesharing workloads as well as on synthetic workloads of several types. The histogram counts allow the calculation of the frequency of various architectural events, such as the frequency of different types of opcodes and operand specifiers, as well as the frequency of some implementation-specific events, such as translation buffer misses. The measurement technique also yields the amount of processing time spent in various activities, such as ordinary microcode computation, memory management, and processor stalls of different kinds. This paper reports in detail the amount of time the 'average' VAX instruction spends in these activities.
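
    As a rough illustration of how per-location cycle counts can be turned into time breakdowns, here is a minimal Python sketch. It is not taken from the paper: the addresses, activity labels, and function names are hypothetical, and `cycles_per_instruction` assumes that one microcode location is executed exactly once per instruction.

```python
from collections import defaultdict


def time_by_activity(histogram: dict, activity_of: dict) -> dict:
    """Return the fraction of all cycles spent in each activity.

    histogram:   microcode address -> cycle count
    activity_of: microcode address -> activity label (hypothetical mapping)
    """
    totals = defaultdict(int)
    for addr, cycles in histogram.items():
        totals[activity_of.get(addr, "unclassified")] += cycles
    total = sum(totals.values())
    if total == 0:
        return {}
    return {activity: c / total for activity, c in totals.items()}


def cycles_per_instruction(histogram: dict, decode_addr: int) -> float:
    """Average cycles per instruction, assuming the location
    `decode_addr` is executed exactly once per instruction."""
    instructions = histogram[decode_addr]
    return sum(histogram.values()) / instructions


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    hist = {0x10: 100, 0x20: 250, 0x30: 150}
    labels = {0x10: "decode", 0x20: "compute", 0x30: "stall"}
    print(time_by_activity(hist, labels))
    print(cycles_per_instruction(hist, 0x10))
```

    Dividing each activity's cycle total by the instruction count in the same way would give cycles per average instruction for that activity, which is the kind of figure the abstract describes.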