
    An investigation of memory latency reduction using an address prediction buffer

    Developing memory systems to support high-speed processors is a major challenge for computer architects. Cache memories can improve system performance, but main memory latency remains a major penalty on a cache miss. A novel approach to improving system performance is the use of a memory prediction buffer (MPB), inserted between the cache and main memory. The MPB predicts the next cache-miss address and prefetches the data. The use of an MPB in a computer system is shown to decrease effective main memory latency and increase system performance.
    http://archive.org/details/investigationofm00bill
    Lieutenant, United States Navy. Approved for public release; distribution is unlimited.
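    The abstract does not reproduce the MPB's prediction algorithm, but the idea can be illustrated with a simple stride predictor, a common textbook choice for this kind of buffer. The C sketch below is an illustrative assumption, not the thesis's actual design: a single-entry predictor learns the stride between consecutive miss addresses and prefetches the predicted next one, so later misses in a strided stream are served from the buffer instead of DRAM.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t last_miss;   /* previous cache-miss address        */
static int64_t  stride;      /* last observed miss-to-miss stride  */
static uint64_t prefetched;  /* address the buffer currently holds */
static int      valid;

/* Called on every cache miss: record the new stride and prefetch the
 * predicted next miss address into the buffer. */
static void mpb_on_miss(uint64_t addr)
{
    if (valid)
        stride = (int64_t)(addr - last_miss);
    last_miss  = addr;
    prefetched = addr + (uint64_t)stride;  /* predicted next miss */
    valid      = 1;
    /* In hardware, the line at `prefetched` would now be read from
     * main memory into the buffer in the background. */
}

/* Returns 1 if the buffer already holds the missing address, hiding
 * the main-memory latency of this miss. */
static int mpb_hit(uint64_t addr)
{
    return valid && prefetched == addr;
}

int main(void)
{
    /* A strided miss stream: after the stride is learned on the second
     * miss, every later miss is served from the buffer. */
    for (uint64_t a = 0x1000; a < 0x1400; a += 0x40) {
        printf("miss 0x%llx: %s\n", (unsigned long long)a,
               mpb_hit(a) ? "served by MPB" : "goes to DRAM");
        mpb_on_miss(a);
    }
    return 0;
}
```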

    Impulse: Memory System Support for Scientific Applications


    Randomized cache placement for eliminating conflicts

    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working-set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite. Peer reviewed. Postprint (published version).
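    One standard way to add a pseudorandom element to the index function is to XOR-fold higher bits of the line address into the set index; the paper's specific hash may differ, so the C sketch below is only an illustration of the general technique. It shows why a power-of-two stride that maps every access to one set under conventional modulo indexing gets spread across sets under a hashed index, eliminating the repetitive conflicts.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6                 /* 64-byte lines */
#define SET_BITS  7                 /* 128 sets      */
#define SETS      (1u << SET_BITS)

/* Conventional indexing: the low-order bits above the line offset. */
static unsigned index_modulo(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (SETS - 1));
}

/* Pseudorandom indexing: XOR-fold higher slices of the line address
 * into the set index, so addresses a power-of-two stride apart no
 * longer all collide in the same set. */
static unsigned index_hashed(uint64_t addr)
{
    uint64_t line = addr >> LINE_BITS;
    return (unsigned)((line ^ (line >> SET_BITS) ^ (line >> (2 * SET_BITS)))
                      & (SETS - 1));
}

int main(void)
{
    /* A stride of SETS * line size (8 KB) maps every access to set 0
     * under modulo indexing, but spreads them under hashed indexing. */
    for (int k = 0; k < 8; k++) {
        uint64_t addr = (uint64_t)k * SETS * (1u << LINE_BITS);
        printf("addr 0x%06llx: modulo set %3u, hashed set %3u\n",
               (unsigned long long)addr,
               index_modulo(addr), index_hashed(addr));
    }
    return 0;
}
```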

    Impulse: building a smarter memory controller

    Journal article. Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper we describe the design of the Impulse architecture and show how an Impulse memory system can be used to improve the performance of memory-bound programs. For the NAS conjugate gradient benchmark, Impulse improves performance by 67%. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In addition to scientific applications, we expect that Impulse will benefit regularly strided, memory-bound applications of commercial importance, such as database and multimedia programs.
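    A minimal software model can convey the remapping idea: the controller exposes a dense "shadow" region whose addresses it translates into a strided gather over real memory, so the cache sees one unit-stride stream and every fetched line is packed with useful data. The names and the fixed-stride diagonal gather below are illustrative assumptions about one Impulse use case, not the controller's actual interface.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8

static double matrix[N * N];       /* "real" physical memory, row-major */

typedef struct {
    double *base;                  /* start of the gathered structure   */
    size_t  stride;                /* elements between gathered items   */
} shadow_mapping;

/* Controller-side translation: dense shadow offset -> sparse address. */
static double shadow_read(const shadow_mapping *m, size_t i)
{
    return m->base[i * m->stride];
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            matrix[i * N + j] = i * 10 + j;

    /* Remap the matrix diagonal (stride N+1) into a dense stream. */
    shadow_mapping diag = { matrix, N + 1 };

    /* The application walks the shadow region with unit stride; each
     * cache line of shadow data holds only useful elements. */
    for (size_t i = 0; i < N; i++)
        printf("diag[%zu] = %.0f\n", i, shadow_read(&diag, i));
    return 0;
}
```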

    Effective instruction prefetching via fetch prestaging

    As process technology shrinks and clock rates increase, instruction caches can no longer be accessed in one cycle. The alternatives are implementing smaller caches (with a higher miss rate) or large caches with a pipelined access (with a higher branch misprediction penalty). In both cases, the performance obtained is far from that of an ideal large cache with one-cycle access. In this paper we present cache line guided prestaging (CLGP), a novel mechanism that overcomes the limitations of current instruction cache implementations. CLGP employs prefetching to load future cache lines into a set of fast prestage buffers. These buffers are managed efficiently by the CLGP algorithm, which tries to serve as many fetches from them as possible. The number of fetches served by the main instruction cache is therefore greatly reduced, and so is the negative impact of its access latency on overall performance. With the best CLGP configuration using a 4 KB I-cache, speedups of 3.5% (at 0.09 µm) and 12.5% (at 0.045 µm) are obtained over an equivalent fetch-directed prefetching configuration, and 39% (at 0.09 µm) and 48% (at 0.045 µm) over a pipelined instruction cache without prefetching. Moreover, our results show that CLGP with 2.5 KB of total cache budget can obtain performance similar to a 64 KB pipelined I-cache without prefetching, that is, equivalent performance at 6.4X our hardware budget. Peer reviewed. Postprint (published version).
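    The real CLGP algorithm is guided by cache lines and the predicted fetch stream; the C sketch below substitutes a purely sequential prestaging policy as a simplifying assumption, to show the core mechanism only: lines ahead of the fetch point are charged into a few fast buffers, and fetches are served from those buffers whenever possible instead of from the slower pipelined I-cache.

```c
#include <stdint.h>
#include <stdio.h>

#define NBUF 4                    /* small, one-cycle prestage buffers */

static uint64_t buf_tag[NBUF];
static int      buf_valid[NBUF];
static unsigned buffer_fetches, icache_fetches;

static int in_buffers(uint64_t line)
{
    for (int i = 0; i < NBUF; i++)
        if (buf_valid[i] && buf_tag[i] == line)
            return 1;
    return 0;
}

/* Charge the lines following `line` into the buffers (round-robin).
 * In hardware each of these would be a background I-cache read. */
static void prestage(uint64_t line)
{
    static int next;
    for (int i = 1; i <= NBUF; i++) {
        buf_tag[next]   = line + i;
        buf_valid[next] = 1;
        next = (next + 1) % NBUF;
    }
}

/* Fetch one cache line: serve it from a prestage buffer when possible,
 * otherwise pay the multi-cycle pipelined I-cache access. */
static void fetch_line(uint64_t line)
{
    if (in_buffers(line))
        buffer_fetches++;
    else
        icache_fetches++;
    prestage(line);   /* keep charging the lines ahead of the fetch */
}

int main(void)
{
    for (uint64_t line = 0; line < 64; line++)   /* straight-line code */
        fetch_line(line);
    printf("buffer fetches: %u, I-cache fetches: %u\n",
           buffer_fetches, icache_fetches);
    return 0;
}
```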

    Evaluating the Presence of a Victim Cache on an Arm Processor

    A mobile processor is a CPU designed to save power, found in mobile computers and cell phones. A CPU chip designed for portable computers is typically housed in a smaller chip package, but more importantly, in order to run cooler it uses lower voltages than its desktop counterpart and has more sleep-mode capability. A mobile processor can be throttled down to different power levels, and/or sections of the chip can be turned off entirely when not in use. ARM is a 32-bit reduced instruction set computer (RISC) instruction set architecture (ISA). The relative simplicity of ARM processors makes them suitable for low-power applications; hence ARM processors account for approximately 90% of all mobile 32-bit RISC processors.

    Today, mobile processors are expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors, so memory latencies have a large impact on application execution time. To reduce this impact and serve these kinds of applications, the complexity of ARM processors has increased over the last decade through the inclusion of traditional methods such as multiple instruction issue, out-of-order execution, and large associative caches. Victim caching is another method that can reduce execution time and is currently not incorporated in ARM processors. It was proposed by Norman P. Jouppi in his paper "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers". A victim cache is an extension to a direct-mapped cache that adds a small, secondary, fully associative cache to store cache blocks that have been ejected from the main cache due to a capacity or conflict miss. These ejected blocks are likely to be needed again, so storing them in the secondary cache should increase performance and reduce execution times.

    For this Master's project we therefore re-implemented the SimpleScalar simulator for an ARM processor, incorporating the effect of a victim cache. The re-implemented ARM simulator gave a significant improvement in performance when various applications of the MiBench benchmark suite were run on it: averaged over various Level 1 cache and victim cache configurations, a 1.93% reduction in the number of clock cycles used and a 2.7% increase in the Level 1 cache hit rate. It is also observed that the benefit of the victim cache increases as the size of the Level 1 cache decreases, and that the performance boost obtained in the presence of a victim cache is comparable to that obtained with a large, associative Level 1 cache. Hence, incorporating a victim cache into an ARM processor is highly advantageous for the current generation of mobile processors, as an alternative to a large, associative Level 1 cache.
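    The victim cache mechanism described above is easy to model in software. The C sketch below is a minimal illustration of Jouppi's scheme, with sizes and a FIFO replacement policy chosen as assumptions for the example: a direct-mapped L1 is backed by a small fully associative buffer of evicted lines, and on an L1 miss that hits in the victim cache the two lines swap, converting repeated conflict misses into fast hits.

```c
#include <stdint.h>
#include <stdio.h>

#define L1_SETS   64
#define VC_LINES  4
#define LINE_BITS 6

static uint64_t l1_tag[L1_SETS];  static int l1_valid[L1_SETS];
static uint64_t vc_tag[VC_LINES]; static int vc_valid[VC_LINES];
static int vc_next;               /* FIFO replacement pointer */
static unsigned l1_hits, vc_hits, misses;

static void cache_access(uint64_t addr)
{
    uint64_t line = addr >> LINE_BITS;
    unsigned set  = line % L1_SETS;   /* direct-mapped placement */

    if (l1_valid[set] && l1_tag[set] == line) { l1_hits++; return; }

    for (int i = 0; i < VC_LINES; i++)
        if (vc_valid[i] && vc_tag[i] == line) {
            /* Victim-cache hit: swap the two conflicting lines. */
            uint64_t tmp = l1_tag[set];
            int tv = l1_valid[set];
            l1_tag[set] = line;  l1_valid[set] = 1;
            vc_tag[i] = tmp;     vc_valid[i] = tv;
            vc_hits++;
            return;
        }

    /* Miss in both: eject the L1 line into the victim cache (FIFO). */
    if (l1_valid[set]) {
        vc_tag[vc_next]   = l1_tag[set];
        vc_valid[vc_next] = 1;
        vc_next = (vc_next + 1) % VC_LINES;
    }
    l1_tag[set] = line;  l1_valid[set] = 1;
    misses++;
}

int main(void)
{
    /* Two addresses that conflict in the direct-mapped L1: without the
     * victim cache every access would miss; with it, only the first two. */
    for (int i = 0; i < 8; i++) { cache_access(0x0000); cache_access(0x1000); }
    printf("L1 hits %u, victim hits %u, misses %u\n",
           l1_hits, vc_hits, misses);
    return 0;
}
```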

    Software management techniques for translation lookaside buffers

    Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (leaves 67-70). By Kavita Bala.