
    Improving Cache Hits On Replacement Blocks Using Weighted LRU-LFU Combinations

    Block replacement refers to the process of selecting a block of data or a cache line to be evicted or replaced when a new block needs to be brought into a cache or a memory hierarchy. In computer systems, block replacement policies are used in caching mechanisms, such as CPU caches or disk caches, to determine which blocks are evicted when the cache is full and new data needs to be fetched. The weighted combination of LRU (Least Recently Used) and LFU (Least Frequently Used) is known as the "LFU2" algorithm. LFU2 is an enhanced caching algorithm that aims to leverage the benefits of both LRU and LFU by considering both the recency and the frequency of item access. In LFU2, each item in the cache is associated with two counters: a usage counter and a recency counter. The usage counter tracks how frequently an item is accessed, while the recency counter tracks how recently it was accessed. These counters are used to calculate a combined weight for each item in the cache. Based on the experimental results, the weighted LRU-LFU combination method succeeded in increasing cache hits from 94.8% under LFU and 95.5% under LRU to 96.6%.
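
    The abstract describes the mechanism only in outline, so the following is a minimal Python sketch of a weighted LRU-LFU (LFU2-style) cache. The scoring formula, the weights w_recency and w_frequency, and the saturating frequency term are illustrative assumptions, not the paper's exact combination.

    class WeightedLRULFUCache:
        def __init__(self, capacity, w_recency=0.5, w_frequency=0.5):
            self.capacity = capacity
            self.w_recency = w_recency
            self.w_frequency = w_frequency
            self.clock = 0            # logical time, advanced on every access
            self.entries = {}         # key -> [last_access, access_count, value]

        def _score(self, last_access, count):
            # Higher score = more valuable; the lowest-scoring entry is evicted.
            recency = last_access / self.clock       # close to 1 for recent items
            frequency = count / (count + 1)          # saturating frequency term
            return self.w_recency * recency + self.w_frequency * frequency

        def get(self, key):
            self.clock += 1
            entry = self.entries.get(key)
            if entry is None:
                return None                          # cache miss
            entry[0] = self.clock                    # update recency counter
            entry[1] += 1                            # update usage counter
            return entry[2]

        def put(self, key, value):
            self.clock += 1
            if key in self.entries:
                self.entries[key] = [self.clock, self.entries[key][1] + 1, value]
                return
            if len(self.entries) >= self.capacity:
                victim = min(self.entries,
                             key=lambda k: self._score(*self.entries[k][:2]))
                del self.entries[victim]
            self.entries[key] = [self.clock, 1, value]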

    Principled Approaches to Last-Level Cache Management

    Memory is a critical component of all computing systems and represents a fundamental performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance, and the cost of implementing management techniques would scale together with the size of all different computing systems; unfortunately, this is not the case. With upcoming trends in applications, new memory technologies, and so on, scaling becomes a bigger problem, aggravating the performance bottleneck that memory represents. The memory hierarchy was proposed to alleviate this problem: each level in the hierarchy tends to have a lower cost per bit, a larger capacity, and a higher access time than the level above it. Preferably, all data would be stored in the fastest level of memory; unfortunately, faster memory technologies tend to carry a higher manufacturing cost, which often limits their capacity. The design challenge is to determine which data is frequently used and store it in the faster levels of memory. A cache is a small, fast, on-chip chunk of memory, and any data stored in main memory can be stored in the cache. For many programs, a typical behavior is to access data that has been accessed previously. Taking advantage of this behavior, a copy of frequently accessed data is kept in the cache in order to provide a faster access time the next time it is requested. Due to capacity constraints, it is likely that not all of the frequently reused data can fit in the cache; because of this, cache management policies decide which data is kept in the cache and which in other levels of the memory hierarchy. Under an efficient cache management policy, an encouraging fraction of memory requests will be serviced from a fast on-chip cache. The disparity in access latency between the last-level cache and main memory motivates the search for efficient cache management policies. A great deal of recently proposed work strives to utilize cache capacity in the way most favorable to performance. Related work focuses on optimizing cache performance through different possible solutions, e.g., reducing miss rate, power consumption, storage overhead, or access latency. Our work focuses on improving the performance of last-level caches by designing policies based on principles adapted from other areas of interest. In this dissertation, we focus on several aspects of cache management policies. We first introduce a space-efficient placement and promotion policy whose goal is to minimize the updates to the replacement policy state on each cache access. We further introduce a mechanism that predicts whether a block in the cache will be reused; it feeds different features of a block to the predictor in order to increase the correlation between a previous access and a future access. We finally introduce a technique that tweaks traditional cache indexing, providing fast access to the vast majority of requests in the presence of a slow-access memory technology such as DRAM.
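
    As a concrete illustration of what a "placement and promotion" policy decides, the sketch below shows a single cache set managed by SRRIP-style 2-bit re-reference counters, a known space-efficient alternative to full LRU. This is only a generic example; the dissertation's own policy is not detailed in the abstract.

    class SRRIPSet:
        MAX_RRPV = 3                  # 2-bit re-reference prediction value per line

        def __init__(self, ways):
            self.ways = ways
            self.lines = {}           # tag -> rrpv

        def access(self, tag):
            if tag in self.lines:
                self.lines[tag] = 0                  # promotion: predict near-immediate reuse
                return True                          # hit
            if len(self.lines) >= self.ways:
                # Age all lines until one reaches MAX_RRPV, then evict it.
                while True:
                    victim = max(self.lines, key=self.lines.get)
                    if self.lines[victim] == self.MAX_RRPV:
                        del self.lines[victim]
                        break
                    for t in self.lines:
                        self.lines[t] += 1
            self.lines[tag] = self.MAX_RRPV - 1      # placement: long re-reference interval
            return False                             # miss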

    Evaluating the Presence of a Victim Cache on an Arm Processor

    A mobile processor is a CPU designed to save power; it is found in mobile computers and cell phones. A CPU chip designed for portable computers is typically housed in a smaller chip package, but more importantly, in order to run cooler, it uses lower voltages than its desktop counterpart and has more sleep-mode capability. A mobile processor can be throttled down to different power levels, and/or sections of the chip can be turned off entirely when not in use. ARM is a 32-bit reduced instruction set computer (RISC) instruction set architecture (ISA). The relative simplicity of ARM processors makes them suitable for low-power applications; hence, ARM processors account for approximately 90% of all mobile 32-bit RISC processors. Today, mobile processors are expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, memory latencies have a large impact on the execution time of applications. To reduce this impact and support such applications, the complexity of ARM processors has increased in the last decade through the inclusion of traditional methods like multiple issue of instructions, out-of-order instruction execution, and large, associative caches. Victim caching is another method that can be used to reduce execution time and is currently not incorporated in ARM processors. This method was proposed by Norman P. Jouppi in his paper “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers”. A victim cache is an extension to a direct-mapped cache that adds a small, secondary, fully associative cache to store cache blocks that have been ejected from the main cache due to a capacity or conflict miss. These ejected blocks are likely to be needed again, so storing them in the secondary cache should increase performance and reduce execution times. Therefore, for this Master's project we re-implemented the SimpleScalar simulator for an ARM processor by incorporating a victim cache. This re-implementation of the ARM simulator gave a significant improvement in performance when various applications of the MiBench benchmark suite were run on it. On average, across various Level 1 cache and victim cache configurations, a 1.93% reduction in the number of clock cycles used and a 2.7% increase in the Level 1 cache hit rate are observed. It is also observed that the benefit of the victim cache increases as the size of the Level 1 cache decreases, and the performance boost obtained by the processor in the presence of a victim cache is comparable to the performance obtained when a large, associative Level 1 cache is used. Hence, incorporating a victim cache into an ARM processor is highly advantageous for the current generation of mobile processors compared to using a large, associative Level 1 cache.
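
    The following Python sketch shows the victim-cache mechanism described above: a direct-mapped L1 backed by a small fully associative victim buffer, with blocks swapped back into L1 on a victim hit. The sizes and the FIFO-ordered victim buffer are illustrative; they are not the project's SimpleScalar configuration.

    from collections import OrderedDict

    class DirectMappedWithVictimCache:
        def __init__(self, l1_sets, victim_entries=4):
            self.l1_sets = l1_sets
            self.l1 = {}                              # set index -> tag
            self.victim = OrderedDict()               # (tag, index) -> True, oldest first
            self.victim_entries = victim_entries

        def access(self, block_addr):
            index = block_addr % self.l1_sets
            tag = block_addr // self.l1_sets
            if self.l1.get(index) == tag:
                return "l1_hit"
            if (tag, index) in self.victim:
                # Victim hit: swap the buffered block with the conflicting L1 block.
                del self.victim[(tag, index)]
                if index in self.l1:
                    self._insert_victim((self.l1[index], index))
                self.l1[index] = tag
                return "victim_hit"
            # Miss in both: the displaced L1 block moves into the victim cache.
            if index in self.l1:
                self._insert_victim((self.l1[index], index))
            self.l1[index] = tag
            return "miss"

        def _insert_victim(self, entry):
            if len(self.victim) >= self.victim_entries:
                self.victim.popitem(last=False)       # evict the oldest victim entry
            self.victim[entry] = True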

    Exploring Alternate Cache Indexing Techniques

    Cache memory is a bridging component that covers the increasing gap between the speed of the processor and main memory. Excellent cache performance is crucial to improving system performance. Conflict misses are one of the critical factors that limit cache performance: many blocks map to the same set, which results in the eviction of many blocks, while other sets receive few mappings, so the available space is not efficiently utilized. A direct way to reduce conflict misses is to increase associativity, but this comes at the cost of an increase in hit time. Another way to reduce conflict misses is to change the cache-indexing scheme and distribute accesses across all sets. This thesis focuses on the second approach and aims to evaluate the impact of a matrix-based indexing scheme on cache performance against the traditional modulus-based indexing scheme. A correlation between the proposed indexing scheme and different cache replacement policies is also observed. The matrix-based indexing scheme yields a geometric mean speedup of 1.2% on the SPEC CPU2017 benchmarks for single-core simulations when applied to a direct-mapped last-level cache. In this case, improvements of 1.5% and 4% are observed for at least eighteen and seven of the SPEC CPU2017 applications, respectively. It also yields a 2% performance improvement over sixteen SPEC CPU2006 benchmarks. The new indexing scheme correlates well with multiperspective reuse prediction, and it is observed that LRU benefits a machine learning benchmark by 5.1%. For multicore simulations, the new indexing scheme does not improve performance significantly; however, it also does not impact applications' performance negatively.
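
    For contrast, the sketch below computes a set index the conventional modulus way and via a matrix-based scheme, where each index bit is the XOR (a GF(2) dot product) of a subset of block-address bits. The particular randomly chosen matrix is an assumption for illustration; the thesis evaluates a specific matrix that the abstract does not reproduce.

    import random

    NUM_SETS = 64          # 2^6 sets
    INDEX_BITS = 6
    ADDR_BITS = 32

    def modulus_index(block_addr):
        return block_addr % NUM_SETS

    random.seed(0)
    # One row per index bit: each index bit is the parity (XOR) of the
    # address bits selected by that row.
    MATRIX = [random.sample(range(ADDR_BITS), 8) for _ in range(INDEX_BITS)]

    def matrix_index(block_addr):
        index = 0
        for bit, row in enumerate(MATRIX):
            parity = 0
            for pos in row:
                parity ^= (block_addr >> pos) & 1
            index |= parity << bit
        return index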

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLC while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache, which performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC, which considers the GPU’s latency tolerance and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system-architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
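
    As a rough sketch of the ReMAP idea as stated above (evict lines with no expected reuse, otherwise lines that are cheap to refetch), the function below selects a victim given two predictors. Both predictors are placeholder inputs, not the dissertation's actual mechanisms.

    def select_victim(set_lines, predicts_reuse, memory_access_cost):
        # set_lines: tags resident in one LLC set
        # predicts_reuse(tag) -> bool; memory_access_cost(tag) -> float
        # (e.g., lower for addresses expected to hit an open DRAM row)
        dead = [t for t in set_lines if not predicts_reuse(t)]
        if dead:
            return dead[0]                        # no-reuse lines are evicted first
        return min(set_lines, key=memory_access_cost)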

    Speculative Techniques for Memory Hierarchy Management

    The “Memory Wall” [1] is the gap in performance between the processor and the main memory. Over the last 30 years, computer architects have added multiple levels of cache to fill this gap; cache levels that are closer to the processor are smaller and faster, while the levels that are farther from the processor are bigger and slower. However, the processor is still exposed to the latency of DRAM on misses. Therefore, speculative memory management techniques such as prefetching are used in modern microprocessors to bridge this gap in performance. First, we propose Synchronization-aware Hardware Prefetching for Chip Multiprocessors, a novel hardware data prefetching scheme designed for prefetching shared-memory, multi-threaded workloads. This is the first work we are aware of to characterize the causes of poor prefetching performance in shared-memory multi-threaded applications: the inability to prefetch beyond synchronization points and the tendency to prefetch shared data before it has been written. We propose SB-Fetch, a low-complexity, low-overhead prefetcher design that addresses both issues. Second, we propose a new prefetching algorithm, Set-Level Adaptive Prefetching for Compressed Caches (SLAP-CC), which seeks to address this problem by varying the prefetching aggressiveness based on how much effective capacity is available in each set. The contributions of this work are to characterize the increase and per-set variability of cache efficiency that typical cache compression schemes create, and to propose a new prefetching scheme, SLAP-CC, designed to leverage this cache efficiency variability. Third, we propose a new scheduling mechanism that predicts hard-to-prefetch loads at issue time and preemptively schedules them for execution as soon as they are ready, allowing the cache hierarchy to start the miss-handling mechanism sooner. Such a scheduling mechanism reduces the miss penalty for the instructions that depend on a hard-to-prefetch load.
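
    The sketch below illustrates the SLAP-CC intuition from the abstract with a toy next-line prefetcher whose degree grows in sets where compression has packed in more lines than the nominal associativity. The degree schedule and the "effective ways" input are assumptions, not the proposed design.

    BASE_DEGREE = 1
    MAX_DEGREE = 4
    BLOCK_SIZE = 64

    def prefetch_candidates(miss_addr, set_effective_ways, set_physical_ways):
        # Issue more next-line prefetches when the target set currently holds
        # more (compressed) lines than its nominal number of ways.
        extra = max(0, set_effective_ways - set_physical_ways)
        degree = min(MAX_DEGREE, BASE_DEGREE + extra)
        return [miss_addr + i * BLOCK_SIZE for i in range(1, degree + 1)]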
