    ReD: A Policy Based on Reuse Detection for Demanding Block Selection in Last-Level Caches

    In this paper, we propose a new block selection policy for Last-Level Caches (LLCs) that decides, based on reuse detection, whether or not a block coming from main memory is inserted in the LLC. The proposed policy, called ReD, is demanding in the sense that blocks bypass the LLC unless their expected reuse behavior matches specific requirements, related either to their recent reuse history or to the behavior of the associated instructions. Generally, blocks are stored in the LLC only the second time they are requested within a limited time window. Secondarily, some blocks enter the LLC on the first request if their requesting instruction has a history of requesting highly-reused blocks. ReD includes two table structures that track, measure, and correlate reuse for specific block addresses and requesting program counters within a constrained storage budget. It can be implemented on top of any base replacement algorithm. Other parts of the base replacement policy, such as promotion or victim selection, can remain unchanged, enabling our policy to work alongside many state-of-the-art replacement algorithms.
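    The two-table mechanism lends itself to a compact software model. Below is a minimal sketch of a ReD-style bypass filter, assuming illustrative parameters (a 4096-entry address window, 2-bit PC confidence counters) rather than the paper's tuned design; ReDFilter, WINDOW, and PC_REUSE_THRESHOLD are hypothetical names.

```python
# Minimal sketch of a ReD-style bypass filter (illustrative, not the paper's
# exact design). One table remembers recently requested block addresses; the
# other tracks, per requesting PC, how often its blocks see a second request.
from collections import OrderedDict

WINDOW = 4096           # recently-seen block addresses tracked (assumed size)
PC_REUSE_THRESHOLD = 2  # assumed confidence needed to insert on first touch

class ReDFilter:
    def __init__(self):
        self.recent = OrderedDict()  # block address -> first requesting PC
        self.pc_reuse = {}           # PC -> saturating reuse counter (0..3)

    def should_insert(self, block_addr, pc):
        """Called on each LLC fill from memory: True = insert, False = bypass."""
        if block_addr in self.recent:
            # Second request inside the window: reuse detected, insert.
            first_pc = self.recent.pop(block_addr)
            self.pc_reuse[first_pc] = min(self.pc_reuse.get(first_pc, 0) + 1, 3)
            return True
        # First request: remember it; insert now only if this PC has a
        # history of requesting highly-reused blocks.
        self.recent[block_addr] = pc
        if len(self.recent) > WINDOW:
            _, old_pc = self.recent.popitem(last=False)
            # Fell out of the window untouched: weak negative evidence
            # against its PC (an assumed training rule, not from the paper).
            self.pc_reuse[old_pc] = max(self.pc_reuse.get(old_pc, 0) - 1, 0)
        return self.pc_reuse.get(pc, 0) >= PC_REUSE_THRESHOLD
```

    Because the filter only gates insertion, promotion and victim selection in the underlying replacement policy stay untouched, which is what lets ReD sit on top of an arbitrary base algorithm.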

    Jenga: Harnessing Heterogeneous Memories through Reconfigurable Cache Hierarchies

    Conventional memory systems are organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. A hierarchy lets a simple, fixed design benefit a wide range of applications, because working sets settle at the smallest (and fastest) level they fit in. However, rigid hierarchies also incur significant overheads, because each level adds latency and energy even when it does not capture the working set. In emerging systems with heterogeneous memory technologies such as stacked DRAM, these overheads often limit performance and efficiency. We propose Jenga, a reconfigurable cache hierarchy that avoids these pathologies and approaches the performance of a hierarchy optimized for each application. Jenga monitors application behavior and dynamically builds virtual cache hierarchies out of heterogeneous, distributed cache banks, using simple hardware support and a novel software runtime to configure them. On a 36-core CMP with a 1 GB stacked-DRAM cache, Jenga outperforms a combination of state-of-the-art techniques by 10% on average and by up to 36%, and does so while saving energy, improving system-wide energy-delay product by 29% on average and by up to 96%.
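    To make the reconfiguration decision concrete, here is a hedged sketch of how a Jenga-like runtime might choose among virtual hierarchies: score a few candidate level arrangements against the application's miss curve and keep the one with the lowest estimated average access time. The latencies, capacities, candidate set, and the amat/pick_hierarchy helpers are illustrative assumptions, not the paper's actual allocator.

```python
# Toy model of virtual-hierarchy selection. A miss curve maps cache capacity
# to global miss ratio; candidates combine a small SRAM level and a large
# stacked-DRAM level. All numbers are assumed for illustration.
SRAM_LAT, DRAM_CACHE_LAT, MEM_LAT = 10, 40, 200  # cycles

def amat(levels, miss_curve):
    """levels: [(capacity_mb, latency), ...] from fastest to largest."""
    total, reach, size = 0.0, 1.0, 0
    for cap, lat in levels:
        size += cap
        total += reach * lat      # accesses reaching this level pay its latency
        reach = miss_curve(size)  # the rest fall through to the next level
    return total + reach * MEM_LAT

def pick_hierarchy(miss_curve):
    candidates = [
        [(1, SRAM_LAT)],                          # single small, fast level
        [(256, DRAM_CACHE_LAT)],                  # single large stacked-DRAM level
        [(1, SRAM_LAT), (256, DRAM_CACHE_LAT)],   # two-level virtual hierarchy
    ]
    return min(candidates, key=lambda lv: amat(lv, miss_curve))

# A working set of ~32 MB makes the single stacked-DRAM level win: the small
# SRAM level would only add latency without capturing the working set.
print(pick_hierarchy(lambda mb: 0.02 if mb >= 32 else 0.9))
```

    This captures the abstract's point that a level only earns its keep when it actually captures a working set.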

    Bridging Theory and Practice in Cache Replacement

    Much prior work has studied processor cache replacement policies, but a large gap remains between theory and practice. The optimal policy (MIN) requires unobtainable knowledge of the future, and prior theoretically-grounded policies use reference models that do not match real programs. Meanwhile, practical policies are designed empirically; lacking a strong theoretical foundation, they do not make the best use of the information available to them. This paper bridges theory and practice. We propose that practical policies should replace lines based on their economic value added (EVA), the difference between their expected hits and the average. Using Markov decision processes, we show that EVA is optimal under some reasonable simplifications. We present an inexpensive, practical implementation of EVA and evaluate it exhaustively over many cache sizes. EVA outperforms prior practical policies and saves area at iso-performance. These results show that formalizing cache replacement yields practical benefits.
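    The central quantity is simple to state: a line's EVA at age a is its expected hits minus a "rent", the cache's average hit rate per line (g) charged over the line's expected remaining lifetime, i.e. EVA(a) = h(a) - g * L(a). Here is a back-of-the-envelope sketch of the single-class formulation, assuming hit and eviction age histograms collected by a sampler (the paper's hardware computes this incrementally):

```python
def eva_per_age(hits, evictions):
    """hits[a] / evictions[a]: sampled counts of lines that hit or were
    evicted at age a. Returns EVA(a) = h(a) - g * L(a) for every age."""
    ages = len(hits)
    total_life = sum(a * (hits[a] + evictions[a]) for a in range(ages))
    g = sum(hits) / total_life  # average hits per line per unit time: the "rent"

    eva = [0.0] * ages
    N = S = H = 0.0  # lines reaching age >= a, their summed remaining life, their hits
    for a in range(ages - 1, -1, -1):
        S += N                        # each survivor lives one step longer, seen from a
        N += hits[a] + evictions[a]
        H += hits[a]
        eva[a] = (H - g * S) / N if N else 0.0
    return eva
```

    Replacement then evicts the candidate whose current age has the lowest EVA: a line whose expected hits no longer cover its rent is not worth its cache space.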

    Pseudo Re-Reference Interval Prediction: A Binary-Tree-Based Cache Replacement Policy

    For the last decade, many modern replacement policies for the last-level cache (LLC) have adopted Static Re-Reference Interval Prediction (SRRIP) as their base algorithm. In the LLC, SRRIP outperforms traditional replacement policies such as Least-Recently Used (LRU). SRRIP maintains a small counter per line, called the Re-Reference Prediction Value (RRPV), but we find that these RRPVs can instead be implemented with a binary tree. In this thesis, we propose a new cache replacement policy, Pseudo Re-Reference Interval Prediction (PRRIP). PRRIP mimics SRRIP, so it likewise outperforms replacement policies such as LRU. Moreover, we find that the binary tree makes PRRIP more resistant to non-temporal data access patterns than SRRIP. In terms of overhead, PRRIP halves the hardware cost of SRRIP. Our experimental results show that PRRIP achieves a 1.26% speedup over LRU, while SRRIP achieves 0.53% over LRU, for single-core workloads. For multi-core workloads, the performance difference between PRRIP and SRRIP is less than 0.3%.
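    For reference, the SRRIP baseline that the thesis builds on fits in a few lines; per the abstract, PRRIP's contribution is to approximate these per-line RRPV counters with a binary tree (much as tree-PLRU approximates true LRU), halving the metadata cost. The sketch below shows conventional 2-bit-RRPV SRRIP for one set, not the tree encoding itself.

```python
RRPV_MAX = 3  # 2-bit Re-Reference Prediction Value

class SRRIPSet:
    """One cache set under SRRIP (hit-priority variant)."""
    def __init__(self, ways):
        self.rrpv = [RRPV_MAX] * ways  # start as "distant re-reference"

    def victim(self):
        # Evict a line predicted for distant re-reference; if none exists,
        # age every line and retry (the loop always terminates).
        while True:
            for way, v in enumerate(self.rrpv):
                if v == RRPV_MAX:
                    return way
            self.rrpv = [v + 1 for v in self.rrpv]

    def on_hit(self, way):
        self.rrpv[way] = 0             # predict near-immediate re-reference

    def on_fill(self, way):
        self.rrpv[way] = RRPV_MAX - 1  # insert with a long re-reference interval
```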

    Perceptron Learning in Cache Management and Prediction Techniques

    Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. An efficient prefetcher must identify complex memory access patterns during program execution, so that it can read a block ahead of its demand access and thereby prevent a cache miss. Prefetcher performance is characterized by two metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses that the prefetcher brings into the cache, and accuracy, the fraction of prefetches that are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy, harming performance by wasting resources such as cache capacity and bandwidth; an ideal prefetcher has both high coverage and high accuracy. In this thesis, I propose Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by a baseline prefetcher without hurting accuracy. PPF enables more aggressive tuning of a given baseline prefetcher by filtering out the growing number of inaccurate prefetches such a tuning implies. I also explore a range of features for training PPF's perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration and by 11.4% for a 4-core configuration, compared to the baseline prefetcher alone.
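    The filtering step reduces to a hashed perceptron: each candidate prefetch is described by a feature vector, each feature indexes its own small weight table, and the prefetch is issued only if the weight sum clears a threshold; feedback on whether issued prefetches were eventually demanded trains the weights. The features, table sizes, and thresholds below are placeholders, not the thesis's tuned configuration.

```python
TABLE_SIZE = 1024   # entries per feature table (assumed)
TAU_PREFETCH = 0    # issue the prefetch if the weight sum clears this
TAU_TRAIN = 32      # keep training while |sum| is below this (perceptron rule)
W_MAX = 31          # 6-bit saturating weights (assumed width)

class PerceptronFilter:
    def __init__(self, num_features=4):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]

    def _indices(self, features):
        # features: e.g. (pc, page_offset, delta, ...) of the candidate prefetch
        return [hash((i, f)) % TABLE_SIZE for i, f in enumerate(features)]

    def predict(self, features):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return s >= TAU_PREFETCH, s   # (issue?, confidence sum kept for training)

    def train(self, features, was_useful, s):
        # Update on mispredictions and on low-confidence correct predictions.
        if ((s >= TAU_PREFETCH) != was_useful) or abs(s) < TAU_TRAIN:
            step = 1 if was_useful else -1
            for t, i in zip(self.tables, self._indices(features)):
                t[i] = max(-W_MAX, min(W_MAX, t[i] + step))
```

    This split of responsibilities is the point of the thesis: the baseline prefetcher can be tuned aggressively for coverage while the perceptron recovers the lost accuracy.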

    DSPatch: Dual Spatial Pattern Prefetcher

    High main memory latency continues to limit the performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly thanks to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5), and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well at converting higher DRAM bandwidth into higher performance. Prefetchers need the ability to adapt dynamically to the available bandwidth, boosting prefetch count and coverage when headroom exists and throttling down for high accuracy when bandwidth utilization is close to peak. To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch), which can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to (1) represent program accesses to a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns, one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run time, based on DRAM bandwidth utilization, to generate prefetches. Across a diverse set of workloads, using only 3.6 KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads, and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled.
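    The dual-pattern idea maps naturally onto two bitwise reductions: OR-ing the page footprints observed for a signature yields the coverage-biased pattern, AND-ing them yields the accuracy-biased one, and measured DRAM bandwidth utilization selects between them. A simplified sketch follows, omitting DSPatch's signature formation, pattern aging, and quantized bandwidth buckets; the 0.5 cutoff is an assumed placeholder.

```python
LINES_PER_PAGE = 64  # 4 KB page of 64 B cache lines

class DSPatchEntry:
    def __init__(self, pattern):
        self.cov = pattern  # coverage-biased: grows by OR-ing footprints
        self.acc = pattern  # accuracy-biased: shrinks by AND-ing footprints

    def learn(self, pattern):
        self.cov |= pattern  # union: anything ever accessed with this signature
        self.acc &= pattern  # intersection: only offsets that always recur

    def predict(self, bw_utilization):
        # Bandwidth headroom -> chase coverage; near peak -> protect accuracy.
        return self.cov if bw_utilization < 0.5 else self.acc

def pattern_from_offsets(offsets):
    """Bit-pattern of cache-line offsets touched in a page, rotated so the
    first ("trigger") access sits at bit 0, as the abstract describes."""
    base = offsets[0]
    bits = 0
    for o in offsets:
        bits |= 1 << ((o - base) % LINES_PER_PAGE)
    return bits
```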