111 research outputs found

    Reuse Detector: Improving the management of STT-RAM SLLCs

    Get PDF
    Various constraints of Static Random Access Memory (SRAM) are leading to consider new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. those blocks referenced several times manifest high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, result in low efficient behavior. In this paper, we employ a cache management mechanism that selects the contents of the SLLC aimed to exploit reuse locality instead of temporal locality. Specifically, our proposal consists in the inclusion of a Reuse Detector (RD) between private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads in quad-core, eight-core and 16-core systems, reveals that our scheme reports on average, energy reductions in the SLLC in the range of 37–30%, additional energy savings in the main memory in the range of 6–8% and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management, reporting—depending on the specific scenario and the kind of applications used—SLLC energy savings in the range of 4–11% higher than those of DASCA, delivering higher performance in the range of 1.5–14% and additional improvements in DRAM energy consumption in the range of 2–9% higher than DASCA

    Reuse Detector: improving the management of STT-RAM SLLCs

    Get PDF
    Various constraints of Static Random Access Memory (SRAM) are leading to consider new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. those blocks referenced several times manifest high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, result in low efficient behavior. In this paper, we employ a cache management mechanism that selects the contents of the SLLC aimed to exploit reuse locality instead of temporal locality. Specifically, our proposal consists in the inclusion of a Reuse Detector (RD) between private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads in quad-core, eight-core and 16-core systems, reveals that our scheme reports on average, energy reductions in the SLLC in the range of 37–30%, additional energy savings in the main memory in the range of 6–8% and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management, reporting—depending on the specific scenario and the kind of applications used—SLLC energy savings in the range of 4–11% higher than those of DASCA, delivering higher performance in the range of 1.5–14% and additional improvements in DRAM energy consumption in the range of 2–9% higher than DASCA.Peer ReviewedPostprint (author's final draft

    Learning from the Success of MPI

    Full text link
    The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model.Comment: 12 pages, 1 figur

    The ZCache: Decoupling Ways and Associativity

    Full text link
    Abstract—The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically im-proved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior than that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways. I

    ReD: A reuse detector for content selection in exclusive shared last-level caches

    Get PDF
    The reference stream reaching a chip multiprocessor Shared Last-Level Cache (SLLC) shows poor temporal locality, making conventional cache management policies inefficient. Few proposals address this problem for exclusive caches. In this paper, we propose the Reuse Detector (ReD), a new content selection mechanism for exclusive hierarchies that leverages reuse locality at the SLLC, a property that states that blocks referenced more than once are more likely to be accessed in the near future. Being placed between each L2 private cache and the SLLC, ReD prevents the insertion of blocks without reuse into the SLLC. It is designed to overcome problems affecting similar recent mechanisms (low accuracy, reduced visibility window and detector thrashing). ReD improves performance over other state-of-the-art proposals (CHAR, Reuse Cache and EAF cache). Compared with the baseline system with no content selection, it reduces the SLLC miss rate (MPI) by 10.1% and increases harmonic IPC by 9.5%.Peer ReviewedPostprint (author's final draft

    Near-optimal replacement policies for shared caches in multicore processors

    Get PDF
    An optimal replacement policy that minimizes the miss rate in a private cache was proposed several decades ago. It requires knowing the future access sequence the cache will receive. There is no equivalent for shared caches because replacement decisions alter this future sequence. We present a novel near-optimal policy for minimizing the miss rate in a shared cache that approaches the optimal execution iteratively. During each iteration, the future access sequence is reconstructed on every miss interleaving the future per-core sequences, taken from the previous iteration. This single sequence feeds a classical private-cache optimum replacement policy. Our evaluation on a shared last-level cache shows that our proposal iteratively converges to a near-optimal miss rate that is independent of the initial conditions, within a margin of 0.1%. The best state-of-the-art online policies achieve around 65% of the miss rate reduction obtained by our near-optimal proposal. In a shared cache, miss rate optimization does not imply the optimization of other metrics. Therefore, we also propose a new near-optimal policy to maximize fairness between cores. The best state-of-the-art online policy achieves 60% of the improvement in fairness seen with our near-optimal policy. Our proposals are useful both for setting upper performance bounds and inspiring implementable mechanisms for shared caches.The authors acknowledge support from grants (1) PID2019-105660RB-C21 and PID2019-107255GB-C22 from Agencia Estatal de Investigación (AEI) from Spain and European Regional Development Fund (ERDF); (2) gaZ: T58_20R research group from Aragón Government and European Social Fund (ESF); and (3) 2014-2020 "Construyendo Europa desde Aragón" from European Regional Development Fund (ERDF).Peer ReviewedPostprint (author's final draft

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    Crystal gazer : profile-driven write-rationing garbage collection for hybrid memories

    Get PDF
    Non-volatile memories (NVM) offer greater capacity than DRAM but suffer from high latency and low write endurance. Hybrid memories combine DRAM and NVM to form scalable memory systems with the promise of high capacity, low energy consumption, and high endurance. Automatically managing hybrid NVM-DRAM memories to achieve their promise without changing user applications or their programming models remains an open question. This paper uses garbage collection in managed languages to exploit NVM capacity while preventing NVM wear out in hybrid memories with no changes to the programming model. We introduce profile-driven write-rationing garbage collection. Allocation sites that produce frequently written objects are predicted based on previous program executions. Objects are initially allocated in a DRAM nursery space. The collector copies surviving nursery objects from highly written sites to a mature DRAM space and read-mostly objects to a mature NVM space.Write-intensity prediction for 15 Java benchmarks accurately places objects in the correct space, eliminating expensive object monitoring from prior write-rationing garbage collectors. Furthermore, our technique exposes a Pareto tradeoff between DRAM usage and NVM lifetime, unlike prior work. Experimental results on NUMA hardware that emulates hybrid NVM-DRAM memory demonstrates that profile-driven write-rationing garbage collection reduces the number of writes to NVM compared to prior work to extend its lifetime, maximizes the use of NVM for its capacity, and achieves good performance

    Doctor of Philosophy

    Get PDF
    dissertationIn recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence available off-chip bandwidth, will at best increase at sublinear rate and at worst, stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM) have begun to emerge and will be integrated into the memory hierarchy sooner than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform access cache (NUCA), we must figure out the optimal bank. Similarly, in a multi-dual inline memory module (DIMM) non uniform memory access (NUMA) main memory, we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data. In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" In this dissertation we argue for a hardware-software codesign approach to tackle the above mentioned problems at different levels of the memory hierarchy. The proposed methods utilize techniques like page coloring and shadow addresses and are able to handle a large number of problems ranging from managing wire-delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures
    • …
    corecore