5 research outputs found

    Reuse Detector: improving the management of STT-RAM SLLCs

    Get PDF
    Various constraints of Static Random Access Memory (SRAM) are leading to consider new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. those blocks referenced several times manifest high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, result in low efficient behavior. In this paper, we employ a cache management mechanism that selects the contents of the SLLC aimed to exploit reuse locality instead of temporal locality. Specifically, our proposal consists in the inclusion of a Reuse Detector (RD) between private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads in quad-core, eight-core and 16-core systems, reveals that our scheme reports on average, energy reductions in the SLLC in the range of 37–30%, additional energy savings in the main memory in the range of 6–8% and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management, reporting—depending on the specific scenario and the kind of applications used—SLLC energy savings in the range of 4–11% higher than those of DASCA, delivering higher performance in the range of 1.5–14% and additional improvements in DRAM energy consumption in the range of 2–9% higher than DASCA.Peer ReviewedPostprint (author's final draft

    Reuse Detector: Improving the management of STT-RAM SLLCs

    Get PDF
    Various constraints of Static Random Access Memory (SRAM) are leading to consider new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings by leveraging a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. those blocks referenced several times manifest high probability of forthcoming reuse. As such, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, result in low efficient behavior. In this paper, we employ a cache management mechanism that selects the contents of the SLLC aimed to exploit reuse locality instead of temporal locality. Specifically, our proposal consists in the inclusion of a Reuse Detector (RD) between private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads in quad-core, eight-core and 16-core systems, reveals that our scheme reports on average, energy reductions in the SLLC in the range of 37–30%, additional energy savings in the main memory in the range of 6–8% and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management, reporting—depending on the specific scenario and the kind of applications used—SLLC energy savings in the range of 4–11% higher than those of DASCA, delivering higher performance in the range of 1.5–14% and additional improvements in DRAM energy consumption in the range of 2–9% higher than DASCA

    Vertical Memory Optimization for High Performance Energy-efficient GPU

    Get PDF
    GPU heavily relies on massive multi-threading to achieve high throughput. The massive multi-threading imposes tremendous pressure on different storage components. This dissertation focuses on the optimization of memory subsystem including register file, L1 data cache and device memory, all of which are featured by the massive multi-threading and dominate the efficiency and scalability of GPU. A large register file is demanded in GPU for supporting fast thread switching. This dissertation first introduces a power-efficient GPU register file built on the newly emerged racetrack memory (RM). However, the shift operators of RM results in extra power and timing overhead. A holistic architecture-level technology set is developed to conquer the adverse impacts and guarantees its energy merit. Experiment results show that the proposed techniques can keep GPU performance stable compared to the baseline with SRAM based RF. Register file energy is significantly reduced by 48.5%. This work then proposes a versatile warp scheduler (VWS) to reduce the L1 data cache misses in GPU. VWS retains the intra-warp cache locality with a simple yet effective per-warp working set estimator, and enhances intra- and inter-thread-block cache locality using a thread block aware scheduler. VWS achieves on average 38.4% and 9.3% IPC improvement compared to a widely-used and a state-of-the-art warp schedulers, respectively. At last this work targets the off-chip DRAM based device memory. An integrated architecture substrate is introduced to improve the performance and energy efficiency of GPU through the efficient bandwidth utilization. The first part of the architecture substrate, thread batch enabled memory partitioning (TEMP) improves memory access parallelism. TEMP introduces thread batching to separate the memory access streams from SMs. The second part, Thread batch-aware scheduler (TBAS) is then designed to improve memory access locality. Experimental results show that TEMP and TBAS together can obtain up to 10.3% performance improvement and 11.3% DRAM energy reduction for GPU workloads

    Improving Processor Design by Exploiting Performance Variance

    Get PDF
    Programs exhibit significant performance variance in their access to microarchitectural structures. There are three types of performance variance. First, semantically equivalent programs running on the same system can yield different performance due to characteristics of microarchitectural structures. Second, program phase behavior varies significantly. Third, different types of operations on microarchitectural structure can lead to different performance. In this dissertation, we explore the performance variance and propose techniques to improve the processor design. We explore performance variance caused by microarchitectural structures and propose program interferometry, a technique that perturbs benchmark executables to yield a wide variety of performance points without changing program semantics or other important execution characteristics such as the number of retired instructions. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a microarchitectural optimization optimization and not the rest of the microarchitecture. We explore performance variance caused by phase changes and develop prediction-driven last-level cache (LLC) writeback techniques. We propose a rank idle time prediction driven LLC writeback technique and a last-write prediction driven LLC writeback technique. These techniques improve performance by reducing the write-induced interference. We explore performance variance caused by different types of operations to Non-Volatile Memory (NVM) and propose LLC management policies to reduce write overhead of NVM.We propose an adaptive placement and migration policy for an STT-RAM-based hybrid cache and writeback aware dynamic cache management for NVM-based main memory system. These techniques reduce write latency and write energy, thus leading to performance improvement and energy reduction
    corecore