95 research outputs found

    HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems

    Get PDF
    Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy still remain challenges of implementing STT-RAMs in the CPU cache. An increasingly popular method for addressing this challenge involves trading off the non-volatility for reduced write speed and write energy by relaxing the STT-RAM's data retention time. However, in order to maximize energy saving potential, the cache configurations, including STT-RAM's retention time, must be dynamically adapted to executing applications' variable memory needs. In this paper, we propose a highly adaptable last level STT-RAM cache (HALLS) that allows the LLC configurations and retention time to be adapted to applications' runtime execution requirements. We also propose low-overhead runtime tuning algorithms to dynamically determine the best (lowest energy) cache configurations and retention times for executing applications. Compared to prior work, HALLS reduced the average energy consumption by 60.57% in a quad-core system, while introducing marginal latency overhead.Comment: To Appear on IEEE Transactions on Computers (TC

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    Full text link
    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to use efficiently the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performances, that is, half of the cache lines are read-friendly, while the other is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and performance/energy advantages of SLC. Our experiments show an improvement of 43% in total numbers of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache

    Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories

    Full text link
    Modern computing systems are embracing hybrid memory comprising of DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that it has NVM access latency much higher than DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic components on a long bitline are a major source of high latency in both DRAM and NVM, and a significant factor contributing to high-voltage operations in NVM, which impact their reliability. We propose an architectural change, where each long bitline in DRAM and NVM is split into two segments by an isolation transistor. One segment can be accessed with lower latency and operating voltage than the other. By introducing tiers, we enable non-uniform accesses within each memory type (which we call intra-memory asymmetry), leading to performance and reliability trade-offs in DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS in three ways. First, we exploit both inter- and intra-memory asymmetries to allocate and migrate memory pages between the tiers in DRAM and NVM. Second, we improve the OS's page allocation decisions by predicting the access intensity of a newly-referenced memory page in a program and placing it to a matching tier during its initial allocation. This minimizes page migrations during program execution, lowering the performance overhead. Third, we propose a solution to migrate pages between the tiers of the same memory without transferring data over the memory channel, minimizing channel occupancy and improving performance. Our overall approach, which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid tiered memory improves both performance and reliability for both single-core and multi-programmed workloads.Comment: 15 pages, 29 figures, accepted at ACM SIGPLAN International Symposium on Memory Managemen

    Evaluation of STT-MRAM main memory for HPC and real-time systems

    Get PDF
    It is questionable whether DRAM will continue to scale and will meet the needs of next-generation systems. Therefore, significant effort is invested in research and development of novel memory technologies. One of the candidates for nextgeneration memory is Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM). STT-MRAM is an emerging non-volatile memory with a lot of potential that could be exploited for various requirements of different computing systems. Being a novel technology, STT-MRAM devices are already approaching DRAM in terms of capacity, frequency and device size. Special STT-MRAM features such as intrinsic radiation hardness, non-volatility, zero stand-by power and capability to function in extreme temperatures also make it particularly suitable for aerospace, avionics and automotive applications. Despite of being a conceivable alternative for main memory technology, to this day, academic research of STT-MRAM main memory remains marginal. This is mainly due to the unavailability of publicly available detailed timing parameters of this novel technology, which are required to perform a cycle accurate main memory simulation. Some researchers adopt simplistic memory models to simulate main memory, but such models can introduce significant errors in the analysis of the overall system performance. Therefore, detailed timing parameters are a must-have for any evaluation or architecture exploration study of STT-MRAM main memory. These detailed parameters are not publicly available because STT-MRAM manufacturers are reluctant to release any delicate information on the technology. This thesis demonstrates an approach to perform a cycle accurate simulation of STT-MRAM main memory, being the first to release detailed timing parameters of this technology from academia, essentially enabling researchers to conduct reliable system level simulation of STT-MRAM using widely accepted existing simulation infrastructure. Our results show that, in HPC domain STT-MRAM provide performance comparable to DRAM. Results from the power estimation indicates that STT-MRAM power consumption increases significantly for Activation/Precharge power while Burst power increases moderately and Background power does not deviate much from DRAM. The thesis includes detailed STT-MRAM main memory timing parameters to the main repositories of DramSim2 and Ramulator, two of the most widely used and accepted state-of-the-art main memory simulators. The STT-MRAM timing parameters that has been originated as a part of this thesis, are till date the only reliable and publicly available timing information on this memory technology published from academia. Finally, the thesis analyzes the feasibility of using STT-MRAM in real-time embedded systems by investigating STT-MRAM main memory impact on average system performance and WCET. STT-MRAM's suitability for the real-time embedded systems is validated on benchmarks provided by the European Space Agency (ESA), EEMBC Autobench and MediaBench suite by analyzing performance and WCET impact. In quantitative terms, our results show that STT-MRAM main memory in real-time embedded systems provides performance and WCET comparable to conventional DRAM, while opening up opportunities to exploit various advantages.Es cuestionable si DRAM continuará escalando y cumplirá con las necesidades de los sistemas de la próxima generación. Por lo tanto, se invierte un esfuerzo significativo en la investigación y el desarrollo de nuevas tecnologías de memoria. Uno de los candidatos para la memoria de próxima generación es la Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM). STT-MRAM es una memoria no volátil emergente con un gran potencial que podría ser explotada para diversos requisitos de diferentes sistemas informáticos. Al ser una tecnología novedosa, los dispositivos STT-MRAM ya se están acercando a la DRAM en términos de capacidad, frecuencia y tamaño del dispositivo. Las características especiales de STTMRAM, como la dureza intrínseca a la radiación, la no volatilidad, la potencia de reserva cero y la capacidad de funcionar en temperaturas extremas, también lo hacen especialmente adecuado para aplicaciones aeroespaciales, de aviónica y automotriz. A pesar de ser una alternativa concebible para la tecnología de memoria principal, hasta la fecha, la investigación académica de la memoria principal de STT-MRAM sigue siendo marginal. Esto se debe principalmente a la falta de disponibilidad de los parámetros de tiempo detallados públicamente disponibles de esta nueva tecnología, que se requieren para realizar un ciclo de simulación de memoria principal precisa. Algunos investigadores adoptan modelos de memoria simplistas para simular la memoria principal, pero tales modelos pueden introducir errores significativos en el análisis del rendimiento general del sistema. Por lo tanto, los parámetros de tiempo detallados son indispensables para cualquier evaluación o estudio de exploración de la arquitectura de la memoria principal de STT-MRAM. Estos parámetros detallados no están disponibles públicamente porque los fabricantes de STT-MRAM son reacios a divulgar información delicada sobre la tecnología. Esta tesis demuestra un enfoque para realizar un ciclo de simulación precisa de la memoria principal de STT-MRAM, siendo el primero en lanzar parámetros de tiempo detallados de esta tecnología desde la academia, lo que esencialmente permite a los investigadores realizar una simulación confiable a nivel de sistema de STT-MRAM utilizando una simulación existente ampliamente aceptada infraestructura. Nuestros resultados muestran que, en el dominio HPC, STT-MRAM proporciona un rendimiento comparable al de la DRAM. Los resultados de la estimación de potencia indican que el consumo de potencia de STT-MRAM aumenta significativamente para la activation/Precharge power, mientras que la Burst power aumenta moderadamente y la Background power no se desvía mucho de la DRAM. La tesis incluye parámetros detallados de temporización memoria principal de STT-MRAM a los repositorios principales de DramSim2 y Ramulator, dos de los simuladores de memoria principal más avanzados y más utilizados y aceptados. Los parámetros de tiempo de STT-MRAM que se han originado como parte de esta tesis, son hasta la fecha la única información de tiempo confiable y disponible al público sobre esta tecnología de memoria publicada desde la academia. Finalmente, la tesis analiza la viabilidad de usar STT-MRAM en real-time embedded systems mediante la investigación del impacto de la memoria principal de STT-MRAM en el rendimiento promedio del sistema y WCET. La idoneidad de STTMRAM para los real-time embedded systems se valida en los applicaciones proporcionados por la European Space Agency (ESA), EEMBC Autobench y MediaBench, al analizar el rendimiento y el impacto de WCET. En términos cuantitativos, nuestros resultados muestran que la memoria principal de STT-MRAM en real-time embedded systems proporciona un desempeño WCET comparable al de una memoria DRAM convencional, al tiempo que abre oportunidades para explotar varias ventajas

    Software Techniques for Energy Efficient Memories

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    새로운 메모리 기술을 기반으로 한 메모리 시스템 설계 기술

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 최기영.Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years with (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demands a large capacity of main memory and an excessive amount of memory bandwidth to efficiently handle such workloads. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies. The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called Spin-Transfer Torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memory (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM rather increases the energy consumption because write operations of STT-RAM are slower and more energy-consuming than those of SRAM. To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption. The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up the possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management, etc.) inside memoryit can be also used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems. In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually-addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.Chapter 1 Introduction 1 1.1 Inefficiencies in the Current Memory Systems 2 1.1.1 On-Chip Caches 2 1.1.2 Main Memory 2 1.2 New Memory Technologies: Opportunities and Challenges 3 1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3 1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6 1.3 Dissertation Overview 9 Chapter 2 Previous Work 11 2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11 2.1.1 Hybrid Caches 11 2.1.2 Volatile STT-RAM 13 2.1.3 Redundant Write Elimination 14 2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15 2.2.1 PIM Architectures in the 1990s 15 2.2.2 Modern PIM Architectures based on 3D Stacking 15 2.2.3 Modern PIM Architectures on Memory Dies 17 Chapter 3 Loop-Aware Sleepy Instruction Cache 19 3.1 Architecture 20 3.1.1 Loop Cache 21 3.1.2 Loop-Aware Sleep Controller 22 3.2 Evaluation and Discussion 24 3.2.1 Simulation Environment 24 3.2.2 Energy 25 3.2.3 Performance 27 3.2.4 Sensitivity Analysis 27 3.3 Summary 28 Chapter 4 Lower-Bits Cache 29 4.1 Architecture 29 4.2 Experiments 32 4.2.1 Simulator and Cache Model 32 4.2.2 Results 33 4.3 Summary 34 Chapter 5 Prediction Hybrid Cache 35 5.1 Problem and Motivation 37 5.1.1 Problem Definition 37 5.1.2 Motivation 37 5.2 Write Intensity Predictor 38 5.2.1 Keeping Track of Trigger Instructions 39 5.2.2 Identifying Hot Trigger Instructions 40 5.2.3 Dynamic Set Sampling 41 5.2.4 Summary 42 5.3 Prediction Hybrid Cache 43 5.3.1 Need for Write Intensity Prediction 43 5.3.2 Organization 43 5.3.3 Operations 44 5.3.4 Dynamic Threshold Adjustment 45 5.4 Evaluation Methodology 48 5.4.1 Simulator Configuration 48 5.4.2 Workloads 50 5.5 Single-Core Evaluations 51 5.5.1 Energy Consumption and Speedup 51 5.5.2 Energy Breakdown 53 5.5.3 Coverage and Accuracy 54 5.5.4 Sensitivity to Write Intensity Threshold 55 5.5.5 Impact of Dynamic Set Sampling 55 5.5.6 Results for Non-Write-Intensive Workloads 56 5.6 Multicore Evaluations 57 5.7 Summary 59 Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61 6.1 Motivation 62 6.1.1 Energy Impact of Inefficient Write Operations 62 6.1.2 Limitations of Existing Approaches 63 6.1.3 Potential of Dead Writes 64 6.2 Dead Write Classification 65 6.2.1 Dead-on-Arrival Fills 65 6.2.2 Dead-Value Fills 66 6.2.3 Closing Writes 66 6.2.4 Decomposition 67 6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68 6.3.1 Dead Write Prediction 68 6.3.2 Bidirectional Bypass 71 6.4 Evaluation Methodology 72 6.4.1 Simulation Configuration 72 6.4.2 Workloads 74 6.5 Evaluation for Single-Core Systems 75 6.5.1 Energy Consumption and Speedup 75 6.5.2 Coverage and Accuracy 78 6.5.3 Sensitivity to Signature 78 6.5.4 Sensitivity to Update Policy 80 6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80 6.5.6 Impact of Prefetching 80 6.6 Evaluation for Multi-Core Systems 81 6.6.1 Energy Consumption and Speedup 81 6.6.2 Application to Inclusive Caches 83 6.6.3 Application to Three-Level Cache Hierarchy 84 6.7 Summary 85 Chapter 7 Link Power Management for Hybrid Memory Cubes 87 7.1 Background and Motivation 88 7.1.1 Hybrid Memory Cube 88 7.1.2 Motivation 89 7.2 HMC Link Power Management 91 7.2.1 Link Delay Monitor 91 7.2.2 Power State Transition 94 7.2.3 Overhead 95 7.3 Two-Level Prefetching 95 7.4 Application to Multi-HMC Systems 97 7.5 Experiments 98 7.5.1 Methodology 98 7.5.2 Link Energy Consumption and Speedup 100 7.5.3 HMC Energy Consumption 102 7.5.4 Runtime Behavior of LPM 102 7.5.5 Sensitivity to Slowdown Threshold 104 7.5.6 LPM without Prefetching 104 7.5.7 Impact of Prefetching on Link Traffic 105 7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107 7.5.9 Tighter Off-Chip Bandwidth Margin 107 7.5.10 Multithreaded Workloads 108 7.5.11 Multi-HMC Systems 109 7.6 Summary 111 Chapter 8 Tesseract PIM System for Parallel Graph Processing 113 8.1 Background and Motivation 115 8.1.1 Large-Scale Graph Processing 115 8.1.2 Graph Processing on Conventional Systems 117 8.1.3 Processing-in-Memory 118 8.2 Tesseract Architecture 119 8.2.1 Overview 119 8.2.2 Remote Function Call via Message Passing 122 8.2.3 Prefetching 124 8.2.4 Programming Interface 126 8.2.5 Application Mapping 127 8.3 Evaluation Methodology 128 8.3.1 Simulation Configuration 128 8.3.2 Workloads 129 8.4 Evaluation Results 130 8.4.1 Performance 130 8.4.2 Iso-Bandwidth Comparison 133 8.4.3 Execution Time Breakdown 134 8.4.4 Prefetch Efficiency 134 8.4.5 Scalability 135 8.4.6 Effect of Higher Off-Chip Network Bandwidth 136 8.4.7 Effect of Better Graph Distribution 137 8.4.8 Energy/Power Consumption and Thermal Analysis 138 8.5 Summary 139 Chapter 9 PIM-Enabled Instructions 141 9.1 Potential of ISA Extensions as the PIM Interface 143 9.2 PIM Abstraction 145 9.2.1 Operations 145 9.2.2 Memory Model 147 9.2.3 Software Modification 148 9.3 Architecture 148 9.3.1 Overview 148 9.3.2 PEI Computation Unit (PCU) 149 9.3.3 PEI Management Unit (PMU) 150 9.3.4 Virtual Memory Support 153 9.3.5 PEI Execution 153 9.3.6 Comparison with Active Memory Operations 154 9.4 Target Applications for Case Study 155 9.4.1 Large-Scale Graph Processing 155 9.4.2 In-Memory Data Analytics 156 9.4.3 Machine Learning and Data Mining 157 9.4.4 Operation Summary 157 9.5 Evaluation Methodology 158 9.5.1 Simulation Configuration 158 9.5.2 Workloads 159 9.6 Evaluation Results 159 9.6.1 Performance 160 9.6.2 Sensitivity to Input Size 163 9.6.3 Multiprogrammed Workloads 164 9.6.4 Balanced Dispatch: Idea and Evaluation 165 9.6.5 Design Space Exploration for PCUs 165 9.6.6 Performance Overhead of the PMU 167 9.6.7 Energy, Area, and Thermal Issues 167 9.7 Summary 168 Chapter 10 Aggregation-in-Memory 171 10.1 Motivation 173 10.1.1 Rethinking PIM for Energy Efficiency 173 10.1.2 Aggregation as PIM Operations 174 10.2 Architecture 176 10.2.1 Overview 176 10.2.2 Programming Model 177 10.2.3 On-Chip Caches 177 10.2.4 Coherence and Consistency 181 10.2.5 Main Memory 181 10.2.6 Potential Generalization Opportunities 183 10.3 Compiler Support 184 10.4 Contributions over Prior Art 185 10.4.1 PIM-Enabled Instructions 185 10.4.2 Parallel Reduction in Caches 187 10.4.3 Row Buffer Locality of DRAM Writes 188 10.5 Target Applications 188 10.6 Evaluation Methodology 190 10.6.1 Simulation Configuration 190 10.6.2 Hardware Overhead 191 10.6.3 Workloads 192 10.7 Evaluation Results 192 10.7.1 Energy Consumption and Performance 192 10.7.2 Dynamic Energy Breakdown 196 10.7.3 Comparison with Aggressive Writeback 197 10.7.4 Multiprogrammed Workloads 198 10.7.5 Comparison with Intrinsic-based Code 198 10.8 Summary 199 Chapter 11 Conclusion 201 11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202 11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203 Bibliography 205 요약 227Docto
    corecore