24 research outputs found

    Accelerating Time Series Analysis via Processing using Non-Volatile Memories

    Full text link
    Time Series Analysis (TSA) is a critical workload for consumer-facing devices. Accelerating TSA is vital for many domains as it enables the extraction of valuable information and predict future events. The state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping (sDTW) algorithm. However, sDTW's computation complexity increases quadratically with the time series' length, resulting in two performance implications. First, the amount of data parallelism available is significantly higher than the small number of processing units enabled by commodity systems (e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low arithmetic intensity and 2) incurs a large memory footprint. To tackle these two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ computation where data resides, using the memory cells. PuM provides a promising solution to alleviate data movement bottlenecks and exposes immense parallelism. In this work, we present MATSA, the first MRAM-based Accelerator for Time Series Analysis. The key idea is to exploit magneto-resistive memory crossbars to enable energy-efficient and fast time series computation in memory. MATSA provides the following key benefits: 1) it leverages high levels of parallelism in the memory substrate by exploiting column-wise arithmetic operations, and 2) it significantly reduces the data movement costs performing computation using the memory cells. We evaluate three versions of MATSA to match the requirements of different environments (e.g., embedded, desktop, or HPC computing) based on MRAM technology trends. We perform a design space exploration and demonstrate that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM architectures, respectively

    Monolithically Integrated SRAM-ReRAM Cache-Main Memory System

    Get PDF
    Emerging non-volatile memories are dense and potentially compatible with standard CMOS processes, enabling a monolithically integrated CPU-main memory chip. However, area constraints impact the feasibility of fitting the entirety of a multi-core CPU and main memory system into a single die. ReRAM presents a unique opportunity in that it can be fabricated in crosspoint subarrays which leave the bulk of transistors beneath them available for other logic. However, ReRAM also poses a performance challenge; the latency is generally much higher than that of DRAM. Compensating for this through the increased bandwidth afforded from being on-die poses an architectural problem. The access circuitry for ReRAM subarrays requires only a small percentage of the area beneath the array. Still, this dense circuitry and wiring disrupts the layouts of irregular logic like CPUs. Caches are very regular and composed of smaller subarrays, making them a better candidate to place beneath crosspoint subarrays. By co-designing the cache subarrays and ReRAM crosspoint subarrays, minimal disruption to the cache logic can be achieved while still covering the bulk of the last-level cache area in ReRAM. This work explores the design space when co-designing the last-level cache and ReRAM crosspoint subarrays. Using a modified version of Cacti, we are able to explore the design trade-offs when integrating ReRAM and cache and quantify the impact the ReRAM has on the last-level cache. This design space exploration gives us a first order approximation of the memory capacity of a monolithic computer and informs architectural simulations of such a machine. We also examine how the physical integration presents opportunities for logical integration of the last-level cache and main memory. The interconnects and controllers can be combined, and the addressing can be such that data movement between the main memory and cache is primarily vertical. These optimizations can result in area and energy savings with minor impacts on performance. The second section of this work explores one architectural style which can balance the monolithic memory system and a general-purpose compute system---a tiled multicore with wide SIMD and multi-threading. We develop a simulator for this architecture capable of simulating a wide variety of system parameters. Through a design space exploration of many of the parameters across sparse, irregular graph kernels and dense, streaming computations, we find monolithic ReRAM exceeds the performance of a state-of-the-art DRAM system for memory intensive workloads given enough parallelism. We further develop an analytic model to describe our system and highlight the important performance characteristics for a monolithic CPU-main memory chip. The analytic model is validated against our simulation data. Using this model, we examine the architectural balance of the systems we simulated. Finally, we develop an RTL model of the combined cache--main memory interface. This gives a more accurate model for the increase in resources required for the combined controller. We additionally develop a system-on-a-chip with an RTL model that alters requests to the FPGA's main memory to be at the speed of ReRAM requests. This model is used to show the performance of more computationally intensive benchmarks. It also is the first step toward creating a test chip for a monolithically integrated ReRAM main memory

    Thermal Aware Design Automation of the Electronic Control System for Autonomous Vehicles

    Get PDF
    The autonomous vehicle (AV) technology, due to its tremendous social and economical benefits, is transforming the entire world in the coming decades. However, significant technical challenges still need to be overcome until AVs can be safely, reliably, and massively deployed. Temperature plays a key role in the safety and reliability of an AV, not only because a vehicle is subjected to extreme operating temperatures but also because the increasing computations demand more powerful IC chips, which can lead to higher operating temperature and large thermal gradient. In particular, as the underpinning technology for AV, artificial intelligence (AI) requires substantially increased computation and memory resources, which have been growing exponentially through recent years and further exacerbated the thermal problems. High operating temperature and large thermal gradient can reduce the performance, degrade the reliability, and even cause an IC to fail catastrophically. We believe that dealing with thermal issues must be coupled closely in the design phase of the AVs’ electronic control system (ECS). To this end, first, we study how to map vehicle applications to ECS with heterogeneous architecture to satisfy peak temperature constraints and optimize latency and system-level reliability. We present a mathematical programming model to bound the peak temperature for the ECS. We also develop an approach based on the genetic algorithm to bound the peak temperature under varying execution time scenarios and optimize the system-level reliability of the ECS. We present several computationally efficient techniques for system-level mean-time-to-failure (MTTF) computation, which show several orders-of-magnitude speed-up over the state-of-the-art method. Second, we focus on studying the thermal impacts of AI techniques. Specifically, we study how the thermal impacts for the memory bit flipping can affect the prediction accuracy of a deep neural network (DNN). We develop a neuron-level analytical sensitivity estimation framework to quantify this impact and study its effectiveness with popular DNN architectures. Third, we study the problem of incorporating thermal impacts into mapping the parameters for DNN neurons to memory banks to improve prediction accuracy. Based on our developed sensitivity metric, we develop a bin-packing-based approach to map DNN neuron parameters to memory banks with different temperature profiles. We also study the problem of identifying the optimal temperature profiles for memory systems that can minimize the thermal impacts. We show that the thermal aware mapping of DNN neuron parameters on memory banks can significantly improve the prediction accuracy at a high-temperature range than the thermal ignorant for state-of-the-art DNNs

    새로운 메모리 기술을 기반으로 한 메모리 시스템 설계 기술

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 최기영.Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years with (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demands a large capacity of main memory and an excessive amount of memory bandwidth to efficiently handle such workloads. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies. The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called Spin-Transfer Torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memory (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM rather increases the energy consumption because write operations of STT-RAM are slower and more energy-consuming than those of SRAM. To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption. The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up the possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management, etc.) inside memoryit can be also used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems. In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually-addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.Chapter 1 Introduction 1 1.1 Inefficiencies in the Current Memory Systems 2 1.1.1 On-Chip Caches 2 1.1.2 Main Memory 2 1.2 New Memory Technologies: Opportunities and Challenges 3 1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3 1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6 1.3 Dissertation Overview 9 Chapter 2 Previous Work 11 2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11 2.1.1 Hybrid Caches 11 2.1.2 Volatile STT-RAM 13 2.1.3 Redundant Write Elimination 14 2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15 2.2.1 PIM Architectures in the 1990s 15 2.2.2 Modern PIM Architectures based on 3D Stacking 15 2.2.3 Modern PIM Architectures on Memory Dies 17 Chapter 3 Loop-Aware Sleepy Instruction Cache 19 3.1 Architecture 20 3.1.1 Loop Cache 21 3.1.2 Loop-Aware Sleep Controller 22 3.2 Evaluation and Discussion 24 3.2.1 Simulation Environment 24 3.2.2 Energy 25 3.2.3 Performance 27 3.2.4 Sensitivity Analysis 27 3.3 Summary 28 Chapter 4 Lower-Bits Cache 29 4.1 Architecture 29 4.2 Experiments 32 4.2.1 Simulator and Cache Model 32 4.2.2 Results 33 4.3 Summary 34 Chapter 5 Prediction Hybrid Cache 35 5.1 Problem and Motivation 37 5.1.1 Problem Definition 37 5.1.2 Motivation 37 5.2 Write Intensity Predictor 38 5.2.1 Keeping Track of Trigger Instructions 39 5.2.2 Identifying Hot Trigger Instructions 40 5.2.3 Dynamic Set Sampling 41 5.2.4 Summary 42 5.3 Prediction Hybrid Cache 43 5.3.1 Need for Write Intensity Prediction 43 5.3.2 Organization 43 5.3.3 Operations 44 5.3.4 Dynamic Threshold Adjustment 45 5.4 Evaluation Methodology 48 5.4.1 Simulator Configuration 48 5.4.2 Workloads 50 5.5 Single-Core Evaluations 51 5.5.1 Energy Consumption and Speedup 51 5.5.2 Energy Breakdown 53 5.5.3 Coverage and Accuracy 54 5.5.4 Sensitivity to Write Intensity Threshold 55 5.5.5 Impact of Dynamic Set Sampling 55 5.5.6 Results for Non-Write-Intensive Workloads 56 5.6 Multicore Evaluations 57 5.7 Summary 59 Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61 6.1 Motivation 62 6.1.1 Energy Impact of Inefficient Write Operations 62 6.1.2 Limitations of Existing Approaches 63 6.1.3 Potential of Dead Writes 64 6.2 Dead Write Classification 65 6.2.1 Dead-on-Arrival Fills 65 6.2.2 Dead-Value Fills 66 6.2.3 Closing Writes 66 6.2.4 Decomposition 67 6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68 6.3.1 Dead Write Prediction 68 6.3.2 Bidirectional Bypass 71 6.4 Evaluation Methodology 72 6.4.1 Simulation Configuration 72 6.4.2 Workloads 74 6.5 Evaluation for Single-Core Systems 75 6.5.1 Energy Consumption and Speedup 75 6.5.2 Coverage and Accuracy 78 6.5.3 Sensitivity to Signature 78 6.5.4 Sensitivity to Update Policy 80 6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80 6.5.6 Impact of Prefetching 80 6.6 Evaluation for Multi-Core Systems 81 6.6.1 Energy Consumption and Speedup 81 6.6.2 Application to Inclusive Caches 83 6.6.3 Application to Three-Level Cache Hierarchy 84 6.7 Summary 85 Chapter 7 Link Power Management for Hybrid Memory Cubes 87 7.1 Background and Motivation 88 7.1.1 Hybrid Memory Cube 88 7.1.2 Motivation 89 7.2 HMC Link Power Management 91 7.2.1 Link Delay Monitor 91 7.2.2 Power State Transition 94 7.2.3 Overhead 95 7.3 Two-Level Prefetching 95 7.4 Application to Multi-HMC Systems 97 7.5 Experiments 98 7.5.1 Methodology 98 7.5.2 Link Energy Consumption and Speedup 100 7.5.3 HMC Energy Consumption 102 7.5.4 Runtime Behavior of LPM 102 7.5.5 Sensitivity to Slowdown Threshold 104 7.5.6 LPM without Prefetching 104 7.5.7 Impact of Prefetching on Link Traffic 105 7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107 7.5.9 Tighter Off-Chip Bandwidth Margin 107 7.5.10 Multithreaded Workloads 108 7.5.11 Multi-HMC Systems 109 7.6 Summary 111 Chapter 8 Tesseract PIM System for Parallel Graph Processing 113 8.1 Background and Motivation 115 8.1.1 Large-Scale Graph Processing 115 8.1.2 Graph Processing on Conventional Systems 117 8.1.3 Processing-in-Memory 118 8.2 Tesseract Architecture 119 8.2.1 Overview 119 8.2.2 Remote Function Call via Message Passing 122 8.2.3 Prefetching 124 8.2.4 Programming Interface 126 8.2.5 Application Mapping 127 8.3 Evaluation Methodology 128 8.3.1 Simulation Configuration 128 8.3.2 Workloads 129 8.4 Evaluation Results 130 8.4.1 Performance 130 8.4.2 Iso-Bandwidth Comparison 133 8.4.3 Execution Time Breakdown 134 8.4.4 Prefetch Efficiency 134 8.4.5 Scalability 135 8.4.6 Effect of Higher Off-Chip Network Bandwidth 136 8.4.7 Effect of Better Graph Distribution 137 8.4.8 Energy/Power Consumption and Thermal Analysis 138 8.5 Summary 139 Chapter 9 PIM-Enabled Instructions 141 9.1 Potential of ISA Extensions as the PIM Interface 143 9.2 PIM Abstraction 145 9.2.1 Operations 145 9.2.2 Memory Model 147 9.2.3 Software Modification 148 9.3 Architecture 148 9.3.1 Overview 148 9.3.2 PEI Computation Unit (PCU) 149 9.3.3 PEI Management Unit (PMU) 150 9.3.4 Virtual Memory Support 153 9.3.5 PEI Execution 153 9.3.6 Comparison with Active Memory Operations 154 9.4 Target Applications for Case Study 155 9.4.1 Large-Scale Graph Processing 155 9.4.2 In-Memory Data Analytics 156 9.4.3 Machine Learning and Data Mining 157 9.4.4 Operation Summary 157 9.5 Evaluation Methodology 158 9.5.1 Simulation Configuration 158 9.5.2 Workloads 159 9.6 Evaluation Results 159 9.6.1 Performance 160 9.6.2 Sensitivity to Input Size 163 9.6.3 Multiprogrammed Workloads 164 9.6.4 Balanced Dispatch: Idea and Evaluation 165 9.6.5 Design Space Exploration for PCUs 165 9.6.6 Performance Overhead of the PMU 167 9.6.7 Energy, Area, and Thermal Issues 167 9.7 Summary 168 Chapter 10 Aggregation-in-Memory 171 10.1 Motivation 173 10.1.1 Rethinking PIM for Energy Efficiency 173 10.1.2 Aggregation as PIM Operations 174 10.2 Architecture 176 10.2.1 Overview 176 10.2.2 Programming Model 177 10.2.3 On-Chip Caches 177 10.2.4 Coherence and Consistency 181 10.2.5 Main Memory 181 10.2.6 Potential Generalization Opportunities 183 10.3 Compiler Support 184 10.4 Contributions over Prior Art 185 10.4.1 PIM-Enabled Instructions 185 10.4.2 Parallel Reduction in Caches 187 10.4.3 Row Buffer Locality of DRAM Writes 188 10.5 Target Applications 188 10.6 Evaluation Methodology 190 10.6.1 Simulation Configuration 190 10.6.2 Hardware Overhead 191 10.6.3 Workloads 192 10.7 Evaluation Results 192 10.7.1 Energy Consumption and Performance 192 10.7.2 Dynamic Energy Breakdown 196 10.7.3 Comparison with Aggressive Writeback 197 10.7.4 Multiprogrammed Workloads 198 10.7.5 Comparison with Intrinsic-based Code 198 10.8 Summary 199 Chapter 11 Conclusion 201 11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202 11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203 Bibliography 205 요약 227Docto

    MRAM commercialization potential evaluation Research Based on the Chinese Market

    Get PDF
    Regarding the Chinese data storage industries, there is an urgent need for a national strategy and new investments as there are new technologies emerging in the global markets. The storage technology commercialization activities are becoming a widespread concern for the Chinese government and their strategic enterprises. The promotion of storage technology commercialization has become a common goal for enterprise and national government strategies. How the potential for commercialization of a storage technology can be assessed, what evaluation index should be used, and what the factors affect the storage technology are the important issues that must be addressed

    A Modern Primer on Processing in Memory

    Full text link
    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems
    corecore