
    A Modern Primer on Processing in Memory

    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability, and energy bottlenecks: (1) data access is a key bottleneck, as many important applications are increasingly data-intensive while memory bandwidth and energy do not scale well; (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems; (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy, and latency, much more so than computation. These trends are felt especially severely in today's data-intensive server and energy-constrained mobile systems. At the same time, conventional memory technology faces many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory-plus-logic technology, the adoption of error-correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.
    Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398
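
    To make the chapter's central cost argument concrete, the sketch below models a streaming reduction executed compute-centrically versus near memory. The energy constants, function names, and the 1,000,000-word workload are illustrative assumptions, not figures from the chapter; only the relative ordering (off-chip movement far costlier than computation) reflects the text.

```python
# Back-of-the-envelope comparison of compute-centric vs. PIM execution
# of a streaming sum. All energy constants are illustrative placeholders.

WORD_BYTES = 8
E_OFFCHIP_PER_BYTE = 100.0  # pJ per byte moved off-chip (assumed)
E_ONCHIP_PER_BYTE = 5.0     # pJ per byte moved near a PIM unit (assumed)
E_ADD = 1.0                 # pJ per 64-bit addition (assumed)

def compute_centric_energy(n_words: int) -> float:
    """Move every word across the off-chip bus to the CPU, then add."""
    return n_words * (WORD_BYTES * E_OFFCHIP_PER_BYTE + E_ADD)

def pim_energy(n_words: int) -> float:
    """Add in or near memory; only the final result crosses the bus."""
    return (n_words * (WORD_BYTES * E_ONCHIP_PER_BYTE + E_ADD)
            + WORD_BYTES * E_OFFCHIP_PER_BYTE)

if __name__ == "__main__":
    n = 1_000_000  # sum one million 64-bit words
    cc, pim = compute_centric_energy(n), pim_energy(n)
    print(f"compute-centric: {cc/1e6:.1f} uJ, PIM: {pim/1e6:.1f} uJ, "
          f"ratio: {cc/pim:.1f}x")
```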

    Architectural Techniques for Disturbance Mitigation in Future Memory Systems

    Recent advances in CMOS technology have allowed feature sizes to scale down, improving memory capacity, power, performance, and cost. However, this dramatic progress has made precise control of the manufacturing process below 22 nm increasingly difficult. The technology scaling roadmap therefore predicts significant cell-to-cell process variation, as well as electromagnetic disturbances among memory cells that push circuit characteristics away from design goals and pose threats to reliability, energy efficiency, and security. This dissertation proposes simple, energy-efficient, and low-overhead techniques that combat the challenges resulting from technology scaling in future memory systems. Specifically, it investigates solutions tuned to particular types of disturbance, such as inter-cell or intra-cell disturbance, that are energy-efficient while guaranteeing memory reliability. The contribution of this dissertation is threefold. First, it uses a deterministic counter-based approach to target the root of inter-cell disturbances in Dynamic Random Access Memory (DRAM), mitigating them deterministically while also benefiting overall energy consumption. Second, it uses Markov chains to reason about the reliability of Spin-Transfer Torque Magnetic Random-Access Memory (STT-RAM), which suffers from intra-cell disturbances, and then investigates on-demand refresh policies to recover from the persistent effect of such disturbances. Third, it leverages an encoding technique integrated with a novel word-level compression scheme to reduce the vulnerability of cells to inter-cell write disturbances in Phase Change Memory (PCM). However, mitigating inter-cell write disturbances while also minimizing write energy may increase the number of updated PCM cells and degrade endurance; hence, it uses multi-objective optimization to balance write energy and endurance in PCM cells while mitigating inter-cell disturbances. The work in this dissertation provides important insights into how to tackle the critical reliability challenges that high-density memory systems confront at deeply scaled technology nodes. It advocates techniques, across various memory technologies, that guarantee the reliability of future memory systems while incurring nominal costs in energy, area, and performance.
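
    A minimal sketch of the general idea behind the first contribution, a deterministic counter-based mitigation of inter-cell (RowHammer-style) disturbance in DRAM: track per-row activations and refresh the physically adjacent victim rows once a threshold is crossed. The class name, threshold value, and callback interface are hypothetical; this illustrates the flavor of the approach, not the dissertation's exact scheme.

```python
# Hypothetical per-row activation tracker for deterministic inter-cell
# disturbance mitigation. Crossing the (assumed) threshold triggers a
# refresh of both neighbor rows and resets the aggressor's counter.

from collections import defaultdict

ACT_THRESHOLD = 50_000  # hypothetical maximum activation count

class RowCounterMitigator:
    def __init__(self, refresh_callback):
        self.counts = defaultdict(int)
        self.refresh = refresh_callback  # called with a victim row id

    def on_activate(self, row: int) -> None:
        self.counts[row] += 1
        if self.counts[row] >= ACT_THRESHOLD:
            # Deterministically refresh neighbors before any bit can flip.
            for victim in (row - 1, row + 1):
                self.refresh(victim)
            self.counts[row] = 0  # reset after mitigation

# Usage: feed it the activations issued by the memory controller.
refreshed = []
m = RowCounterMitigator(refreshed.append)
for _ in range(ACT_THRESHOLD):
    m.on_activate(7)
print(refreshed)  # -> [6, 8]
```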

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
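
    In the spirit of the paper's classification methodology, the sketch below buckets profiled functions by how memory-bound they are. The metric names echo quantities the abstract implies (compute intensity, last-level cache miss rate), but the thresholds and categories here are invented for illustration and are not the paper's actual classification rules.

```python
# Simplified, illustrative bottleneck classifier: bucket functions by
# arithmetic intensity and LLC misses per kilo-instruction (MPKI).
# Thresholds are assumptions, not values from the DAMOV paper.

from dataclasses import dataclass

@dataclass
class FunctionProfile:
    name: str
    arithmetic_intensity: float  # operations per byte moved from memory
    llc_mpki: float              # last-level cache misses per kilo-instruction

def classify(p: FunctionProfile) -> str:
    if p.arithmetic_intensity >= 10.0:
        return "compute-bound: caches and prefetchers likely suffice"
    if p.llc_mpki >= 10.0:
        return "DRAM-bound: strong near-data-processing candidate"
    return "cache-sensitive: benefits from the on-chip hierarchy"

profiles = [
    FunctionProfile("dense_gemm", 32.0, 1.2),
    FunctionProfile("pointer_chase", 0.1, 45.0),
    FunctionProfile("small_stencil", 1.5, 3.0),
]
for p in profiles:
    print(f"{p.name}: {classify(p)}")
```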

    Addressing Memory Bottlenecks for Emerging Applications

    Applications from the domains of machine learning, data mining, numerical analysis, and image processing have recently emerged as the primary algorithms driving many important user-facing services, and they are becoming pervasive in our daily lives. Due to their increasing usage in both mobile and datacenter workloads, it is necessary to understand the software and hardware demands of these applications and to design techniques that match their growing needs. This dissertation studies the performance bottlenecks that arise when we try to improve the performance of these applications on current hardware systems. We observe that most of these applications are data-intensive, i.e., they operate on large amounts of data; consequently, they put significant pressure on the memory system. Interestingly, this pressure is not limited to one memory structure: different applications stress different levels of the memory hierarchy. For example, training Deep Neural Networks (DNNs), an emerging machine learning approach, is currently limited by the size of GPU main memory. At the other end of the spectrum, DNN inference on CPUs is bottlenecked by Physical Register File (PRF) bandwidth. Concretely, this dissertation tackles four such memory bottlenecks for these emerging applications across the memory hierarchy (off-chip memory, on-chip memory, and the physical register file), presenting hardware and software techniques to address them and improve application performance. For on-chip memory, we present two scenarios where emerging applications perform sub-optimally. First, many applications have a large number of marginal bits that do not contribute to application accuracy, wasting space and transfer costs. We present ACME, an asymmetric compute-memory paradigm that removes marginal bits from the memory hierarchy while performing the computation in full precision. Second, we tackle contention in shared caches, which arises in datacenters where multiple applications share the same cache capacity. We present ShapeShifter, a runtime system that continuously monitors the runtime environment, detects changes in cache availability, and dynamically recompiles the application on the fly to utilize the cache capacity efficiently. For the physical register file, we observe that DNN inference on CPUs is primarily limited by PRF bandwidth: increasing the number of compute units in a CPU requires more read ports in the PRF, which quickly reaches a point where its latency target can no longer be met. To solve this problem, we present LEDL, locality extensions for deep learning on CPUs, which entails a rearchitected FMA and PRF design tailored to the heavy data reuse inherent in DNN inference. Finally, a significant challenge facing both researchers and industry practitioners is that, as DNNs grow deeper and larger, DNN training becomes limited by the size of GPU main memory, restricting the size of the networks that GPUs can train. To tackle this challenge, we first identify the primary contributors to this heavy memory footprint, finding that the feature maps (intermediate layer outputs) are the heaviest contributors in training, as opposed to the weights in inference. Then, we present Gist, a runtime system that uses three efficient data encoding techniques to reduce the footprint of DNN training.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/146016/1/anijain_1.pd
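
    One kind of feature-map encoding in the spirit of Gist can be sketched as follows: a ReLU layer's output is saved only for the backward pass, where all that matters for the ReLU gradient is which elements were positive, so a 1-bit mask can stand in for 32-bit floats. This is an illustration of the general idea under that assumption, not Gist's actual implementation or API.

```python
# Illustrative 1-bit encoding of a ReLU activation saved for backward.
# Storing the sign mask instead of float32 cuts that stored map ~32x.

import numpy as np

def relu_forward(x: np.ndarray):
    y = np.maximum(x, 0.0)
    mask_bits = np.packbits(x > 0)   # 1 bit per element instead of 32
    return y, mask_bits

def relu_backward(grad_y: np.ndarray, mask_bits: np.ndarray) -> np.ndarray:
    mask = np.unpackbits(mask_bits, count=grad_y.size).reshape(grad_y.shape)
    return grad_y * mask             # gradient flows only where x > 0

x = np.random.randn(4, 4).astype(np.float32)
y, bits = relu_forward(x)
gx = relu_backward(np.ones_like(y), bits)
print(f"saved {bits.nbytes} bytes for backward instead of {x.nbytes}")
```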

    Bit-Flip Aware Data Structures for Phase Change Memory

    Big, byte-addressable, low-cost, and fast non-volatile memories like Phase Change Memory are appearing in the marketplace. They have the capability to unify both memory and storage and allow us to rethink the present memory hierarchy. An important drawback of Phase Change Memory is its limited write endurance. In addition, Phase Change Memory shares with other Non-Volatile Random Access Memories an asymmetry in the energy costs of writes and reads. Making the best use of Non-Volatile Random Access Memories means limiting the number of times a cell changes contents, called a bit-flip. While the future of main memory is still unknown, we should already start to create data structures for these memories in order to shape the coming era. This thesis investigates the creation of bit-flip-aware data structures. The thesis first considers general ways in which a data structure can save bit-flips through smart overwrites and by using the exclusive-or of pointers. It then shows how a simple content-dependent encoding can reduce bit-flips for web corpora, and how to build hash-based dictionary structures for Linear Hashing and Spiral Storage. Finally, the thesis presents Gray counters: close-to-bit-flip-optimal counters that even enable age-based wear leveling, with counters managed by the Non-Volatile Random Access Memories themselves instead of by the operating system.
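
    The idea behind Gray counters can be demonstrated directly with the binary-reflected Gray code, under which each increment flips exactly one stored bit, whereas an ordinary binary counter flips about two bits per increment on average. The sketch below counts the flips; it illustrates the encoding itself, not the thesis's wear-leveling machinery.

```python
# Compare bit flips for a plain binary counter vs. a binary-reflected
# Gray-code counter over 255 increments of an 8-bit value.

def to_gray(n: int) -> int:
    return n ^ (n >> 1)  # standard binary-reflected Gray encoding

def bit_flips(a: int, b: int) -> int:
    return bin(a ^ b).count("1")  # Hamming distance between encodings

binary_flips = sum(bit_flips(i, i + 1) for i in range(255))
gray_flips = sum(bit_flips(to_gray(i), to_gray(i + 1)) for i in range(255))
print(f"binary counter: {binary_flips} flips, Gray counter: {gray_flips}")
# The Gray counter flips exactly one bit per increment (255 total),
# which is why it is close to bit-flip optimal for PCM-resident counters.
```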

    Edge intelligence empowered metaverse: architecture, technologies, and open issues

    Recently, the metaverse has emerged as a focal point of widespread interest, capturing attention across various domains. However, the construction of a pluralistic, realistic, and shared digital world is still in its infancy. Due to ultra-strict requirements on security, intelligence, and real-time performance, it is urgent to solve the technical challenges involved in building metaverse ecosystems, such as ensuring seamless communication and reliable computing services in the face of a dynamic, time-varying, and complex network environment. In terms of digital infrastructure, edge computing (EC), as a distributed computing paradigm, has the potential to guarantee computing power, bandwidth, and storage. Meanwhile, artificial intelligence (AI) is regarded as a powerful tool for providing automated and efficient decision-making for metaverse devices. In this context, this paper focuses on integrating EC and AI to facilitate the development of the metaverse, namely, the edge-intelligence-empowered metaverse. Specifically, we first outline the metaverse architecture and driving technologies and discuss EC as a key component of the digital infrastructure for metaverse realization. Then, we elaborate on two mainstream classifications of edge intelligence in metaverse scenarios: AI for edge and AI on edge. Finally, we identify some open issues.