20 research outputs found

    A survey of near-data processing architectures for neural networks

    Get PDF
    Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von-Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as Near-Data Processing (NDP), machine learning, and especially neural network (NN)-based accelerators has grown significantly. Emerging memory technologies, such as ReRAM and 3D-stacked, are promising for efficiently architecting NDP-based accelerators for NN due to their capabilities to work as both high-density/low-energy storage and in/near-memory computation/search engine. In this paper, we present a survey of techniques for designing NDP architectures for NN. By classifying the techniques based on the memory technology employed, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend the adoption of NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers, and researchers in the area of machine learning.This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.Peer ReviewedPostprint (published version

    Replica Bit-Line Technique for Embedded Multilevel Gain-Cell DRAM

    Get PDF
    Multilevel gain-cell DRAMs are interesting to improve the area-efficiency of modern fault-tolerant systems-on-chip implemented in deep-submicron CMOS technologies. This paper addresses the problem of long access times in such multilevel gain-cell DRAMs, which are further aggravated by process parameter variations. A replica bit-line (BL) technique, previously proposed for SRAM, is adapted to speed up the multilevel read operation at a negligible area-increase. Moreover, the same replica column is used to improve the write access time. An 8-kb DRAM macro implemented in 90-nm CMOS technology shows that the replica column is able to successfully track die-to-die process, voltage, and temperature variations to generate control signals with optimum delay. Finally, Monte-Carlo simulations show that a small timing margin of 100 ps is sufficient to also cope with within-die process variations

    Design and failure analysis of logic-compatible multilevel gain-cell-based DRAM for fault-tolerant VLSI systems

    Get PDF
    This paper considers the problem of increasing the storage density in fault-tolerant VLSI systems which require only limited data retention times. To this end, the concept of storing many bits per memory cell is applied to area-efficient and fully logic-compatible gain-cell-based dynamic memories. A memory macro in 90-nm CMOS technology including multilevel write and read circuits is proposed and analyzed with respect to its read failure probability due to within-die process variations by means of Monte Carlo simulations

    Understanding and Improving the Latency of DRAM-Based Memory Systems

    Full text link
    Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts, and the emergence of increasingly more data-intensive and latency-critical applications further stress the importance of providing low-latency memory access. In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency. The key conclusion of this dissertation is that augmenting DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and the detailed experimental data and observations on real commodity DRAM chips presented in this dissertation will enable development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.Comment: PhD Dissertatio

    Computing with Spintronics: Circuits and architectures

    Get PDF
    This thesis makes the following contributions towards the design of computing platforms with spintronic devices. 1) It explores the use of spintronic memories in the design of a domain-specific processor for an emerging class of data-intensive applications, namely recognition, mining and synthesis (RMS). Two different spintronic memory technologies — Domain Wall Memory (DWM) and STT-MRAM — are utilized to realize the different levels in the memory hierarchy of the domain-specific processor, based on their respective access characteristics. Architectural tradeoffs created by the use of spintronic memories are analyzed. The proposed design achieves 1.5X-4X improvements in energy-delay product compared to a CMOS baseline. 2) It describes the first attempt to use DWM in the cache hierarchy of general-purpose processors. DWM promises unparalleled density by packing several bits of data into each bit-cell. TapeCache, the proposed DWM-based cache architecture, utilizes suitable circuit and architectural optimizations to address two key challenges (i) the high energy and latency requirement of write operations and (ii) the need for shift operations to access the data stored in each DWM bit-cell. At the circuit level, DWM bit-cells that are tailored to the distinct design requirements of different levels in the cache hierarchy are proposed. At the architecture level, TapeCache proposes suitable cache organization and management policies to alleviate the performance impact of shift operations required to access data stored in DWM bit-cells. TapeCache achieves more than 7X improvements in both cache area and energy with virtually identical performance compared to an SRAM-based cache hierarchy. 3) It investigates the design of the on-chip memory hierarchy of general-purpose graphics processing units (GPGPUs)—massively parallel processors that are optimized for data-intensive high-throughput workloads—using DWM. STAG, a high density, energy-efficient Spintronic- Tape Architecture for GPGPU cache hierarchies is described. STAG utilizes different DWM bit-cells to realize different memory arrays in the GPGPU cache hierarchy. To address the challenge of high access latencies due to shifts, STAG predicts upcoming cache accesses by leveraging unique characteristics of GPGPU architectures and workloads, and prefetches data that are both likely to be accessed and require large numbers of shift operations. STAG achieves 3.3X energy reduction and 12.1% performance improvement over CMOS SRAM under iso-area conditions. 4) While the potential of spintronic devices for memories is widely recognized, their utility in realizing logic is much less clear. The thesis presents Spintastic, a new paradigm that utilizes Stochastic Computing (SC) to realize spintronic logic. In SC, data is encoded in the form of pseudo-random bitstreams, such that the probability of a \u271\u27 in a bitstream corresponds to the numerical value that it represents. SC can enable compact, low-complexity logic implementations of various arithmetic functions. Spintastic establishes the synergy between stochastic computing and spin-based logic by demonstrating that they mutually alleviate each other\u27s limitations. On the one hand, various building blocks of SC, which incur significant overheads in CMOS implementations, can be efficiently realized by exploiting the physical characteristics of spin devices. On the other hand, the reduced logic complexity and low logic depth of SC circuits alleviates the shortcomings of spintronic logic. Based on this insight, the design of spin-based stochastic arithmetic circuits, bitstream generators, bitstream permuters and stochastic-to-binary converter circuits are presented. Spintastic achieves 7.1X energy reduction over CMOS implementations for a wide range of benchmarks from the image processing, signal processing, and RMS application domains. 5) In order to evaluate the proposed spintronic designs, the thesis describes various device-to-architecture modeling frameworks. Starting with devices models that are calibrated to measurements, the characteristics of spintronic devices are successively abstracted into circuit-level and architectural models, which are incorporated into suitable simulation frameworks. (Abstract shortened by UMI.

    DISK DESIGN-SPACE EXPLORATION IN TERMS OF SYSTEM-LEVEL PERFORMANCE, POWER, AND ENERGY CONSUMPTION

    Get PDF
    To make the common case fast, most studies focus on the computation phase of applications in which most instructions are executed. However, many programs spend significant time in the I/O intensive phase due to the I/O latency. To obtain a system with more balanced phases, we require greater insight into the effects of the I/O configurations to the entire system in both performance and power dissipation domains. Due to lack of public tools with the complete picture of the entire memory hierarchy, we developed SYSim. SYSim is a complete-system simulator aiming at complete memory hierarchy studies in both performance and power consumption domains. In this dissertation, we used SYSim to investigate the system-level impacts of several disk enhancements and technology improvements to the detailed interaction in memory hierarchy during the I/O-intensive phase. The experimental results are reported in terms of both total system performance and power/energy consumption. With SYSim, we conducted the complete-system experiments and revealed intriguing behaviors including, but not limited to, the following: During the I/O intensive phase which consists of both disk reads and writes, the average system CPI tracks only average disk read response time, and not overall average disk response time, which is the widely-accepted metric in disk drive research. In disk read-dominating applications, Disk Prefetching is more important than increasing the disk RPM. On the other hand, in applications with both disk reads and writes, the disk RPM matters. The execution time can be improved to an order of magnitude by applying some disk enhancements. Using disk caching and prefetching can improve the performance by the factor of 2, and write-buffering can improve the performance by the factor of 10. Moreover, using disk caching/prefetching and the write-buffering techniques in conjunction can improve the total system performance by at least an order of magnitude. Increasing the disk RPM and the number of disks in RAID disk system also have an impressive improvement over the total system performance. However, employing such techniques requires careful consideration for trade-offs in power/energy consumption

    Emerging Run-Time Reconfigurable FPGA and CAD Tools

    Get PDF
    Field-programmable gate array (FPGA) is a post fabrication reconfigurable device to accelerate domain specific computing systems. It offers offer high operation speed and low power consumption. However, the design flexibility and performance of FPGAs are severely constrained by the costly on-chip memories, e.g. static random access memory (SRAM) and FLASH memory. The objective of my dissertation is to explore the opportunity and enable the use of the emerging resistance random access memory (ReRAM) in FPGA design. The emerging ReRAM technology features high storage density, low access power consumption, and CMOS compatibility, making it a promising candidate for FPGA implementation. In particular, ReRAM has advantages of the fast access and nonvolatility, enabling the on-chip storage and access of configuration data. In this dissertation, I first propose a novel three-dimensional stacking scheme, namely, high-density interleaved memory (HIM). The structure improves the density of ReRAM meanwhile effectively reducing the signal interference induced by sneak paths in crossbar arrays. To further enhance the access speed and design reliability, a fast sensing circuit is also presented which includes a new sense amplifier scheme and reference cell configuration. The proposed ReRAM FPGA leverages a similar architecture as conventional SRAM based FPGAs but utilizes ReRAM technology in all component designs. First, HIM is used to implement look-up table (LUT) and block random access memories (BRAMs) for func- tionality process. Second, a 2R1T, two ReRAM cells and one transistor, nonvolatile switch design is applied to construct connection blocks (CBs) and switch blocks (SBs) for signal transition. Furthermore, unified BRAM (uBRAM) based on the current BRAM architecture iv is introduced, offering both configuration and temporary data storage. The uBRAMs provides extremely high density effectively and enlarges the FPGA capacity, potentially saving multiple contexts of configuration. The fast configuration scheme from uBRAM to logic and routing components also makes fast run-time partial reconfiguration (PR) much easier, improving the flexibility and performance of the entire FPGA system. Finally, modern place and route tools are designed for homogeneous fabric of FPGA. The PR feature, however, requires the support of heterogeneous logic modules in order to differentiate PR modules from static ones and therefore maintain the signal integration. The existing approaches still reply on designers’ manual effort, which significantly prolongs design time and lowers design efficiency. In this dissertation, I integrate PR support into VPR – an academic place and route tool by introducing a B*-tree modular placer (BMP) and PR-aware router. As such, users are able to explore new architectures or map PR applications to a variety of FPGAs. More importantly, this enhanced feature can also support fast design automation, e.g. mapping IP core, loading pre-synthesizing logic modules, etc

    Architecture and Circuit Design Optimization for Compute-In-Memory

    Get PDF
    The objective of the proposed research is to optimize computing-in-memory (CIM) design for accelerating Deep Neural Network (DNN) algorithms. As compute peripheries such as analog-to-digital converter (ADC) introduce significant overhead in CIM inference design, the research first focuses on the circuit optimization for inference acceleration and proposes a resistive random access memory (RRAM) based ADC-free in-memory compute scheme. We comprehensively explore the trade-offs involving different types of ADCs and investigate a new ADC design especially suited for the CIM, which performs the analog shift-add for multiple weight significance bits, improving the throughput and energy efficiency under similar area constraints. Furthermore, we prototype an ADC-free CIM inference chip design with a fully-analog data processing manner between sub-arrays, which can significantly improve the hardware performance over the conventional CIM designs and achieve near-software classification accuracy on ImageNet and CIFAR-10/-100 dataset. Secondly, the research focuses on hardware support for CIM on-chip training. To maximize hardware reuse of CIM weight stationary dataflow, we propose the CIM training architectures with the transpose weight mapping strategy. The cell design and periphery circuitry are modified to efficiently support bi-directional compute. A novel solution of signed number multiplication is also proposed to handle the negative input in backpropagation. Finally, we propose an SRAM-based CIM training architecture and comprehensively explore the system-level hardware performance for DNN on-chip training based on silicon measurement results.Ph.D
    corecore