22 research outputs found

    Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators

    Get PDF
    We propose a lightweight scheme where the formation of a data block is changed in such a way that it can tolerate soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized between -1 and 1 after each convolutional layer, and this leaves one bit unused in half-precision floating-point representation. By taking advantage of the unused bit, we create a backup for the most significant bit to protect it against the soft errors. Also, considering the fact that in MLC STT-RAMs the cost of memory operations (read and write), and reliability of a cell are content-dependent (some patterns take larger current and longer time, while they are more susceptible to soft error), we rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques provides the same level of accuracy compared to an error-free baseline while improving the read and write energy by 9% and 6%, respectively

    Implications of Noise in Resistive Memory on Deep Neural Networks for Image Classification

    Full text link
    Resistive memory is a promising alternative to SRAM, but is also an inherently unstable device that requires substantial effort to ensure correct read and write operations. To avoid the associated costs in terms of area, time and energy, the present work is concerned with exploring how much noise in memory operations can be tolerated by image classification tasks based on neural networks. We introduce a special noisy operator that mimics the noise in an exemplary resistive memory unit, explore the resilience of convolutional neural networks on the CIFAR-10 classification task, and discuss a couple of countermeasures to improve this resilience

    OCC:An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory

    Get PDF
    Memristive devices promise an alternative approach toward non-Von Neumann architectures, where specific computational tasks are performed within the memory devices. In the machine learning (ML) domain, crossbar arrays of resistive devices have shown great promise for ML inference, as they allow for hardware acceleration of matrix multiplications. But, to enable widespread adoption of these novel architectures, it is critical to have an automatic compilation flow as opposed to relying on a manual mapping of specific kernels on the crossbar arrays. We demonstrate the programmability of memristor-based accelerators using the new compiler design principle of multilevel rewriting, where a hierarchy of abstractions lowers programs level-by-level and perform code transformations at the most suitable abstraction. In particular, we develop a prototype compiler, which progressively lowers a mathematical notation for tensor operations arising in ML workloads, to fixed-function memristor-based hardware blocks.</p

    PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

    Full text link
    Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures, for parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and provide orders-of-magnitude data movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, for communication between tiles at scale, for reduction operations, and for efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides spatially aware communication network for efficient intra-tile and inter-tile data movement and provides efficient computation support for generally inefficient bit-serial compute patterns. Our architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is codesigned with a compiler to achieve high utilization. The key novelties of our architecture are: (1) providing efficient support for spatially-aware communication by providing local H-tree network for reductions, by adding explicit hardware for shuffling operands, and by deploying systolic broadcasting, and (2) taking advantage of the divisible nature of bit-serial computations through adaptive precision, bit-slicing and efficient handling of constant operations. When compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100), across common DL kernels and an end-to-end DL network (Resnet18), PIMSAB outperforms the GPU by 3x, and reduces energy by 4.2x. We compare PIMSAB with similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM) and observe a speedup of 3.7x and 3.88x respectively.Comment: Aman Arora and Jian Weng are co-first authors with equal contributio

    Low-Cost Floating-Point Processing in ReRAM for Scientific Computing

    Full text link
    We propose ReFloat, a principled approach for low-cost floating-point processing in ReRAM. The exponent offsets based on a base are stored by a flexible and fine-grained floating-point number representation. The key motivation is that, while the number of exponent bits must be reduced due to the exponential relation to the computation latency and hardware cost, the convergence still requires sufficient accuracy for exponents. Our design reconciles the conflicting goals by storing the exponent offsets from a common base among matrix values in a block, which is the granularity of computation in ReRAM. Due to the value locality, the differences among the exponents in a block are small, thus the offsets require much less number of bits to represent exponents. In essence, ReFloat enables the principled local fine-tuning of floating-point representation. Based on the idea, we define a flexible ReFloat format that specifies matrix block size, and the number of bits for exponent and fraction. To determine the base for each block, we propose an optimization method that minimizes the difference between the exponents of the original matrix block and the converted block. We develop the conversion scheme from default double-precision floating-point format to ReFloat format, the computation procedure, and the low-cost floating-point processing architecture in ReRAM
    corecore