
    Compute-in-Memory with Emerging Non-Volatile Memories for Accelerating Deep Neural Networks

    The objective of this research is to accelerate deep neural networks (DNNs) with a compute-in-memory (CIM) architecture based on emerging non-volatile memories (eNVMs). The research first focuses on inference acceleration and proposes a resistive random access memory (RRAM) based CIM architecture. Two generations of RRAM test chips, which monolithically integrate the RRAM memory array and CMOS peripheral circuits, are designed and fabricated in Winbond 90 nm and TSMC 40 nm commercial embedded RRAM processes, respectively. The first-generation test chip, named XNOR-RRAM, is dedicated to binary neural networks (BNNs); the second generation, named Flex-RRAM, features 1-bit to 8-bit run-time configurable precision and leverages the input sparsity of the DNN model to improve throughput and energy efficiency. However, the non-ideal characteristics of eNVM devices, especially when used as multi-level analog synaptic weights, can incur notable accuracy degradation for both training and inference. This research develops a PyTorch-based framework that incorporates the device characteristics into the DNN model to evaluate the impact of eNVM non-idealities on training and inference accuracy. The results suggest that it is challenging to use eNVMs directly for in-situ training and that resistance drift remains a critical challenge for maintaining high inference accuracy. Furthermore, to overcome the asymmetric conductance tuning behavior of typical eNVMs, found to be the most critical non-ideality preventing the model from reaching software-equivalent training accuracy, this research proposes a novel 2-transistor-1-FeFET (ferroelectric field effect transistor) based synaptic weight cell that exploits hybrid precision for in-situ training and inference, achieving near-software classification accuracy on the MNIST and CIFAR-10 datasets. (Ph.D. dissertation)
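    As a rough illustration of how such device non-idealities can be folded into a PyTorch model for inference evaluation, a minimal sketch follows. This is not the thesis framework: the level count, noise magnitude, drift factor, and helper names (apply_envm_nonidealities, evaluate_with_nonidealities) are illustrative assumptions.

        # Minimal sketch of eNVM non-ideality injection for inference evaluation.
        # All device parameters below are placeholders, not measured values.
        import copy
        import torch
        import torch.nn as nn

        def apply_envm_nonidealities(weight, n_levels=16, sigma=0.02, drift=0.0):
            """Quantize weights to a limited number of conductance levels, then add
            programming noise and a multiplicative resistance-drift term."""
            w_max = weight.abs().max()
            step = 2 * w_max / (n_levels - 1)
            w_q = torch.round(weight / step) * step             # multi-level analog cell
            w_q = w_q + sigma * w_max * torch.randn_like(w_q)   # programming noise
            return w_q * (1.0 - drift)                          # uniform drift toward zero

        @torch.no_grad()
        def evaluate_with_nonidealities(model, loader, device="cpu", **dev_kwargs):
            """Clone the model, perturb every Linear/Conv2d weight, measure accuracy."""
            noisy = copy.deepcopy(model).to(device).eval()
            for m in noisy.modules():
                if isinstance(m, (nn.Linear, nn.Conv2d)):
                    m.weight.copy_(apply_envm_nonidealities(m.weight, **dev_kwargs))
            correct = total = 0
            for x, y in loader:
                pred = noisy(x.to(device)).argmax(dim=1)
                correct += (pred == y.to(device)).sum().item()
                total += y.numel()
            return correct / total

    Sweeping n_levels, sigma, and drift against measured device data would reproduce, in spirit, the accuracy-versus-non-ideality studies described above.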

    In-memory computing with emerging memory devices: Status and outlook

    Supporting data for "In-memory computing with emerging memory devices: Status and outlook", submitted to APL Machine Learning.

    TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge

    Extreme edge devices, or Internet-of-Things nodes, require both ultra-low-power always-on processing and the ability to perform on-demand sampling and processing. Moreover, support for IoT applications such as voice recognition and machine monitoring requires the ability to execute a wide range of ML workloads. This poses hardware-design challenges for building flexible processors that operate in the ultra-low-power regime. This paper presents TinyVers, a tiny, versatile, ultra-low-power ML system-on-chip that enables enhanced intelligence at the extreme edge. TinyVers exploits dataflow reconfiguration to enable multi-modal support, and aggressive on-chip power management for duty cycling to enable smart sensing applications. The SoC combines a RISC-V host processor, a 17 TOPS/W dataflow-reconfigurable ML accelerator, a 1.7 μW deep-sleep wake-up controller, and an eMRAM for boot code and ML parameter retention. The SoC can perform up to 17.6 GOPS while spanning a power consumption range of 1.7 μW to 20 mW. Multiple ML workloads aimed at diverse applications are mapped onto the SoC to showcase its flexibility and efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power consumption below 230 μW in continuous operation. In a duty-cycling use case for machine monitoring, this power is reduced to below 10 μW. (Accepted in the IEEE Journal of Solid-State Circuits.)
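    For intuition on the duty-cycling figure, a back-of-the-envelope average-power estimate can be formed from the quoted numbers; the 5% duty cycle below is an illustrative assumption, not a value from the paper.

        # Time-weighted average power for a periodically waking SoC.
        # Active and sleep power are taken from the abstract (sub-230 uW continuous
        # inference, 1.7 uW deep sleep); the duty cycle is an assumed example value.
        def average_power_uw(p_active_uw: float, p_sleep_uw: float, duty_cycle: float) -> float:
            return duty_cycle * p_active_uw + (1.0 - duty_cycle) * p_sleep_uw

        print(average_power_uw(p_active_uw=230.0, p_sleep_uw=1.7, duty_cycle=0.05))
        # ~13 uW; a lower-power workload or a smaller duty cycle brings this under
        # the 10 uW reported for the machine-monitoring use case.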

    Simulation and programming strategies to mitigate device non-idealities in memristor based neuromorphic systems

    Since its inception, resistive random access memory (RRAM) has widely been regarded as a promising technology, not only for its potential to revolutionize non-volatile data storage by bridging the speed gap between traditional solid-state drives (SSDs) and dynamic random access memory (DRAM), but also for the promise it brings to in-memory and neuromorphic computing. Despite this potential, the design process for RRAM neuromorphic arrays is still in its infancy, as reliability (retention, endurance, programming linearity) and variability (read-to-read, cycle-to-cycle, and device-to-device) issues remain major hurdles to the mainstream implementation of these systems. One of the fundamental stages of neuromorphic design is simulation. In this thesis, a simulation framework for evaluating the impact of RRAM non-idealities on neural networks (NNs), emphasizing flexibility and experimentation in NN topology and RRAM programming conditions, is implemented in MATLAB, making full use of its various toolboxes. Using these tools as the groundwork, various RRAM non-idealities are comprehensively measured and their impact on both the inference and training accuracy of a pattern recognition system based on the MNIST handwritten digits dataset is simulated. On the inference front, variability originating from different sources (read-to-read and programming-to-programming) is statistically evaluated and modelled for two device types: filamentary and non-filamentary. Based on these results, the impact of the various variability sources on inference is simulated and compared, showing much more pronounced variability in the filamentary device than in its non-filamentary counterpart. A staged programming scheme is introduced as a method to improve linearity and reduce programming variability, leading to negligible accuracy loss in non-filamentary devices. Random telegraph noise (RTN) remains the major source of read variability in both devices. These results can be explained by the difference in the switching mechanisms of the two devices. In training, non-idealities such as conductance stepping and cycle-to-cycle variability are characterized and their impact on the training of backpropagation-based NNs is evaluated independently. Analysing how the weight distributions change during training reveals the different impacts on the SET and RESET processes. Based on these findings, a new selective programming strategy is introduced to suppress the impact of non-idealities on accuracy. Furthermore, the impact of these methods is analysed across different NN topologies, including traditional multi-layer perceptron (MLP) and convolutional neural network (CNN) configurations. Finally, a new dynamic weight-range rescaling methodology is introduced, not only to alleviate the hardware constraints imposed by the limited conductance range of RRAM during training, but also to increase the flexibility of RRAM-based deep synaptic layers across different datasets.
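    The thesis framework itself is written in MATLAB; purely as an illustration of the weight-to-conductance mapping and cycle-to-cycle variability it studies, a short Python sketch follows. The conductance window, 3% programming spread, and differential-pair mapping are assumed values and choices, not the thesis's measured parameters.

        # Illustrative mapping of signed weights onto a limited RRAM conductance
        # window (differential G+/G- pair) with cycle-to-cycle programming spread.
        import numpy as np

        G_MIN, G_MAX = 5e-6, 50e-6   # assumed usable conductance window, in siemens
        C2C_SIGMA = 0.03             # assumed cycle-to-cycle programming spread (3%)

        def weights_to_conductance(w, w_scale):
            """Dynamic range rescaling: stretch the layer's current weight range onto
            the full conductance window, using a differential pair for signed values."""
            g_range = G_MAX - G_MIN
            g_pos = G_MIN + np.clip(+w / w_scale, 0.0, 1.0) * g_range
            g_neg = G_MIN + np.clip(-w / w_scale, 0.0, 1.0) * g_range
            return g_pos, g_neg

        def program_with_variability(g, rng):
            """Multiplicative cycle-to-cycle variability on each programmed cell."""
            return np.clip(g * rng.normal(1.0, C2C_SIGMA, size=g.shape), G_MIN, G_MAX)

        def conductance_to_weights(g_pos, g_neg, w_scale):
            """Read back effective weights from the differential conductance pair."""
            return (g_pos - g_neg) / (G_MAX - G_MIN) * w_scale

        rng = np.random.default_rng(0)
        w = 0.5 * rng.standard_normal((4, 4))
        scale = np.abs(w).max()                       # rescale to this layer's range
        g_p, g_n = weights_to_conductance(w, scale)
        w_eff = conductance_to_weights(program_with_variability(g_p, rng),
                                       program_with_variability(g_n, rng), scale)
        print(np.abs(w - w_eff).mean())               # mean weight error after programming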

    Hardware-Friendly Model Compression techniques for Deep Learning Accelerators

    The objective of the proposed research is to introduce solutions that make energy-efficient deep neural network (DNN) accelerators deployable on edge devices, by developing hardware-aware DNN compression methods. The rising popularity of intelligent mobile devices and the computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. In particular, we propose four compression techniques for energy- and memory-efficient DNN computing. In the first method, LGPS, we present a hardware-aware pruning method in which the locations of non-zero weights are derived in real time from a linear feedback shift register (LFSR). Using the proposed method, we demonstrate total energy and area savings of up to 63.96% and 64.23%, respectively, for the VGG-16 network on down-sampled ImageNet, at iso-compression-rate and iso-accuracy. Second, we achieve ultra-low bit-precision deep learning models by developing a quantization scheme based on knowledge distillation and gradual quantization of the pruned network. Third, we propose a novel model compression scheme that allows inference to be carried out using bit-level sparsity, which can be efficiently implemented with compute-in-memory macros. We introduce a method called BitS-Net to leverage the benefits of bit sparsity (where the number of zeros exceeds the number of ones in the binary representation of weight/activation values) when applied to compute-in-memory (CIM) with resistive random access memory (RRAM), to develop energy-efficient DNN accelerators operating in inference mode. We demonstrate that BitS-Net improves energy efficiency by up to 5x for ResNet models on the ImageNet dataset. In the last part, to achieve highly energy-efficient DNNs, we introduce a novel twofold sparsity method (Twofold Sparsity, twofoldS-Net) that sparsifies DNN models at the bit and network levels simultaneously. We add two separate regularization terms to the loss function to achieve bit- and network-level sparsity at the same time. We sparsify the model at the network level by applying a mask generated by an LFSR; for bit-level sparsity, we quantize the network to an 8-bit two's-complement representation. During inference we take advantage of the CIM architecture and LFSR indexing. We show that with the proposed methods we can sparsify the network and design a highly energy-efficient deep learning accelerator, eventually bringing AI to our daily lives. (Ph.D. dissertation)
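    As a rough sketch of the LFSR-derived pruning idea, not the actual LGPS implementation, the snippet below regenerates a pruning mask from a 16-bit LFSR so that non-zero weight locations need not be stored. The tap polynomial, seed, and keep ratio are illustrative assumptions.

        # Derive non-zero weight locations from an LFSR instead of storing indices.
        import numpy as np

        def lfsr_stream(seed, n):
            """16-bit Fibonacci LFSR (polynomial x^16 + x^14 + x^13 + x^11 + 1)."""
            state, out = seed & 0xFFFF, []
            for _ in range(n):
                bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
                state = (state >> 1) | (bit << 15)
                out.append(state)
            return np.array(out)

        def lfsr_prune_mask(shape, keep_ratio=0.25, seed=0xACE1):
            """Keep a weight only where the LFSR value falls below the keep threshold,
            so the mask can be regenerated on-chip rather than stored."""
            vals = lfsr_stream(seed, int(np.prod(shape)))
            return (vals < keep_ratio * (1 << 16)).astype(np.float32).reshape(shape)

        mask = lfsr_prune_mask((128, 64), keep_ratio=0.25)
        print(mask.mean())   # fraction of weights kept, close to the requested ratio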