21 research outputs found

    Efficient Sneak Path-aware Training of Binarized Neural Networks for RRAM Crossbar Arrays

    Get PDF
    Although RRAM crossbar arrays have been suggested as an efficient way to implement matrix-vector multiplication (MVM) for DNNs, the sneak path problem of RRAM crossbar arrays due to wire resistance can distort the result of MVM quite significantly, resulting in severe performance degradation of the network. Therefore, a software solution that can predict the effect of sneak paths and mitigate their impact, without permanent hardware cost or expensive SPICE simulations, is very desirable. In this paper, a novel method to incorporate the sneak path problem during training with negligible overhead is proposed. The test validation results, obtained through accurate SPICE simulations, show a large improvement in performance, close to that of the baseline BNNs on GPU, which demonstrates the effectiveness of the proposed method in capturing the sneak path problem.
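    The abstract does not spell out how the sneak path effect is injected during training, so the following is only a minimal sketch of the general pattern of nonideality-aware training, assuming a hypothetical `distortion_model` that cheaply estimates the distorted crossbar output: the forward pass uses the distorted result while gradients flow through the ideal MVM (straight-through style).

```python
# A minimal sketch (not the paper's exact method) of nonideality-aware
# training: the forward pass replaces the ideal MVM with a cheap surrogate
# that models crossbar distortion, while gradients flow through the ideal
# path (straight-through style). `distortion_model` is a hypothetical
# stand-in for whatever sneak-path estimate the method uses.
import torch

class NonidealLinear(torch.nn.Linear):
    def __init__(self, in_features, out_features, distortion_model):
        super().__init__(in_features, out_features, bias=False)
        self.distortion_model = distortion_model  # estimates crossbar error

    def forward(self, x):
        ideal = torch.nn.functional.linear(x, self.weight)
        with torch.no_grad():
            # Cheap estimate of the sneak-path distortion of the ideal
            # output, treated as non-differentiable and hence detached.
            error = self.distortion_model(x, self.weight) - ideal
        # Straight-through: forward sees the distorted output,
        # backward sees the ideal MVM.
        return ideal + error
```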

    IR-QNN Framework: An IR Drop-Aware Offline Training Of Quantized Crossbar Arrays

    Get PDF
    Resistive Crossbar Arrays (RCAs) present an elegant implementation solution for Deep Neural Network (DNN) acceleration. The Matrix-Vector Multiplication, which is the cornerstone of DNNs, is carried out in O(1) steps, compared to O(N^2) steps for digital realizations or O(log2(N)) steps for in-memory associative processors. However, the IR drop problem, caused by the inevitable interconnect wire resistance in RCAs, remains a daunting challenge. In this article, we propose a fast and efficient training and validation framework to incorporate the wire resistance in Quantized DNNs, without the need for computationally expensive SPICE simulations during the training process. A fabricated four-bit Au/Al2O3/HfO2/TiN device is modelled and used within the framework with two mapping schemes to realize the quantized weights. Efficient system-level IR-drop estimation methods are used to accelerate training. SPICE validation results show the effectiveness of the proposed method in capturing the IR drop problem, achieving the baseline accuracy with a 2% and 4% drop in the worst-case scenario for the MNIST dataset on a multilayer perceptron network and the CIFAR-10 dataset on modified VGG and AlexNet networks, respectively. Other nonidealities, such as stuck-at-fault defects, variability, and aging, are studied. Finally, the design considerations of the neuronal and the driver circuits are discussed.
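    As a rough illustration of folding a system-level IR-drop estimate into training (the attenuation model below is purely illustrative, not the estimation method used in the paper), one can degrade the mapped conductances by a position-dependent factor before performing the MVM:

```python
# A minimal sketch, assuming a simple position-dependent attenuation as the
# system-level IR-drop estimate: quantized weights are mapped to
# conductances, degraded, and mapped back before the MVM.
import numpy as np

def ir_drop_aware_mvm(x, w_quant, g_min, g_max, alpha=0.02):
    """x: input vector, w_quant: quantized weights in [0, 1]."""
    rows, cols = w_quant.shape
    g = g_min + w_quant * (g_max - g_min)          # weight -> conductance
    # Illustrative attenuation: cells farther from the drivers/sense amps
    # see a larger effective series resistance.
    row_idx = np.arange(rows)[:, None] / rows
    col_idx = np.arange(cols)[None, :] / cols
    atten = 1.0 / (1.0 + alpha * (row_idx + col_idx))
    g_eff = g * atten                              # IR-drop-degraded array
    w_eff = (g_eff - g_min) / (g_max - g_min)      # conductance -> weight
    return x @ w_eff
```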

    On-chip memory optimization for high-level synthesis of multi-dimensional data on FPGA

    No full text
    It is very challenging to design an on-chip memory architecture for high-performance kernels with a large amount of computation and data. The on-chip memory architecture must support efficient data access from both the computation part and the external memory part, which often have very different expectations about how data should be accessed and stored. Previous work provides only a limited set of optimizations. In this paper we show how to fundamentally restructure on-chip buffers by decoupling the logical array view from the physical buffer view, and by providing general mapping schemes between the two. Our framework considers the entire data flow from the external memory to the computation part in order to minimize resource usage without creating a performance bottleneck. Our experimental results demonstrate that the proposed technique can generate solutions that reduce memory usage significantly (2X over the conventional method), and successfully generates optimized on-chip buffer architectures without costly design iterations for highly optimized computation kernels.
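    A minimal sketch of what decoupling the logical array view from the physical buffer view can look like, using a simple cyclic interleaving as the mapping scheme (the general mapping schemes in the paper are richer than this example):

```python
# A minimal sketch: accesses use logical (row, col) indices, while a mapping
# function decides the physical bank and offset. The cyclic interleaving
# below is just one example scheme, not the paper's specific one.
class BankedBuffer:
    def __init__(self, logical_shape, num_banks):
        self.rows, self.cols = logical_shape
        self.num_banks = num_banks
        depth = (self.rows * self.cols + num_banks - 1) // num_banks
        self.banks = [[None] * depth for _ in range(num_banks)]

    def _map(self, r, c):
        # Logical (r, c) -> flattened index -> physical (bank, offset):
        # cyclic interleaving so consecutive elements land in different
        # banks and can be read in parallel.
        idx = r * self.cols + c
        return idx % self.num_banks, idx // self.num_banks

    def write(self, r, c, value):
        bank, offset = self._map(r, c)
        self.banks[bank][offset] = value

    def read(self, r, c):
        bank, offset = self._map(r, c)
        return self.banks[bank][offset]
```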

    Automated Log-Scale Quantization for Low-Cost Deep Neural Networks

    No full text
    Quantization plays an important role in deep neural network (DNN) hardware. In particular, logarithmic quantization has multiple advantages for DNN hardware implementations, and its weakness in terms of lower performance at high precision compared with linear quantization has recently been remedied by what we call selective two-word logarithmic quantization (STLQ). However, there is a lack of training methods designed for STLQ, or even for logarithmic quantization in general. In this paper we propose a novel STLQ-aware training method, which significantly outperforms the previous state-of-the-art training method for STLQ. Moreover, our training results demonstrate that with our new training method, STLQ applied to the weight parameters of ResNet-18 can achieve the same level of performance as the state-of-the-art quantization method APoT at 3-bit precision. We also apply our method to various DNNs for image enhancement and semantic segmentation, showing competitive results.
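    For illustration, a minimal sketch of STLQ-style quantization of a weight tensor, assuming the simplest selection rule (the weights with the largest residual error receive the second power-of-two term); bit-width constraints and the paper's training method are omitted:

```python
# A minimal sketch of selective two-word logarithmic quantization (STLQ):
# every weight gets a first power-of-two term; only a selected fraction of
# weights gets a second power-of-two term. The selection ratio and rule are
# free parameters here.
import numpy as np

def log2_quantize(w, eps=1e-12):
    """Nearest signed power of two (values near zero stay zero)."""
    sign = np.sign(w)
    mag = np.maximum(np.abs(w), eps)
    q = sign * 2.0 ** np.round(np.log2(mag))
    return np.where(np.abs(w) < eps, 0.0, q)

def stlq(w, ratio=0.1):
    """ratio = fraction of weights that receive a second power-of-two term."""
    first = log2_quantize(w)
    residual = w - first
    k = int(np.ceil(ratio * w.size))
    # Indices of the k largest residual magnitudes get the second term.
    idx = np.argsort(np.abs(residual).ravel())[-k:]
    second = np.zeros_like(w).ravel()
    second[idx] = log2_quantize(residual.ravel()[idx])
    return first + second.reshape(w.shape)
```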

    Successive log quantization for cost-efficient neural networks using stochastic computing

    No full text
    Despite the multifaceted benefits of stochastic computing (SC) such as low cost, low power, and flexible precision, SC-based deep neural networks (DNNs) still suffer from the long-latency problem, especially for those with high precision requirements. While log quantization can be of help, it has its own accuracy-saturation problem due to uneven precision distribution. In this paper we propose successive log quantization (SLQ), which extends log quantization with significant improvements in precision and accuracy, and apply it to state-of-the-art SC-DNNs. SLQ reuses the existing datapath of log quantization, and thus retains its advantages such as simple multiplier hardware. Our experimental results demonstrate that SLQ can significantly extend both the accuracy and efficiency of SC-DNNs over the state-of-the-art solutions, including linear-quantized and log-quantized SC-DNNs, achieving less than 1-1.5%p accuracy drop for AlexNet, SqueezeNet, and VGG-S at a mere 4-5-bit weight resolution.
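    A minimal sketch of the successive idea, assuming a fixed number of power-of-two terms per weight (the exact term count and zero handling in the paper may differ):

```python
# A minimal sketch of successive log quantization: quantize the weight to a
# power of two, then quantize the remaining error to a power of two, and so
# on for a fixed number of terms.
import numpy as np

def slq(w, num_terms=2, eps=1e-12):
    approx = np.zeros_like(w)
    residual = w.copy()
    for _ in range(num_terms):
        sign = np.sign(residual)
        mag = np.maximum(np.abs(residual), eps)
        term = sign * 2.0 ** np.round(np.log2(mag))
        term = np.where(np.abs(residual) < eps, 0.0, term)
        approx += term
        residual = w - approx   # successively shrink the error
    return approx
```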

    Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs

    No full text
    Deep learning workloads such as Convolutional Neural Networks (CNNs) increasingly demand high-performance hardware acceleration. One distinguishing feature of deep learning workloads is that they are inherently resilient to small numerical errors and work very well with low-precision hardware. Thus we propose a novel method, called Double MAC, to theoretically double the computation rate of CNN accelerators by packing two multiply-and-accumulate (MAC) operations into one DSP block of off-the-shelf FPGAs. There are several technical challenges, which we overcome by exploiting the mode of operation of the CNN accelerator. We have validated our method through FPGA synthesis and Verilog simulation, and evaluated it by applying it to a state-of-the-art CNN accelerator. We find that our Double MAC approach can double the computation throughput of a CNN layer. At the network level (all convolution layers combined), the performance improvement varies with the CNN application and FPGA size, from 14% to more than 80% over a highly optimized state-of-the-art accelerator solution, without sacrificing output quality significantly.
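    The arithmetic behind packing two multiplications into one wide multiplier can be sketched as follows; operand widths and the guard-bit margin here are illustrative, and the actual DSP-level packing (signed operands, accumulation) requires additional care:

```python
# A minimal sketch of the packing idea: with two small unsigned operands a
# and b sharing one weight w, (a << SHIFT) + b multiplied by w yields a*w in
# the upper bits and b*w in the lower bits, provided SHIFT leaves enough
# guard bits so the partial products do not overlap.
A_BITS, W_BITS = 8, 8
SHIFT = A_BITS + W_BITS + 2      # room for b*w plus guard bits

def double_mac(a, b, w):
    packed = (a << SHIFT) + b            # pack two activations into one word
    product = packed * w                 # one wide multiplication
    upper = product >> SHIFT             # = a * w
    lower = product & ((1 << SHIFT) - 1) # = b * w
    return upper, lower

assert double_mac(200, 57, 131) == (200 * 131, 57 * 131)
```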

    Offline Training-based Mitigation of IR Drop for ReRAM-based Deep Neural Network Accelerators

    No full text
    Recently, ReRAM-based hardware accelerators have shown unprecedented performance compared to digital accelerators. Technology scaling causes an inevitable increase in interconnect wire resistance, which leads to IR drops that can limit the performance of ReRAM-based accelerators. These IR drops deteriorate signal integrity and quality, especially in the crossbar structures used to build high-density ReRAMs. Hence, a software solution that can predict the effect of IR drop without involving expensive hardware or SPICE simulations is very desirable. In this paper, we propose two neural network models to predict the impact of the IR drop problem. These models are used to evaluate the performance of different deep neural network (DNN) models, including binary and quantized neural networks, showing performance (i.e., recognition accuracy) similar to the golden validation (i.e., SPICE-based DNN validation). In addition, these prediction models are incorporated into a DNN training framework to efficiently retrain the DNN models and bridge the accuracy gap. To further enhance the validation accuracy, we propose incremental training methods. The DNN validation results, obtained through SPICE simulations, show a large improvement in performance, close to the baseline, which demonstrates the efficacy of the proposed method even with challenging datasets such as CIFAR-10 and SVHN.
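    A minimal sketch of the overall idea, assuming a small surrogate network trained offline (e.g., against SPICE data) that maps the ideal layer output to its IR-drop-affected counterpart; for brevity it is applied at the model output here, whereas in practice it would wrap each crossbar-mapped layer. The surrogate's architecture and inputs are placeholders, not the models proposed in the paper:

```python
# A hypothetical frozen surrogate that predicts the IR-drop-affected output
# from the ideal output, inserted into the forward pass during retraining.
import torch

class IRDropSurrogate(torch.nn.Module):
    def __init__(self, cols):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(cols, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, cols))

    def forward(self, ideal_out):
        # Predicts the degraded output given the ideal MVM output.
        return ideal_out + self.net(ideal_out)

def retrain_step(model, surrogate, x, y, loss_fn, optimizer):
    surrogate.requires_grad_(False)      # surrogate stays fixed
    out = surrogate(model(x))            # forward through predicted nonideality
    loss = loss_fn(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```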

    Accurate Prediction of ReRAM Crossbar Performance Under I-V Nonlinearity and IR Drop

    No full text

    MLogNet: A Logarithmic Quantization-Based Accelerator for Depthwise Separable Convolution

    No full text
    In this paper we propose a novel logarithmic quantization-based DNN (Deep Neural Network) accelerator architecture for depthwise separable convolution (DSC) networks. Our architecture is based on selective two-word logarithmic quantization (STLQ), which greatly improves accuracy over logarithmic-scale quantization while retaining the speed and area advantage of logarithmic quantization. However, STLQ also introduces a synchronization problem due to variable-latency PEs (processing elements), which we address through a novel architecture and a compile-time optimization technique. Our architecture is dynamically reconfigurable to efficiently support various combinations of depthwise and pointwise convolution layers. Our experimental results using layers from MobileNetV2 and ShuffleNetV2 demonstrate that our architecture is significantly faster and more area-efficient than previous DSC accelerator architectures as well as previous accelerators utilizing logarithmic quantization.
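    The abstract does not detail the compile-time optimization, but one plausible sketch of the underlying scheduling concern is balancing one-word and two-word weights across PEs so their cycle counts match; the greedy makespan heuristic below is purely illustrative, not the paper's technique:

```python
# A minimal sketch: if a two-word (STLQ) weight costs two cycles and a
# one-word weight costs one, assigning weights so per-PE cycle counts are
# balanced keeps the variable-latency PEs roughly synchronized.
import heapq

def balance_weights(weight_costs, num_pes):
    """weight_costs: list of 1 (one-word) or 2 (two-word) cycle costs."""
    heap = [(0, pe) for pe in range(num_pes)]   # (cycles so far, PE id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pes)]
    # Assign expensive weights first (classic greedy makespan heuristic).
    for i in sorted(range(len(weight_costs)), key=lambda i: -weight_costs[i]):
        cycles, pe = heapq.heappop(heap)
        assignment[pe].append(i)
        heapq.heappush(heap, (cycles + weight_costs[i], pe))
    return assignment
```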

    Fast and Low-Cost Mitigation of ReRAM Variability for Deep Learning Applications

    No full text
    To overcome the programming variability (PV) of ReRAM crossbar arrays (RCAs), the most common method is program-verify, which, however, has high energy and latency overhead. In this paper we propose a very fast and low-cost method to mitigate the effect of PV and other variability for RCA-based DNN (Deep Neural Network) accelerators. Leveraging the statistical properties of DNN outputs, our method, called Online Batch-Norm Correction (OBNC), can compensate for the effect of programming and other variability on RCA output without using on-chip training or an iterative procedure, and is thus very fast. Moreover, our method does not require a nonideality model or a training dataset, and is hence very easy to apply. Our experimental results using ternary neural networks with binary and 4-bit activations demonstrate that OBNC can recover the baseline performance in many variability settings and that it outperforms a previously known method (VCAM) by large margins when the input distribution is asymmetric or the activation is multi-bit.
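    A minimal sketch of the batch-norm correction idea, assuming measured statistics from a calibration batch are written back into the batch-norm layer (the per-channel handling follows standard batch norm and is not necessarily the paper's exact procedure):

```python
# Run a batch through the (variability-affected) crossbar layer, measure the
# mean/variance of its outputs, and overwrite the batch-norm layer's running
# statistics so normalization re-centers the shifted distribution.
import torch

@torch.no_grad()
def online_bn_correction(crossbar_layer, bn_layer, calib_batch):
    out = crossbar_layer(calib_batch)            # degraded crossbar outputs
    dims = [d for d in range(out.dim()) if d != 1]
    bn_layer.running_mean.copy_(out.mean(dim=dims))
    bn_layer.running_var.copy_(out.var(dim=dims, unbiased=False))
    # At inference, bn_layer (in eval mode) now normalizes with the measured
    # statistics, compensating the shift introduced by device variability.
```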