21 research outputs found

    Efficient Sneak Path-aware Training of Binarized Neural Networks for RRAM Crossbar Arrays

    Get PDF
    Although RRAM crossbar arrays have been suggested as an efficient way to implement matrix-vector multiplication (MVM) for DNNs, the sneak path problem of RRAM crossbar arrays due to wire resistance can distort the result of MVM quite significantly, resulting in severe performance degradation of the network. Therefore, a software solution that can predict the effect of sneak paths and mitigate their impact, without permanent hardware cost or expensive SPICE simulations, is very desirable. In this paper, a novel method to incorporate the sneak path problem during training with negligible overhead is proposed. The test validation results, obtained through accurate SPICE simulations, show a large improvement in performance, close to that of the baseline BNNs on GPU, which demonstrates the effectiveness of the proposed method in capturing the sneak path problem.
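    The abstract does not spell out how the sneak path effect is injected during training, so the following is only a minimal sketch of the general pattern of nonideality-aware training, assuming a hypothetical `distortion_model` that cheaply estimates the distorted crossbar output: the forward pass uses the distorted result while gradients flow through the ideal MVM (straight-through style).

```python
# A minimal sketch (not the paper's exact method) of nonideality-aware
# training: the forward pass replaces the ideal MVM with a cheap surrogate
# that models crossbar distortion, while gradients flow through the ideal
# path (straight-through style). `distortion_model` is a hypothetical
# stand-in for whatever sneak-path estimate the method uses.
import torch

class NonidealLinear(torch.nn.Linear):
    def __init__(self, in_features, out_features, distortion_model):
        super().__init__(in_features, out_features, bias=False)
        self.distortion_model = distortion_model  # estimates crossbar error

    def forward(self, x):
        ideal = torch.nn.functional.linear(x, self.weight)
        with torch.no_grad():
            # Cheap estimate of the sneak-path distortion of the ideal
            # output, treated as non-differentiable and hence detached.
            error = self.distortion_model(x, self.weight) - ideal
        # Straight-through: forward sees the distorted output,
        # backward sees the ideal MVM.
        return ideal + error
```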

    IR-QNN Framework: An IR Drop-Aware Offline Training Of Quantized Crossbar Arrays

    Get PDF
    Resistive Crossbar Arrays (RCAs) present an elegant implementation solution for Deep Neural Network (DNN) acceleration. The Matrix-Vector Multiplication, which is the cornerstone of DNNs, is carried out in O(1) steps, compared to O(N^2) steps for digital realizations or O(log2(N)) steps for in-memory associative processors. However, the IR drop problem, caused by the inevitable interconnect wire resistance in RCAs, remains a daunting challenge. In this article, we propose a fast and efficient training and validation framework to incorporate the wire resistance in Quantized DNNs, without the need for computationally expensive SPICE simulations during the training process. A fabricated four-bit Au/Al2O3/HfO2/TiN device is modelled and used within the framework with two mapping schemes to realize the quantized weights. Efficient system-level IR-drop estimation methods are used to accelerate training. SPICE validation results show the effectiveness of the proposed method in capturing the IR drop problem, achieving the baseline accuracy with a 2% and 4% drop in the worst-case scenario for the MNIST dataset on a multilayer perceptron network and the CIFAR-10 dataset on modified VGG and AlexNet networks, respectively. Other nonidealities, such as stuck-at-fault defects, variability, and aging, are studied. Finally, the design considerations of the neuronal and the driver circuits are discussed.
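    As a rough illustration of folding a system-level IR-drop estimate into training (the attenuation model below is purely illustrative, not the estimation method used in the paper), one can degrade the mapped conductances by a position-dependent factor before performing the MVM:

```python
# A minimal sketch, assuming a simple position-dependent attenuation as the
# system-level IR-drop estimate: quantized weights are mapped to
# conductances, degraded, and mapped back before the MVM.
import numpy as np

def ir_drop_aware_mvm(x, w_quant, g_min, g_max, alpha=0.02):
    """x: input vector, w_quant: quantized weights in [0, 1]."""
    rows, cols = w_quant.shape
    g = g_min + w_quant * (g_max - g_min)          # weight -> conductance
    # Illustrative attenuation: cells farther from the drivers/sense amps
    # see a larger effective series resistance.
    row_idx = np.arange(rows)[:, None] / rows
    col_idx = np.arange(cols)[None, :] / cols
    atten = 1.0 / (1.0 + alpha * (row_idx + col_idx))
    g_eff = g * atten                              # IR-drop-degraded array
    w_eff = (g_eff - g_min) / (g_max - g_min)      # conductance -> weight
    return x @ w_eff
```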

    On-chip memory optimization for high-level synthesis of multi-dimensional data on FPGA

    No full text
    It is very challenging to design an on-chip memory architecture for high-performance kernels with a large amount of computation and data. The on-chip memory architecture must support efficient data access from both the computation part and the external memory part, which often have very different expectations about how data should be accessed and stored. Previous work provides only a limited set of optimizations. In this paper we show how to fundamentally restructure on-chip buffers by decoupling the logical array view from the physical buffer view, and by providing general mapping schemes between the two. Our framework considers the entire data flow from the external memory to the computation part in order to minimize resource usage without creating a performance bottleneck. Our experimental results demonstrate that the proposed technique can generate solutions that reduce memory usage significantly (2X over the conventional method), and successfully generates optimized on-chip buffer architectures without costly design iterations for highly optimized computation kernels.
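    A minimal sketch of what decoupling the logical array view from the physical buffer view can look like, using a simple cyclic interleaving as the mapping scheme (the general mapping schemes in the paper are richer than this example):

```python
# A minimal sketch: accesses use logical (row, col) indices, while a mapping
# function decides the physical bank and offset. The cyclic interleaving
# below is just one example scheme, not the paper's specific one.
class BankedBuffer:
    def __init__(self, logical_shape, num_banks):
        self.rows, self.cols = logical_shape
        self.num_banks = num_banks
        depth = (self.rows * self.cols + num_banks - 1) // num_banks
        self.banks = [[None] * depth for _ in range(num_banks)]

    def _map(self, r, c):
        # Logical (r, c) -> flattened index -> physical (bank, offset):
        # cyclic interleaving so consecutive elements land in different
        # banks and can be read in parallel.
        idx = r * self.cols + c
        return idx % self.num_banks, idx // self.num_banks

    def write(self, r, c, value):
        bank, offset = self._map(r, c)
        self.banks[bank][offset] = value

    def read(self, r, c):
        bank, offset = self._map(r, c)
        return self.banks[bank][offset]
```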

    Automated Log-Scale Quantization for Low-Cost Deep Neural Networks

    No full text
    Quantization plays an important role in deep neural network (DNN) hardware. In particular, logarithmic quantization has multiple advantages for DNN hardware implementations, and its weakness in terms of lower performance at high precision compared with linear quantization has recently been remedied by what we call selective two-word logarithmic quantization (STLQ). However, there is a lack of training methods designed for STLQ, or even for logarithmic quantization in general. In this paper we propose a novel STLQ-aware training method, which significantly outperforms the previous state-of-the-art training method for STLQ. Moreover, our training results demonstrate that with our new training method, STLQ applied to the weight parameters of ResNet-18 can achieve the same level of performance as the state-of-the-art quantization method APoT at 3-bit precision. We also apply our method to various DNNs for image enhancement and semantic segmentation, showing competitive results.
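    For illustration, a minimal sketch of STLQ-style quantization of a weight tensor, assuming the simplest selection rule (the weights with the largest residual error receive the second power-of-two term); bit-width constraints and the paper's training method are omitted:

```python
# A minimal sketch of selective two-word logarithmic quantization (STLQ):
# every weight gets a first power-of-two term; only a selected fraction of
# weights gets a second power-of-two term. The selection ratio and rule are
# free parameters here.
import numpy as np

def log2_quantize(w, eps=1e-12):
    """Nearest signed power of two (values near zero stay zero)."""
    sign = np.sign(w)
    mag = np.maximum(np.abs(w), eps)
    q = sign * 2.0 ** np.round(np.log2(mag))
    return np.where(np.abs(w) < eps, 0.0, q)

def stlq(w, ratio=0.1):
    """ratio = fraction of weights that receive a second power-of-two term."""
    first = log2_quantize(w)
    residual = w - first
    k = int(np.ceil(ratio * w.size))
    # Indices of the k largest residual magnitudes get the second term.
    idx = np.argsort(np.abs(residual).ravel())[-k:]
    second = np.zeros_like(w).ravel()
    second[idx] = log2_quantize(residual.ravel()[idx])
    return first + second.reshape(w.shape)
```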

    Successive log quantization for cost-efficient neural networks using stochastic computing

    No full text
    Despite the multifaceted benefits of stochastic computing (SC) such as low cost, low power, and flexible precision, SC-based deep neural networks (DNNs) still suffer from the long-latency problem, especially for those with high precision requirements. While log quantization can be of help, it has its own accuracy-saturation problem due to uneven precision distribution. In this paper we propose successive log quantization (SLQ), which extends log quantization with significant improvements in precision and accuracy, and apply it to state-of-the-art SC-DNNs. SLQ reuses the existing datapath of log quantization, and thus retains its advantages such as simple multiplier hardware. Our experimental results demonstrate that SLQ can significantly extend both the accuracy and efficiency of SC-DNNs over the state-of-the-art solutions, including linear-quantized and log-quantized SC-DNNs, achieving less than 1-1.5%p accuracy drop for AlexNet, SqueezeNet, and VGG-S at a mere 4-5-bit weight resolution.
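    A minimal sketch of the successive idea, assuming a fixed number of power-of-two terms per weight (the exact term count and zero handling in the paper may differ):

```python
# A minimal sketch of successive log quantization: quantize the weight to a
# power of two, then quantize the remaining error to a power of two, and so
# on for a fixed number of terms.
import numpy as np

def slq(w, num_terms=2, eps=1e-12):
    approx = np.zeros_like(w)
    residual = w.copy()
    for _ in range(num_terms):
        sign = np.sign(residual)
        mag = np.maximum(np.abs(residual), eps)
        term = sign * 2.0 ** np.round(np.log2(mag))
        term = np.where(np.abs(residual) < eps, 0.0, term)
        approx += term
        residual = w - approx   # successively shrink the error
    return approx
```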

    Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs

    No full text
    Deep learning workloads such as Convolutional Neural Networks (CNNs) increasingly demand high-performance hardware acceleration. One distinguishing feature of deep learning workloads is that they are inherently resilient to small numerical errors and work very well with low-precision hardware. Thus we propose a novel method, called Double MAC, to theoretically double the computation rate of CNN accelerators by packing two multiply-and-accumulate (MAC) operations into one DSP block of off-the-shelf FPGAs. There are several technical challenges, which we overcome by exploiting the mode of operation of the CNN accelerator. We have validated our method through FPGA synthesis and Verilog simulation, and evaluated it by applying it to a state-of-the-art CNN accelerator. We find that our Double MAC approach can double the computation throughput of a CNN layer. At the network level (all convolution layers combined), the performance improvement varies with the CNN application and FPGA size, from 14% to more than 80% over a highly optimized state-of-the-art accelerator solution, without sacrificing output quality significantly.
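    The arithmetic behind packing two multiplications into one wide multiplier can be sketched as follows; operand widths and the guard-bit margin here are illustrative, and the actual DSP-level packing (signed operands, accumulation) requires additional care:

```python
# A minimal sketch of the packing idea: with two small unsigned operands a
# and b sharing one weight w, (a << SHIFT) + b multiplied by w yields a*w in
# the upper bits and b*w in the lower bits, provided SHIFT leaves enough
# guard bits so the partial products do not overlap.
A_BITS, W_BITS = 8, 8
SHIFT = A_BITS + W_BITS + 2      # room for b*w plus guard bits

def double_mac(a, b, w):
    packed = (a << SHIFT) + b            # pack two activations into one word
    product = packed * w                 # one wide multiplication
    upper = product >> SHIFT             # = a * w
    lower = product & ((1 << SHIFT) - 1) # = b * w
    return upper, lower

assert double_mac(200, 57, 131) == (200 * 131, 57 * 131)
```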

    Offline Training-based Mitigation of IR Drop for ReRAM-based Deep Neural Network Accelerators

    No full text
    Recently, ReRAM-based hardware accelerators have shown unprecedented performance compared to digital accelerators. Technology scaling causes an inevitable increase in interconnect wire resistance, which leads to IR drops that can limit the performance of ReRAM-based accelerators. These IR drops deteriorate signal integrity and quality, especially in the crossbar structures used to build high-density ReRAMs. Hence, a software solution that can predict the effect of IR drop without involving expensive hardware or SPICE simulations is very desirable. In this paper, we propose two neural network models to predict the impact of the IR drop problem. These models are used to evaluate the performance of different deep neural network (DNN) models, including binary and quantized neural networks, showing performance (i.e., recognition accuracy) similar to the golden validation (i.e., SPICE-based DNN validation). In addition, these prediction models are incorporated into a DNN training framework to efficiently retrain the DNN models and bridge the accuracy gap. To further enhance the validation accuracy, we propose incremental training methods. The DNN validation results, obtained through SPICE simulations, show a large improvement in performance, close to the baseline, which demonstrates the efficacy of the proposed method even with challenging datasets such as CIFAR-10 and SVHN.
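    A minimal sketch of the overall idea, assuming a small surrogate network trained offline (e.g., against SPICE data) that maps the ideal layer output to its IR-drop-affected counterpart; for brevity it is applied at the model output here, whereas in practice it would wrap each crossbar-mapped layer. The surrogate's architecture and inputs are placeholders, not the models proposed in the paper:

```python
# A hypothetical frozen surrogate that predicts the IR-drop-affected output
# from the ideal output, inserted into the forward pass during retraining.
import torch

class IRDropSurrogate(torch.nn.Module):
    def __init__(self, cols):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(cols, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, cols))

    def forward(self, ideal_out):
        # Predicts the degraded output given the ideal MVM output.
        return ideal_out + self.net(ideal_out)

def retrain_step(model, surrogate, x, y, loss_fn, optimizer):
    surrogate.requires_grad_(False)      # surrogate stays fixed
    out = surrogate(model(x))            # forward through predicted nonideality
    loss = loss_fn(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```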

    Accurate Prediction of ReRAM Crossbar Performance Under I-V Nonlinearity and IR Drop

    No full text

    MLogNet: A Logarithmic Quantization-Based Accelerator for Depthwise Separable Convolution

    No full text
    In this paper we propose a novel logarithmic quantization-based DNN (Deep Neural Network) accelerator architecture for depthwise separable convolution (DSC) networks. Our architecture is based on selective two-word logarithmic quantization (STLQ), which greatly improves accuracy over logarithmic-scale quantization while retaining the speed and area advantage of logarithmic quantization. However, STLQ also introduces a synchronization problem due to variable-latency PEs (processing elements), which we address through a novel architecture and a compile-time optimization technique. Our architecture is dynamically reconfigurable to efficiently support various combinations of depthwise and pointwise convolution layers. Our experimental results using layers from MobileNetV2 and ShuffleNetV2 demonstrate that our architecture is significantly faster and more area-efficient than previous DSC accelerator architectures as well as previous accelerators utilizing logarithmic quantization.
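    The abstract does not detail the compile-time optimization, but one plausible sketch of the underlying scheduling concern is balancing one-word and two-word weights across PEs so their cycle counts match; the greedy makespan heuristic below is purely illustrative, not the paper's technique:

```python
# A minimal sketch: if a two-word (STLQ) weight costs two cycles and a
# one-word weight costs one, assigning weights so per-PE cycle counts are
# balanced keeps the variable-latency PEs roughly synchronized.
import heapq

def balance_weights(weight_costs, num_pes):
    """weight_costs: list of 1 (one-word) or 2 (two-word) cycle costs."""
    heap = [(0, pe) for pe in range(num_pes)]   # (cycles so far, PE id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pes)]
    # Assign expensive weights first (classic greedy makespan heuristic).
    for i in sorted(range(len(weight_costs)), key=lambda i: -weight_costs[i]):
        cycles, pe = heapq.heappop(heap)
        assignment[pe].append(i)
        heapq.heappush(heap, (cycles + weight_costs[i], pe))
    return assignment
```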

    Fast and Low-Cost Mitigation of ReRAM Variability for Deep Learning Applications

    No full text
    To overcome the programming variability (PV) of ReRAM crossbar arrays (RCAs), the most common method is program-verify, which, however, has high energy and latency overhead. In this paper we propose a very fast and low-cost method to mitigate the effect of PV and other variability for RCA-based DNN (Deep Neural Network) accelerators. Leveraging the statistical properties of DNN outputs, our method, called Online Batch-Norm Correction (OBNC), can compensate for the effect of programming and other variability on RCA output without using on-chip training or an iterative procedure, and is thus very fast. Moreover, our method does not require a nonideality model or a training dataset, and is hence very easy to apply. Our experimental results using ternary neural networks with binary and 4-bit activations demonstrate that OBNC can recover the baseline performance in many variability settings and that it outperforms a previously known method (VCAM) by large margins when the input distribution is asymmetric or the activation is multi-bit.
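    A minimal sketch of the batch-norm correction idea, assuming measured statistics from a calibration batch are written back into the batch-norm layer (the per-channel handling follows standard batch norm and is not necessarily the paper's exact procedure):

```python
# Run a batch through the (variability-affected) crossbar layer, measure the
# mean/variance of its outputs, and overwrite the batch-norm layer's running
# statistics so normalization re-centers the shifted distribution.
import torch

@torch.no_grad()
def online_bn_correction(crossbar_layer, bn_layer, calib_batch):
    out = crossbar_layer(calib_batch)            # degraded crossbar outputs
    dims = [d for d in range(out.dim()) if d != 1]
    bn_layer.running_mean.copy_(out.mean(dim=dims))
    bn_layer.running_var.copy_(out.var(dim=dims, unbiased=False))
    # At inference, bn_layer (in eval mode) now normalizes with the measured
    # statistics, compensating the shift introduced by device variability.
```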