Probabilistic Circuits for Autonomous Learning: A simulation study
Modern machine learning is based on powerful algorithms running on digital computing platforms, and there is great interest in accelerating the learning process and making it more energy efficient. In this paper we present a fully autonomous probabilistic circuit for fast and efficient learning that makes no use of digital computing. Specifically, we use SPICE simulations to demonstrate a clockless autonomous circuit where the required synaptic weights are read out in the form of analog voltages. Such autonomous circuits could be of particular interest as standalone learning devices in the context of mobile and edge computing.
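As an illustration only: the circuit itself is demonstrated in SPICE, but the behavior of a single probabilistic element of this kind can be sketched in a few lines of Python. The tanh response and the weight values below are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_bit(inputs, weights, bias=0.0):
    """Behavioral model of a single probabilistic element (p-bit).

    The output is +1 or -1, with the probability of +1 set by tanh of the
    weighted input; the weights here stand in for the analog synaptic
    voltages read out in the circuit (an assumption for illustration).
    """
    activation = np.dot(weights, inputs) + bias
    return 1.0 if rng.uniform(-1.0, 1.0) < np.tanh(activation) else -1.0

# The time-averaged output tracks tanh of the net input.
samples = [p_bit(np.array([1.0, -1.0]), np.array([0.8, 0.3])) for _ in range(10000)]
print(np.mean(samples), np.tanh(0.8 * 1.0 + 0.3 * -1.0))
```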
Non-Ideal Program-Time Conservation in Charge Trap Flash for Deep Learning
Training deep neural networks (DNNs) is computationally intensive, but arrays of non-volatile memories like Charge Trap Flash (CTF) can accelerate DNN operations using in-memory computing. Specifically, the Resistive Processing Unit (RPU) architecture uses voltage-threshold programming by stochastically encoded pulse trains and analog memory features to accelerate the vector-vector outer product and weight update of gradient descent algorithms. Although CTF, offering high precision, has been regarded as an excellent choice for implementing the RPU, the charge accumulated from the applied stochastic pulse trains is ultimately of critical significance in determining the final weight update. In this paper, we report the non-ideal program-time conservation in CTF through pulsed input measurements. We experimentally measure the effect of pulse width and pulse gap while keeping the total ON-time of the input pulse train constant, and report three non-idealities: (1) the cumulative V_T shift reduces when the total ON-time is fragmented into a larger number of shorter pulses, (2) the cumulative V_T shift drops abruptly for pulse widths < 2 µs, and (3) the cumulative V_T shift depends on the gap between consecutive pulses, and the reduction is recovered for smaller gaps. We explain these non-idealities by a transient tunneling-field enhancement caused by blocking-oxide trap-charge dynamics. Identifying and modeling the responsible mechanisms and predicting their system-level effects during learning is critical. This non-ideal accumulation is expected to affect algorithms and architectures that rely on such devices to implement mathematically equivalent functions for in-memory-computing-based acceleration.
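For context, the idealized weight update that such stochastic pulse trains are meant to implement (following the RPU scheme referenced above) can be sketched as follows. The bit length, dw_min, and input scaling are illustrative assumptions, and the CTF non-idealities reported in the abstract are deliberately not modeled here.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_outer_update(x, delta, bit_length=10, dw_min=0.001):
    """Idealized RPU-style weight update via stochastic pulse coincidence.

    x and delta (assumed scaled to [0, 1]) are encoded as Bernoulli pulse
    trains; a weight cell increments by dw_min whenever pulses on its row
    and column coincide, so E[dW] ~= bit_length * dw_min * outer(delta, x).
    Real CTF cells deviate from this ideal additive accumulation when the
    ON-time is fragmented into many short pulses, which is the non-ideality
    reported above.
    """
    x_pulses = (rng.random((bit_length, x.size)) < x).astype(float)          # (BL, n_in)
    d_pulses = (rng.random((bit_length, delta.size)) < delta).astype(float)  # (BL, n_out)
    coincidences = np.einsum('bi,bj->ji', x_pulses, d_pulses)                # (n_out, n_in)
    return dw_min * coincidences

x = np.array([0.2, 0.9])
delta = np.array([0.5, 0.1, 0.7])
print(stochastic_outer_update(x, delta, bit_length=1000))
print(1000 * 0.001 * np.outer(delta, x))   # expected value for comparison
```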
Training LSTM Networks with Resistive Cross-Point Devices
In our previous work we have shown that resistive cross-point devices, so-called Resistive Processing Unit (RPU) devices, can provide significant power and speed benefits when training deep fully connected networks as well as convolutional neural networks. In this work, we further extend the RPU concept to training recurrent neural networks (RNNs), namely LSTMs. We show that the mapping of recurrent layers is very similar to the mapping of fully connected layers, and therefore the RPU concept can potentially provide large acceleration factors for RNNs as well. In addition, we study the effect of various device imperfections and system parameters on training performance. Symmetry of updates becomes even more crucial for RNNs; already a few percent asymmetry results in an increase in the test error compared to the ideal case trained with floating-point numbers. Furthermore, the input signal resolution to the device arrays needs to be at least 7 bits for successful training. However, we show that a stochastic rounding scheme can reduce the required input signal resolution to 5 bits. Further, we find that RPU device variations and hardware noise are enough to mitigate overfitting, so that there is less need for dropout. We note that the models trained here are roughly 1500 times larger than the fully connected network trained on the MNIST dataset in terms of the total number of multiplication and summation operations performed per epoch. Thus, here we attempt to study the validity of the RPU approach for large-scale networks.
Comment: 17 pages, 5 figures
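A minimal sketch of stochastic rounding, which preserves information below the quantization step in expectation and is presumably why it relaxes the input-resolution requirement mentioned above. The grid, value range, and function names are illustrative, not the paper's actual scheme.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(x, bits, stochastic=False):
    """Quantize values in [-1, 1] to a signed fixed-point grid of `bits` bits.

    With stochastic=True the value is rounded up or down at random with
    probability proportional to its distance from the two neighboring grid
    points, so the quantization is unbiased in expectation.
    """
    levels = 2 ** (bits - 1)                 # grid step is 1 / levels
    scaled = x * levels
    if stochastic:
        rounded = np.floor(scaled + rng.random(np.shape(scaled)))
    else:
        rounded = np.round(scaled)
    return np.clip(rounded, -levels, levels - 1) / levels

x = np.linspace(-0.1, 0.1, 5)
print(quantize(x, bits=5))                                              # deterministic rounding
print(np.mean([quantize(x, 5, stochastic=True) for _ in range(10000)], axis=0))  # ~x on average
```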
Mixed-precision deep learning based on computational memory
Deep neural networks (DNNs) have revolutionized the field of artificial
intelligence and have achieved unprecedented success in cognitive tasks such as
image and speech recognition. Training of large DNNs, however, is
computationally intensive and this has motivated the search for novel computing
architectures targeting this application. A computational memory unit with
nanoscale resistive memory devices organized in crossbar arrays could store the
synaptic weights in their conductance states and perform the expensive weighted
summations in place in a non-von Neumann manner. However, updating the
conductance states in a reliable manner during the weight update process is a
fundamental challenge that limits the training accuracy of such an
implementation. Here, we propose a mixed-precision architecture that combines a
computational memory unit performing the weighted summations and imprecise
conductance updates with a digital processing unit that accumulates the weight
updates in high precision. A combined hardware/software training experiment of
a multilayer perceptron based on the proposed architecture using a phase-change
memory (PCM) array achieves 97.73% test accuracy on the task of classifying
handwritten digits (based on the MNIST dataset), within 0.6% of the software
baseline. The architecture is further evaluated using accurate behavioral
models of PCM on a wide class of networks, namely convolutional neural
networks, long short-term memory networks, and generative adversarial networks.
Accuracies comparable to those of floating-point implementations are achieved
without being constrained by the non-idealities associated with the PCM
devices. A system-level study demonstrates 173x improvement in energy
efficiency of the architecture when used for training a multilayer perceptron
compared with a dedicated fully digital 32-bit implementation.
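A rough sketch of the accumulate-then-program idea described above: weight updates are gathered in a high-precision digital variable and only transferred to the noisy analog weight once they exceed the device update granularity. The granularity and noise numbers are placeholders, not values from the PCM experiments.

```python
import numpy as np

rng = np.random.default_rng(3)

class MixedPrecisionLayer:
    """Sketch of a mixed-precision weight-update scheme (illustrative parameters)."""

    def __init__(self, n_out, n_in, epsilon=0.01, update_noise=0.3):
        self.W = rng.normal(0, 0.1, (n_out, n_in))   # analog conductance-based weights
        self.chi = np.zeros((n_out, n_in))            # high-precision digital accumulator
        self.epsilon = epsilon                         # device update granularity (assumed)
        self.update_noise = update_noise               # relative update imprecision (assumed)

    def forward(self, x):
        # In hardware this weighted summation happens in place in the crossbar array.
        return self.W @ x

    def accumulate_and_program(self, grad, lr=0.1):
        self.chi -= lr * grad                          # accumulate updates in high precision
        n_pulses = np.trunc(self.chi / self.epsilon)   # whole update quanta to program
        intended = n_pulses * self.epsilon
        noisy = intended * (1 + self.update_noise * rng.standard_normal(self.W.shape))
        self.W += noisy                                # imprecise analog device update
        self.chi -= intended                           # remove the programmed portion

layer = MixedPrecisionLayer(3, 2)
grad = np.full((3, 2), 0.05)
layer.accumulate_and_program(grad)   # below granularity: nothing programmed yet
layer.accumulate_and_program(grad)   # accumulated update now crosses epsilon
print(layer.W)
print(layer.chi)
```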
Analog CMOS-based Resistive Processing Unit for Deep Neural Network Training
Recently, we have shown that an architecture based on resistive processing unit (RPU) devices has the potential to achieve significant acceleration in deep neural network (DNN) training compared to today's software-based DNN implementations running on CPU/GPU. However, currently available device candidates based on non-volatile memory technologies do not satisfy all the requirements to realize the RPU concept. Here, we propose an analog CMOS-based RPU design (CMOS RPU) which can store and process data locally and can be operated in a massively parallel manner. We analyze various properties of the CMOS RPU to evaluate the functionality and feasibility for acceleration of DNN training.
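As a behavioral illustration of what storing and processing data locally in a resistive array means, the sketch below computes a weighted summation as the column currents of a conductance crossbar. The differential-pair encoding and the read-noise level are generic assumptions, not details of the proposed CMOS RPU cell.

```python
import numpy as np

rng = np.random.default_rng(4)

def crossbar_matvec(g_plus, g_minus, v_in, read_noise=0.01):
    """Behavioral sketch of an analog crossbar multiply-accumulate.

    A signed weight is stored as a pair of conductances (g_plus - g_minus);
    applying the inputs as voltages and summing the column currents yields
    the weighted sum in one parallel step (Ohm's and Kirchhoff's laws).
    The read-noise level is an illustrative assumption.
    """
    currents = (g_plus - g_minus) @ v_in
    return currents + read_noise * rng.standard_normal(currents.shape)

g_plus = rng.uniform(0.0, 1.0, (3, 4))
g_minus = rng.uniform(0.0, 1.0, (3, 4))
v = np.array([0.1, -0.2, 0.3, 0.0])
print(crossbar_matvec(g_plus, g_minus, v))
print((g_plus - g_minus) @ v)   # noise-free reference
```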