A binary-activation, multi-level weight RNN and training algorithm for
  processing-in-memory inference with eNVM by Ma, Siming et al.
Siming Ma
Harvard University
simingma@g.harvard.edu
David Brooks
Harvard University
dbrooks@eecs.harvard.edu
Gu-Yeon Wei
Harvard University
gywei@g.harvard.edu
ABSTRACT
We present a new algorithm for training neural networks with
binary activations and multi-level weights, which enables efficient
processing-in-memory circuits with eNVM. Binary activations obvi-
ate costly DACs and ADCs. Multi-level weights leverage multi-level
eNVM cells. Compared with previous quantization algorithms, our
method not only works for feed-forward networks including fully-
connected and convolutional, but also achieves higher accuracy and
noise resilience for recurrent networks. In particular, we present
an RNN trigger-word detection PIM accelerator, whose modeling
results demonstrate high performance using our new training algorithm.
1 INTRODUCTION
Processing-in-memory (PIM) architectures for hardware neural
network (NN) inference have gained increasing traction as they
solve the memory bottleneck of traditional von Neumann archi-
tectures [24]. While PIM architectures apply to a variety of mem-
ories, including SRAM and eDRAM, it is especially advantageous
to use emerging embedded non-volatile memories (eNVM), due to
their higher storage density and multi-level-cell (MLC) capability
[25][2][15]. Moreover, their non-volatility and power efficiency
are especially well suited for inference tasks that require relatively
fixed NN parameters. PIM with eNVMs not only avoids high-energy,
long-latency, off-chip DRAM accesses by densely storing all NN
parameters on-chip, but it also minimizes inefficient on-chip data
movement and intermediate data generation by embedding the crit-
ical multiplication and accumulation (MAC) computations within
the memory arrays.
The resulting highly-efficient MAC computations within the
memory arrays are, however, analog in nature, which poses a few
challenges. First, analog computations are vulnerable to noise from
the memory devices and circuits, mandating sufficient noise re-
silience of the NN models. Second and more importantly, typical
NNs use high-precision neuron activations that require DACs and
ADCs (for feeding inputs to and resolving MAC outputs from the
memory arrays, respectively), which introduce significant area,
power and latency overhead. The die photo in [29] shows that the
5-bit input DACs occupy more than one third of the area of the
SRAM PIM array, which would translate to even greater relative overhead if
the SRAM were replaced by a much denser eNVM. A thorough
design space exploration of ReRAM PIM architecture in [24] shows
the optimal configuration is to share one 8-bit ADC among all the
128 columns of a 128x128 memristor array. Yet, this single ADC
still occupies 48 times the area and consumes 6.7 times the energy
of the entire memristor array. In fact, due to the prohibitively high
cost of ADCs and DACs, many PIM designs resort to feeding and/or
resolving activations 1 bit at a time in a sequential fashion, but this
results in huge latency penalties [24][26]. Such high overhead from
ADCs and DACs completely defeats the original goal of attaining
speed, power, and area efficiency by using PIM. In addition, the
weights of conventional NNs usually require higher resolution (e.g.,
8 bits) than are available in typical MLCs (e.g., 2 or 3 bits), which re-
quires each weight to be implemented by combining many memory
cells and compromises area efficiency [24].
Reducing the bit precision of activations and weights can miti-
gate DAC and ADC overhead and reduce the number of cells needed
for each weight. Hence, some PIM designs implement binary neural
networks (BNN), with both binary weights and binary activations,
obviating DACs and ADCs entirely, and only needing one single-
level cell (SLC) per weight [12]. However, BNNs are not optimal for
two reasons. First, the most popular BNN training algorithms use
the straight-through estimator (STE) to get around the non-differentiability
of BNNs during backpropagation [3][9][20]. However, as
we will show in Sections 4 and 5, STE for binarizing activations is
only effective for training feedforward NNs, such as fully-connected
(FC) and convolutional NNs (CNN), but works poorly for recurrent
NNs (RNN). Second, binary weights in BNNs are too stringent to
maintain high inference accuracy, and do not take full advantage
of the MLC capabilities of eNVMs.
Our major contributions are:
• To enable the ideal DAC/ADC-free PIM architecture with
dense MLC eNVMs, we design a new algorithm to train NNs
with binary activations (BA) that avoid DACs and ADCs, and
multi-level weights (MLW) that take full advantage of dense
MLCs. This algorithm achieves higher inference accuracy
than using STE for training BA-MLW RNNs.
• We design a DAC/ADC-free trigger word detection PIM ac-
celerator with MLC eNVM, using a BA-MLW gated recurrent
unit (GRU), which demonstrates our algorithm’s high perfor-
mance on RNNs. We build a noise model of the accelerator
using eNVM measurement data and circuit simulations, and
verify superior accuracy and noise resilience of our BA-MLW
GRU compared with alternative algorithms.
• We also apply our algorithm to LeNet5 for MNIST, which
shows the effectiveness of our BA-MLW algorithm on feedforward
networks.
2 BACKGROUND
MLC eNVM for PIM. Before discussing the PIM architecture, it
is necessary to introduce the technologies available for its critical
building block – the memory array. Although the memory array can
be built with conventional SRAM, it is more advantageous to use
eNVM, including traditional embedded Flash (eFlash) [7], or emerg-
ing resistive RAM (ReRAM) [25] and phase change memory (PCM)
[2], or more recently, the purely-CMOS MLC eNVM (CMOS-MLC)
arXiv:1912.00106v2 [cs.LG] 3 Dec 2019
Siming Ma, David Brooks, and Gu-Yeon Wei
[Figure 1 image: three PIM array schematics, each computing Y_j = Σ_i X_i · W_i,j along the bitlines. (a) WL-DACs drive the inputs X_i and BL-ADCs resolve Y_j before Ỹ_j = f(Y_j); each high-resolution weight may use a combination of many cells. (b) WL-drivers feed 1/0 inputs and BL sense-amps (comparators) output Ỹ_j; each binary weight (-1/+1) uses an SLC. (c) WL-drivers feed 1/0 inputs and BL sense-amps (comparators) output Ỹ_j; each multi-level weight W_i,j uses a pair of MLCs (W+_i,j and W−_i,j).]
Figure 1: (a) The conventional PIM architecture with DACs and ADCs for high-precision activations and a combination of
many cells for each high-resolution weight, (b) a DAC/ADC-free PIM architecture implementing BNNs, using SLCs for binary
weights, and (c) the optimal DAC/ADC-free PIM architecture implementing our BA-MLW NNs, using a pair of MLCs for each
MLW.
[15]. Compared with SRAM, which is inherently binary (single-level
cells, SLC), eNVM SLCs offer much higher area efficiency, and
eNVM is often analog in nature, which enables MLC capability for
even higher storage density. Programming eNVM typically involves
a continuous change in the conductivity of the memory devices,
enabling them to store multiple levels of transistor channel current
in the cases of eFlash and CMOS-MLC, or multiple levels of resistor
conductivity in the cases of ReRAM and PCM. The programming
speed of eNVM is much slower than that of SRAM, but the NN parameters
are written only once and held constant during inference, rendering
programming speed non-critical for inference-only applications.
In fact, the non-volatility of eNVMs offers energy savings and avoids
the need to reload the weights at power-up.
PIM for NN inference. The conventional PIM architecture for NN
inference is shown in Figure 1a [24]. A weight matrix of the NN is
directly mapped into the memory array. Since the weights of typical NNs
usually require high resolution (e.g., ≥8 bits), multiple
lower-resolution memory cells are combined to represent one
weight, at the expense of area density [24]. This PIM structure can
perform a matrix-vector multiplication in one step: all the input
activations are fed into the wordlines (WLs) simultaneously as voltage signals from
the WL-DACs, generating a current through each memory cell that
is proportional to the product of its input and conductance. The
MAC results are accumulated along the corresponding bitlines (BLs) and
resolved by column ADCs before being sent to the nonlinearity function
unit. For implementing conventional NNs with high-precision
activations, the DACs and ADCs are necessary for translating input
activations into analog voltage signals and digitizing BL currents
into MAC outputs. However, as discussed in Section 1, these DACs
and ADCs result in huge area and power overhead.
Quantization and BNN. NN quantization algorithms have been
proposed to reduce the bit widths of weights and/or activations
while maximizing accuracy [9][20][14], originally with the goal of
saving storage and computation. For PIM implementations, quantization
algorithms can also relax the ADC/DAC resolution requirements
for activations. In particular, BNNs with 1-bit activations and
weights translate into the PIM circuits in Figure 1b, which are
much simpler than the conventional ones in Figure 1a: since the activations are
binary, the WL-DACs and BL-ADCs in Figure 1a can be replaced by
simple WL-drivers and sense-amp comparators, respectively, both
of which are compact peripheral components in standard memories.
However, PIM implementations of BNNs have two major problems.
First, as shown in Figure 1b, the binary weights can only use eNVM
cells as 1-bit SLCs, wasting the MLC capability. Second, the most
popular existing algorithm for training binary activations uses STE
[3][9] (Section 3.1.1), which is only effective for feedforward NNs,
but performs poorly on RNNs.
3 TRAINING BA-MLW NNS FOR OPTIMAL
PIM IMPLEMENTATION
To avoid DACs and ADCs and fully leverage MLCs, we identify the
BA-MLW NN structure as optimal for PIM implementations,
shown in Figure 1c. For simplicity, we show the memory devices as
resistor cells (omitting access transistors), corresponding to ReRAM
or PCM, with their conductivities encoding the weights; while they
could easily be replaced by transistor cells, corresponding to eFlash
or CMOS-MLC, whose channel currents encode the weights. The
current differential of a pair of MLCs is used to represent one
weight that could be either negative or positive. BAs only require
WL-drivers and sense-amps instead of DACs and ADCs, and each
MLW only uses one pair of MLCs for maximal storage density. To
effectively train BA-MLW NNs, we develop a new algorithm that
achieves high accuracy and resilience to quantization and noise not
only for feedforward NNs, but also for RNNs.
3.1 Training binary activations (BA)
Binarizing the activations while maintaining high performance is
challenging, because it not only restricts the expressive capacity
of the neurons, but also introduces discrete computation nodes
that preclude gradient propagation during training. We first review
the previous STE algorithm, and then present our new and more
effective BA training algorithm.
3.1.1 Reviewing STE. STE applies to the following stochastic bi-
nary neuron (SBN) [3]. During forward propagation of training,
each neuron generates a binary output from a Bernoulli sample:
$$x_{SBN} = \begin{cases} 1, & \text{with probability } p = \mathrm{sigmoid}(s \cdot x) \\ 0, & \text{with probability } 1 - p \end{cases} \tag{1}$$
in which x is a pre-activation from the linear MAC, and the logistic
sigmoid function has a tunable slope s [5]. The SBN function is
discrete with random sampling, and thus does not have a well
defined gradient, but STE simply passes through the gradient of
the continuous sigmoid function during backpropagation:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial x_{SBN}} \cdot \mathrm{sigmoid}'(s \cdot x) \tag{2}$$
in which L is the loss function. In other words, STE ignores the random
discrete sampling process and pretends that the forward propagation
is done with a continuous sigmoid function. The issue with STE is
that propagating gradients w.r.t. the sample-independent mean
(E[x_SBN] = sigmoid(s · x)) while ignoring the random sampling outcome
can cause discrepancies between the forward and backward passes
[10]. In fact, STE is a biased estimator of the expected gradient, and
cannot even guarantee the correct sign when back-propagating
through multiple hidden layers [3]. Nonetheless, STE has been found to
work better in practice than other, more complicated gradient
estimators for feedforward NNs [3], as we also verify in Section
5. However, as shown in Section 4, STE performs poorly when
training RNNs with BAs.
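As an illustration of Equations 1 and 2, the following is a minimal NumPy sketch of the SBN forward pass and the STE backward pass. The function names (`sbn_forward`, `ste_backward`) and the example inputs are hypothetical and chosen only for this sketch; the gradient follows Equation 2 as written, using sigmoid′(z) = sigmoid(z)(1 − sigmoid(z)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sbn_forward(x, s, rng):
    """Equation 1: Bernoulli sample with probability p = sigmoid(s * x)."""
    p = sigmoid(s * x)
    x_sbn = (rng.random(x.shape) < p).astype(x.dtype)
    return x_sbn, p

def ste_backward(grad_out, p):
    """Equation 2: multiply the upstream gradient by sigmoid'(s * x) = p * (1 - p),
    ignoring the random sampling performed in the forward pass."""
    return grad_out * p * (1.0 - p)

# Hypothetical example inputs (not from the paper).
rng = np.random.default_rng(0)
x = np.array([-2.0, 0.0, 2.0])
x_sbn, p = sbn_forward(x, s=1.0, rng=rng)
grad_x = ste_backward(np.ones_like(x), p)
```

The backward pass depends only on p, not on the sampled x_sbn, which is the forward/backward discrepancy described above.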
3.1.2 Our noisy neuron annealing (NNA) algorithm. We use the
following noisy continuous neuron (NCN) function during the
forward pass of training:
$$x_{NCN} = \mathrm{sigmoid}\left(\frac{x + n_{train}}{\tau}\right) \tag{3}$$
in which we add an iid zero-mean Gaussian random variable (RV)
n_train ∼ N(0, σ_train²) to each pre-activation before passing it into a
continuous sigmoid function with temperature τ. Equation 3 can
be broken down into two steps: the noise injection step x̃ = x +
n_train, and a continuous relaxation step x_NCN = sigmoid(x̃/τ). The
idea behind the noise injection step is that binarizing
activations and quantizing weights introduces quantization
errors that flow forward through the MAC and appear as random
noise added to the pre-activations. Therefore, if we train the NN with
noise explicitly added to the pre-activations, the NN develops
resilience to these quantization errors. The continuous relaxation
step is inspired by the Gumbel-softmax trick [10][16], which uses a
sharpened sigmoid to approximate the binary step function while
still allowing smooth gradients to flow. As we will demonstrate
experimentally in Section 4, it is important to start from
a large value of the hyperparameter σ_train to train with large noise,
and later anneal it down to a small value, hence the name of our
training strategy: the “noisy neuron annealing” (NNA) algorithm.
Our additive noise is also related to that of the variational autoencoder: the
Gaussian noise distribution conforms to the “mean and variance”
form required by the re-parameterization trick [16], and has a
Gaussian gradient identity property [22] that allows the
order of taking the expectation and taking the derivative to be exchanged, such
that the gradients can flow without encountering random sampling
nodes. Combined with the continuous relaxation step, backpropagation
through the entire NCN function can be done legitimately
Table 1: An example of mapping 7-level weights into the current magnitudes of a pair of 4-level cells. I_fs is the full-scale
current that corresponds to α, the weight clipping scale factor. I−cell and I+cell are the negative and positive cell currents.
| weight | −α | −(2/3)α | −(1/3)α | 0 | (1/3)α | (2/3)α | α |
| I−cell | I_fs | (2/3)I_fs | (1/3)I_fs | 0 | 0 | 0 | 0 |
| I+cell | 0 | 0 | 0 | 0 | (1/3)I_fs | (2/3)I_fs | I_fs |
without encountering any discrete or sampling node:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial x_{NCN}} \cdot \mathrm{sigmoid}'\left(\frac{x + n_{train}}{\tau}\right) \tag{4}$$
We use the following noisy binary neuron (NBN) for inference:
$$x_{NBN} = \begin{cases} 1, & \text{if } x + n_{eval} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
which also has an additive iid Gaussian RV n_eval ∼ N(0, σ_eval²),
but replaces the continuous sigmoid in NCN with a discrete step
function. In the experiments of Sections 4 and 5, we evaluate the
noise resilience of trained NNs by sweeping σ_eval.
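The NCN (Equation 3), its gradient (Equation 4), and the NBN (Equation 5) can be sketched in NumPy as follows. This is an illustrative sketch rather than the training code used in the experiments; the function names are hypothetical, and the example values σ_train = 1.6 and τ = 0.3 are the ones reported for the GRU in Section 4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ncn_forward(x, sigma_train, tau, rng):
    """Equation 3: inject Gaussian noise, then apply a sigmoid with temperature tau."""
    n_train = rng.normal(0.0, sigma_train, size=x.shape)
    z = (x + n_train) / tau
    return sigmoid(z), z

def ncn_backward(grad_out, z):
    """Equation 4 as written: upstream gradient times sigmoid'(z).
    Note: an automatic-differentiation framework would also include the
    constant chain-rule factor 1/tau."""
    sig = sigmoid(z)
    return grad_out * sig * (1.0 - sig)

def nbn_forward(x, sigma_eval, rng):
    """Equation 5: inject Gaussian noise, then apply a hard step at zero."""
    n_eval = rng.normal(0.0, sigma_eval, size=x.shape)
    return (x + n_eval > 0).astype(x.dtype)

# Hypothetical pre-activations for illustration.
rng = np.random.default_rng(0)
x = np.array([-3.0, -0.2, 0.1, 4.0])
x_ncn, z = ncn_forward(x, sigma_train=1.6, tau=0.3, rng=rng)
grad_x = ncn_backward(np.ones_like(x), z)
x_nbn = nbn_forward(x, sigma_eval=0.5, rng=rng)
```

Because the noise sample enters only as an additive input to a smooth function, the backward pass never has to differentiate through a sampling or thresholding node.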
Previous work has studied the regularization effect of noise
injection regarding its impact on NNs’ generalization and noise
resilience [17][11][1][21]. They use a Taylor expansion on the loss
function to derive that the effect of adding Gaussian noise is to add
an extra regularization penalty term to the original loss function L,
such that the effective loss becomes:
$$\tilde{L} = L + P = L + \frac{1}{2}\sigma_{train}^2 \sum_i \left(\frac{\partial L}{\partial x_i}\right)^2 \tag{6}$$
in which xi refers to a certain noise-injected node, and in our
case, xi includes all the pre-activations. The regularization term P
penalizes large gradients of L w.r.t. noise-injected nodes, encour-
aging these nodes to find “flatter regions” of the solution space
that are less sensitive to noise perturbations, and hyperparame-
ter σtrain controls the tradeoff between reducing the raw error L
and enhancing noise resilience. Specifically, for NCN activations to
counteract the noise, they tend to give up the highly-expressive but
noise-prone transition region of the sigmoid, and develop a bimodal
pre-activation distribution to push them into the saturated regions
close to 1 or 0 that are highly immune to noise [23]. Equation 6 also
provides a quantitative metric to estimate the noise resilience an NN
acquires during training, and we derive a detailed form of
this penalty term in Section 4 for comparing NNs.
3.2 Training multi-level weights (MLW)
Our NNA algorithm not only endows the NN with high resilience
to binarizing activations, but also enables MLWs to leverage dense
MLCs with high inference accuracy. Each weight can be quantized
down to a small number of levels that can be encoded with one
pair of MLCs (Figure 1c), as opposed to combining more
memory cells for high-resolution weights (Figure 1a).
To quantize MLWs from full-precision (FP) weights, we first
determine a suitable clipping range [−α ,α] for each weight matrix
based on the weight distribution statistics from pre-training with
FP weights. Then during fine-tuning with weight quantizations,
Figure 2: (a) shows the BA-MLW GRU architecture, whose input FC and 2 GRU layers use BAs, (b) shows the PIM implemen-
tation of a GRU layer, detailing the MLC array and GRU logic circuitry, and (c) is the noise model of the PIM circuits.
we first clip each weight matrix into [−α ,α], then quantize the
weights into evenly spaced levels within this range. We follow the
same practice as [9] for training — we use the quantized weights
in the forward pass, but still keep the FP weights and accumulate
gradients onto FP weights in the backward pass. After training
is done, the FP weights can be discarded, and only the quantized
weights are used in inference. For the special cases of 3 and 2-level
weights, we use the training algorithm in [14] for 3-level (ternary)
and [20] for 2-level (binary) weights.
Note that the matrix-wise scale factors α are only relevant during
training, since they impact the gradient values, but when we use
a binary step activation function for inference, only the relative
values of weight levels matter. So for mapping the weight matrices
into PIM arrays’ current magnitudes, we can re-scale the weights
into suitable MLC current magnitudes so long as their relative ratios
are maintained. Table 1 shows the example of using the current
differential of a pair of 4-level (2-bit) MLCs to represent a 7-level
weight. Following the same principle, a 15-level weight can be
encoded with a pair of 8-level (3-bit) MLCs, and a 3-level weight
can use a pair of binary cells (1-bit, SLC).
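The clipping, uniform quantization, and cell-pair mapping described above can be sketched as follows. This is a minimal illustration assuming an odd number of levels; the function names, the example weights, and the normalized full-scale current `i_fs=1.0` are hypothetical. Rounding uses `np.rint`, which is one possible choice and is not specified in the text.

```python
import numpy as np

def quantize_weights(w_fp, alpha, n_levels=7):
    """Clip to [-alpha, alpha], then round to n_levels evenly spaced levels.
    Returns the quantized weights and their integer level indices."""
    half = (n_levels - 1) // 2                      # 3 for 7-level weights
    k = np.rint(np.clip(w_fp, -alpha, alpha) / alpha * half).astype(int)
    return k * alpha / half, k

def map_to_cell_pair(k, half, i_fs=1.0):
    """Table 1: a positive level is stored in I+cell, a negative one in I-cell,
    so that the current differential I+ - I- is proportional to the weight."""
    i_pos = np.maximum(k, 0) / half * i_fs
    i_neg = np.maximum(-k, 0) / half * i_fs
    return i_pos, i_neg

# Hypothetical FP weights and clipping scale for illustration.
w_fp = np.array([-1.2, -0.3, 0.0, 0.35, 0.9])
w_q, k = quantize_weights(w_fp, alpha=0.9, n_levels=7)
i_pos, i_neg = map_to_cell_pair(k, half=3)
```

Each cell in the pair takes one of four values (0, 1/3, 2/3, 1 times the full-scale current), which matches a 4-level (2-bit) MLC.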
4 A TRIGGER WORD DETECTION PIM
ACCELERATOR USING BA-MLW GRU
WITH MLC ENVM
Trigger word detection is an important always-on task in speech
activated edge devices, for which power and cost efficiency is para-
mount. We use the Speech Commands dataset [27], which consists
of over 105,000 audio clips of various words uttered by thousands
of different people, with a total of 12 classification categories: 10
designated keywords, silence, and unknown words.
The NN model structure. RNN is well suited to this speech recog-
nition task, and in particular, we use a 2-layer gated recurrent unit
(GRU, [4]) with BA and MLW, shown in Figure 2a. We first perform
FFT (window=16ms, stride=8ms) on the raw audio signals, and then
use MFCC to extract 40 coefficients per 8ms timestep. Each MFCC
vector passes through the input FC layer that encodes it into a 128
dimensional binary vector as the input to the first layer of a 2-layer
stacked GRU (both layers use 128 dimensional vectors). After the
GRU processes inputs from all the timesteps, the final timestep
output of the top layer is fed into an output FC layer followed by
12-way softmax to get a classification result. We use the following
version of GRU equations [18]:
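$$\tilde{G}^{l\langle t\rangle} = W_g^l \cdot [H^{l\langle t-1\rangle}, H^{l-1\langle t\rangle}] \tag{7}$$
$$G^{l\langle t\rangle} = f(\tilde{G}^{l\langle t\rangle}) \tag{8}$$
$$\tilde{C}^{l\langle t\rangle} = W_c^l \cdot [H^{l\langle t-1\rangle}, H^{l-1\langle t\rangle}] \tag{9}$$
$$C^{l\langle t\rangle} = f(\tilde{C}^{l\langle t\rangle}) \tag{10}$$
$$H^{l\langle t\rangle} = G^{l\langle t\rangle} \otimes H^{l\langle t-1\rangle} + (1 - G^{l\langle t\rangle}) \otimes C^{l\langle t\rangle} \tag{11}$$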
in which t denotes the timestep, l is the layer number, and the
gate G^{l<t>} (l=1, 2), candidate C^{l<t>} (l=1, 2), hidden state H^{l<t>}
(l=1, 2), and the input encoding H^{0<t>} are all 128-dimensional
activation vectors, trained using the NNA algorithm. The activation
function f refers to NCN (Equation 3) during training and NBN
(Equation 5) for evaluation, and is simply the binary step function
for PIM deployment (Figure 2a and 2b). Compared with the original
form of the GRU equations [4], we remove the reset gate since we find
it has minimal effect on the accuracy of this task, and this version
of the GRU equations greatly simplifies the circuit design (Figure 2b).
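One timestep of one layer of this simplified GRU (Equations 7 to 11) can be sketched as below. The function `gru_step` and the random weight matrices are hypothetical placeholders for illustration; the activation `f` is passed in so that the same code covers NCN, NBN, or the plain binary step.

```python
import numpy as np

def gru_step(h_prev, h_below, w_g, w_c, f):
    """One timestep of one layer, following Equations 7-11 (no reset gate)."""
    hx = np.concatenate([h_prev, h_below])   # [H^{l<t-1>}, H^{l-1<t>}]
    g = f(w_g @ hx)                          # Equations 7-8: gate
    c = f(w_c @ hx)                          # Equations 9-10: candidate
    return g * h_prev + (1.0 - g) * c        # Equation 11: new hidden state

def binary_step(z):
    """Binary step activation used for PIM deployment."""
    return (z > 0).astype(float)

# Hypothetical random weights and binary states, 128-dimensional as in the text.
rng = np.random.default_rng(0)
dim = 128
w_g = rng.normal(size=(dim, 2 * dim))
w_c = rng.normal(size=(dim, 2 * dim))
h_prev = binary_step(rng.normal(size=dim))
h_below = binary_step(rng.normal(size=dim))
h_new = gru_step(h_prev, h_below, w_g, w_c, binary_step)
```

With binary `g`, `c`, and `h_prev`, the new hidden state stays binary, so every signal crossing the layer boundary is a single bit.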
PIM circuit design for the GRU. The BA-MLW GRU equations
map naturally onto the PIM circuits shown in Figure
2b. After the MLC array finishes the MAC and resolves the binarized
G^{l<t>} and C^{l<t>} in the sense-amps, G^{l<t>} serves as the multiplexer
selection signal (since 1 − G^{l<t>} = !G^{l<t>} in Equation 11
for binary signals) to choose either to keep the binary hidden state
saved from the previous timestep, or to update it with the current
binary candidate state. There is no need for ADCs, DACs, or any
explicit analog signal processing for the GRU computations, since
all the analog MAC signals are encapsulated inside the MLC array
[Figure 3 plots: three panels (a), (b), (c); y-axis: inference accuracy (%); x-axis: evaluation noise σ_eval; includes a zoomed inset.]
Figure 3: Inference accuracy vs σ_eval of evaluation noise of GRUs trained with different methods. We run 100 Monte Carlo samples
for the random noise, and plot the mean accuracy surrounded by the ranges of mean±STD and max/min. (a) compares
4 networks trained through different stages, in which the baseline network is evaluated with both FP activations and BAs, (b)
compares different weight quantization levels, using the NNA algorithm, and (c) compares NNA with the STE algorithm. Except for
baseline-A_FP-W_FP, which uses FP activations, all the other curves are evaluated with BAs using NBN (Equation 5).
with all the input/output interface signals being binary, and the
GRU logic outside the array is totally digital.
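The equivalence between Equation 11 and a 2:1 multiplexer for binary signals can be checked exhaustively over random bit vectors. The sketch below is illustrative only; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical random binary gate, previous state, and candidate vectors.
rng = np.random.default_rng(1)
g = rng.integers(0, 2, size=128)
h_prev = rng.integers(0, 2, size=128)
c = rng.integers(0, 2, size=128)

h_arith = g * h_prev + (1 - g) * c        # Equation 11 evaluated arithmetically
h_mux = np.where(g == 1, h_prev, c)       # 2:1 multiplexer with g as the select signal
h_logic = (g & h_prev) | ((1 - g) & c)    # the same multiplexer in AND-OR gate form

mux_matches = np.array_equal(h_arith, h_mux) and np.array_equal(h_arith, h_logic)
```

When the gate bit is 1 the previous hidden state is kept, and when it is 0 the candidate is taken, so no multiplier or adder is needed outside the array.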
Hardware noise model. Figure 2c shows how we model the hardware
noise from the eNVM devices and the circuits. For the eNVM
devices, we consider ReRAM, PCM and CMOS-MLC, using the measured
data from [25], [2], and [15], respectively. Non-idealities of
the eNVM devices include the static distributions of different MLC
levels after programming, drift during retention, and random telegraph
noise (RTN) [6], all of which have much slower frequency
ranges than the intended circuit clock rate (>10∼100MHz), so we
lump these noise sources into a static error model of the MLC devices.
For modeling the circuit noise, we simulate with TSMC’s 16nm
FinFET process [28], and consider thermal and shot noise from
the memory array and load devices, which are wide-band white
noise sources manifesting as kT/C noise sampled at the sense-amp’s
input. We use the offset-canceling sense-amp from [12], and the
total sampling capacitance is estimated to be C_IN + C_BL ≈ 2 fF, contributing
kT/C noise with STD = 2 mV sampled at the differential
input of the sense-amp. Based on the NN algorithm’s statistics, the
pre-activation distributions span a range of about −30 to +30,
and we allocate a dynamic voltage range of −600 mV to +600 mV for
the sense-amp’s differential input to accommodate this algorithmic
range. Therefore, the 2 mV kT/C noise translates to dynamic algorithmic
noise of magnitude σ_kT/C = 0.1 added to the pre-activations. We
consider both the static errors from the eNVM devices and the dynamic
noise from the circuits when we verify the hardware performance
using our noise model.
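The conversion from circuit noise to algorithmic noise can be reproduced with a few lines of arithmetic. The temperature of 300 K and the factor of √2 for two uncorrelated nodes of a differential input are assumptions of this sketch, not values stated in the text; they are one plausible way to arrive at roughly 2 mV from a 2 fF sampling capacitance.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant in J/K
T = 300.0            # assumed temperature in kelvin (not stated in the text)
C = 2e-15            # total sampling capacitance C_IN + C_BL from the text, in farads

v_single = math.sqrt(K_B * T / C)    # kT/C noise STD on one sampled node, in volts
v_diff = math.sqrt(2.0) * v_single   # assumed: two uncorrelated nodes of a differential input

# +/-600 mV of sense-amp input range is allocated to pre-activations in +/-30.
volts_per_unit = 0.600 / 30.0
sigma_kTC = 2e-3 / volts_per_unit    # the text's 2 mV expressed in pre-activation units
```

Under these assumptions `v_diff` comes out close to 2 mV, and dividing 2 mV by the 20 mV allocated per algorithmic unit gives the σ_kT/C of about 0.1 quoted above.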
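Noise-induced loss penalty terms for GRU. Using the GRU equations,
we can derive the noise-induced regularization penalty term
P in Equation 6 by taking the derivatives of the loss L w.r.t. the
pre-activations of the candidates (C̃^{l<t>}) and gates (G̃^{l<t>}):
$$P = \frac{1}{2}\sigma_{train}^2 \sum_{i,t,l} \left(\frac{\partial L}{\partial h_i^{l\langle t\rangle}}\right)^2 \cdot \{P_g + P_c\} \tag{12}$$
$$P_g = \left[(h_i^{l\langle t-1\rangle} - c_i^{l\langle t\rangle}) \cdot \mathrm{sigmoid}'(\tilde{g}_i^{l\langle t\rangle})\right]^2 \tag{13}$$
$$P_c = \left[(1 - g_i^{l\langle t\rangle}) \cdot \mathrm{sigmoid}'(\tilde{c}_i^{l\langle t\rangle})\right]^2 \tag{14}$$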
Table 2: Loss penalty terms of the 4 NNs in Figure 3a.
| Network | FP baseline | NNA(σL) | NNA(σL→σS) | NNA(σS) |
| Pg+Pc | 1.0000 | 0.5034 | 0.5018 | 0.8272 |
in which a lowercase letter with subscript i=0∼127 denotes one
element of the corresponding uppercase vector, and P_g and P_c
regularize the gates and candidates, respectively. The forms of Equations
13 and 14 have intuitive interpretations when minimized
during training. To minimize P_g, one way is to reduce the derivative
of the gate, sigmoid′(g̃_i^{l<t>}), by pushing g̃_i^{l<t>} away from zero
(the steep-slope region of the sigmoid), so that the gate is firmly on
or firmly off; alternatively, training can make the candidate of the
current timestep c_i^{l<t>} equal to the hidden state of the previous
timestep h_i^{l<t−1>}, such that the new hidden state h_i^{l<t>} is
the same regardless of whether the gate g_i^{l<t>} is 1 or 0. Either way, minimizing
P_g makes the computed h_i^{l<t>} immune to noise
injected into g̃_i^{l<t>}. Similarly, minimizing P_c either reduces
the gradient of the candidate, sigmoid′(c̃_i^{l<t>}), by pushing c̃_i^{l<t>} into
the saturated flat regions of the sigmoid, or turns on the gate
g_i^{l<t>} to preserve the hidden state from the previous timestep
h_i^{l<t−1>} without regard to the new candidate c_i^{l<t>}. Either
way, it desensitizes h_i^{l<t>} to noise injected into c̃_i^{l<t>}. Equations 12,
13 and 14 provide a quantitative metric to assess the resilience
to quantization and noise in a trained GRU, and we use them for
comparing training schemes in the next section.
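The two per-element penalty factors in Equations 13 and 14 can be computed directly from the GRU quantities, which makes the interpretations above easy to check numerically. The function name and the three example elements are hypothetical; the first element has the candidate equal to the previous hidden state, and the third has a strongly saturated gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gru_penalty_terms(h_prev, c, g, g_pre, c_pre):
    """Per-element P_g (Equation 13) and P_c (Equation 14)."""
    p_g = ((h_prev - c) * sigmoid_prime(g_pre)) ** 2
    p_c = ((1.0 - g) * sigmoid_prime(c_pre)) ** 2
    return p_g, p_c

# Hypothetical values for three hidden units.
h_prev = np.array([1.0, 1.0, 0.0])
c = np.array([1.0, 0.0, 0.0])
g_pre = np.array([0.0, 0.0, 8.0])
c_pre = np.array([0.0, 0.0, 0.0])
g = sigmoid(g_pre)
p_g, p_c = gru_penalty_terms(h_prev, c, g, g_pre, c_pre)
```

In this example P_g vanishes for the first unit because the candidate matches the previous state, and P_c is nearly zero for the third unit because the gate is almost fully on.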
Training scheme comparisons. As a full-precision (FP) baseline,
we first train the NN with FP sigmoid activations and FP weights,
without noise injection or quantization. During inference, we add
noise n_eval ∼ N(0, σ_eval²) to its pre-activations and evaluate it with
both FP sigmoid activations (baseline-A_FP-W_FP, red) and binarized
activations (baseline-BA-W_FP, yellow), shown in Figure 3a. Trained
without noise injection, baseline-A_FP-W_FP cannot maintain its
high accuracy at large σ_eval, making it vulnerable to quantization
errors and leading to the poor performance of baseline-BA-W_FP.
To endow the NN with resilience to quantization errors, we use
our NNA algorithm and retrain from the FP sigmoid baseline (which
Figure 4: Modeled hardware inference accuracy of the trig-
ger word detection PIM accelerator using ReRAM, PCM, and
CMOS-MLC technologies, validated with the noise models
of memory devices and circuits (Figure 2c). We run each ex-
periment 100 times with different samples of static MLC de-
vice errors and dynamic circuit noise, and show the accuracy
distributions using the violin plots.
is a good initialization that speeds up retraining). Initially, we use a
large training noise σ_train = σ_L = 1.6, and plot the inference accuracy
with BAs and 7-level weights (NNA(σ_L)-BA-W7) in Figure 3a. It indeed
gains a much wider tolerance range of σ_eval, and thus stronger
resilience to quantization errors and higher accuracy than baseline-BA-W_FP.
However, the accuracy peaks at a large σ_eval around σ_L,
and drops at small σ_eval, because it is trained to minimize its loss in
the presence of this large additive noise. This noise-resilience profile
might suit certain severely noisy circuit environments, for example,
those with large power supply noise. However, to also achieve high accuracy
at small σ_eval, we anneal σ_train down to a small σ_train = σ_S and
further retrain the network. As shown in Figure 3a (NNA(σ_L→σ_S)-BA-W7),
peak accuracy further improves and is achieved at smaller σ_eval
as well. NNA’s annealing procedure from σ_L to σ_S is critical. As
shown in Figure 3a (NNA(σ_S)-BA-W7), directly retraining with σ_S
from the baseline without the intermediate σ_L stage results in much
worse accuracy and noise tolerance than NNA(σ_L→σ_S)-BA-W7.
To understand why the annealing procedure of NNA is critical,
we compare the loss penalty terms (P_g + P_c from Equations 13 and
14) of the 4 networks in Figure 3a. In Table 2, we normalize
them by the penalty value of baseline-A_FP-W_FP, since only
the relative values matter for comparison. After retraining with
σ_L, the penalty term of NNA(σ_L)-BA-W7 reduces to about half of the baseline
penalty, explaining its higher resilience to noise and quantization
errors. After further retraining with the annealed σ_S, NNA(σ_L→σ_S)-BA-W7
maintains this small penalty value, even though σ_S provides
a weaker regularization effect: the smaller multiplier σ_S² in Equation
12 (compared to the previous σ_L²) makes the retraining prioritize
reducing the raw error (thus higher peak accuracy) over
enhancing noise resilience. This means the network can still “memorize”
the regularization effect of its previous large-noise training even
after annealing to fine-tune with smaller noise. In contrast, without
the intermediate “experience” of large-noise training, NNA(σ_S)-BA-W7
has much less regularization effect to reduce its loss penalty.
It should also be pointed out that if evaluated with FP activations
and zero noise, all these networks achieve similarly high accuracy
as the purely FP sigmoid network, but their accuracy and noise
resilience are dramatically different after using binary activations.
This implies that trained through different stages, the 4 networks
in Figure 3a find distinct regions in the global solution space: re-
training with σL finds a promising solution region that’s insensitive
to quantization errors, while the further fine-tuning with σS only
does a local search to optimize accuracy in the small noise range; in
contrast, training without noise injection or only with small noise
will not discover the solution region that’s resilient to noise and
quantization due to lack of regularization penalty.
As was also mentioned in section 3.1.2, the reason that tolerance
to additive noise leads to resilience to quantization is that the effect
of binarizing activations and quantizing weights is to introduce
quantization errors in the NN computations. Based on central limit
theorum, during MAC, the summation of these quantization errors
(which are iid RVs) forms a Gaussian noise distribution added to
pre-activations. Therefore, if the NN is resilient to additive Gaussian
noise, it also tends to be resilient to quantization errors. Theoreti-
cally, after MAC, the effective standard deviation (STD) of noise due
to the summation of activation binarization errors and weight quan-
tization errors can be expressed as: σ 2A=
∑
i w
2
i σ
2
a=∥W ∥22 σ 2a , and
σ 2W =
∑
i x
2
i σ
2
w=∥X ∥22 σ 2w , respectively, in whichW , X , are weights
and inputs; σa and σw are the average quantization error STDs
of a single element of activation and weight, respectively. During
training with the NNA algorithm, the statistics of W, X, σa and σw
change dramatically to achieve noise resilience, and in practice, we
find the choice of the hyperparameter σtrain to be quite flexible. For
the initial large noise, we set σL to about 20% of the STD of
the inherent pre-activation distribution, corresponding to σL=1.6
for this GRU, though a wide range of values works similarly well;
for the annealed noise, we find that any σS in the range 0∼0.5 achieves
optimal final results. For the temperature τ in NCN, we find 0.3 to be
optimal: a smaller τ causes the exploding gradient problem in RNNs,
while a larger τ cannot approximate BAs well enough.
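The variance relation above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the per-element errors are iid, zero-mean and uniformly distributed (a common model for rounding errors), and the weight vector `w` and the value of `sigma_a` are hypothetical. It compares the measured STD of the summed errors with ∥W∥2 σa and checks that the sum looks approximately Gaussian.

```python
import numpy as np

def summed_quantization_errors(w, sigma_a, n_trials=10000, seed=0):
    """Return n_trials samples of sum_i w_i * e_i for iid zero-mean errors e_i."""
    rng = np.random.default_rng(seed)
    # A uniform error on [-h, h] has STD h / sqrt(3); choose h to match sigma_a.
    h = sigma_a * np.sqrt(3.0)
    errors = rng.uniform(-h, h, size=(n_trials, w.size))
    return errors @ w

# Hypothetical values, chosen only for illustration.
w = np.random.default_rng(1).normal(0.0, 0.1, size=256)
sigma_a = 0.3

sums = summed_quantization_errors(w, sigma_a)
predicted_std = np.linalg.norm(w) * sigma_a   # sqrt(||W||_2^2 * sigma_a^2)
measured_std = sums.std()
# For a Gaussian, about 68.3% of samples lie within one STD of the mean.
frac_within_1std = np.mean(np.abs(sums - sums.mean()) < measured_std)

print(f"predicted STD = {predicted_std:.4f}, measured STD = {measured_std:.4f}")
print(f"fraction within one STD = {frac_within_1std:.3f}")
```

The same check applies to the weight-error term by swapping the roles of the weights and the inputs.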
Weight quantization levels. Figure 3b shows GRU’s performance
using different numbers of weight quantization levels, which con-
firms the high resilience to weight quantization using our NNA
algorithm. The performances of 7-level (implemented with a pair
of 4-level cells) and 5-level (a pair of 3-level cells) quantizations
are almost the same as using FP weights, while 7-level weights
achieve the highest peak accuracy at the lower end of σeval range.
Ternary weights (a pair of 2-level cells) have degradation especially
in the large σeval range, but still maintain decent accuracy close
to FP weights in the small σeval range. Binary weights (which can be
implemented with an SRAM cell), however, suffer more serious accuracy
degradation in both low and high σeval ranges. Our NNA
algorithm’s high resilience to weight quantization makes it possible
to use a pair of 4 or 3-level MLCs to achieve FP weight performance,
or even 2-level cells, if the hardware noise is within the small σeval
range. This avoids the need to combine multiple cells to achieve
high-resolution weights (Figure 1a, [24]), which greatly saves area,
and simplifies the memory array and logic design.
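The level counts quoted above (7 levels from a pair of 4-level cells, 5 from 3-level cells, 3 from 2-level cells) are consistent with storing a signed weight as the difference of two cell states. The sketch below illustrates that reading under stated assumptions: uniform level spacing, a differential encoding in which one cell of the pair stays at state 0, and the hypothetical helper names `levels_from_cell_pair` and `quantize_to_cell_pair`. It is not taken from the paper's circuit description.

```python
import numpy as np

def levels_from_cell_pair(cell_levels):
    """Signed weight levels available from the difference of two cells
    that each have `cell_levels` states (assumed differential encoding)."""
    return 2 * (cell_levels - 1) + 1

def quantize_to_cell_pair(w, cell_levels, w_max):
    """Uniformly quantize weights in [-w_max, w_max] and split each one
    into (positive cell state, negative cell state)."""
    half = cell_levels - 1                    # largest integer magnitude
    q = np.rint(np.clip(w, -w_max, w_max) / w_max * half).astype(int)
    pos = np.maximum(q, 0)                    # state of the "positive" cell
    neg = np.maximum(-q, 0)                   # state of the "negative" cell
    return q, pos, neg

# Hypothetical weights, quantized for a pair of 4-level (2-bit) cells.
w = np.array([-1.0, -0.4, 0.0, 0.2, 1.0])
q, pos, neg = quantize_to_cell_pair(w, cell_levels=4, w_max=1.0)
print(levels_from_cell_pair(4), q, pos, neg)
```

In this encoding the quantized weight is always `pos - neg`, so no multi-cell bit slicing is needed to represent the sign.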
Our NNA algorithm vs STE. We also experiment with SBN trained
with STE, and compare its performance with our NNA algorithm
in Figure 3c. We try 3 different settings for slope s: s=1, s=3, and
annealing s from 1 to 3 (s=1→3, [5]). However, all the GRUs trained
[Figure 5 plots, panels (a), (b), (c): inference accuracy (%) vs. evaluation noise σeval]
Figure 5: Inference accuracy (mean±STD and max/min ranges over 100 Monte Carlo runs) vs. evaluation noise σeval
for LeNet5 trained with: (a) the FP baseline (evaluated with both FP activations and BAs), the NNA algorithm with
BAs, and the STE algorithm with BAs; and different weight quantization levels with (b) NNA and (c) STE.
with STE have significantly worse performance compared to using
NNA (NNA(σL→σS )-BA-W7), which shows STE’s ineffectiveness
for binarizing activations in recurrent nets, thus the necessity and
advantage of our new NNA algorithm.
We should also point out an important distinction between the
forms of SBN (Equation 1) and our NCN (Equation 3): SBN only has
one parameter s , whereas NCN has two degrees of freedom using τ
and σtrain . On one hand, SBN uses s to control the sigmoid slope
(corresponding to 1/τ in NCN), and, as with τ, we find s needs to be
no greater than about 3 to avoid the exploding gradient
problem. On the other hand, s also controls the stochasticity of
the Bernoulli sampling: a smaller s introduces more randomness
and thus a wider noise resilience range, as can be seen from Figure 3c,
comparing s=1, s=3 and s=1→3. However, SBN cannot separately
control the sigmoid slope and the stochasticity. In contrast, our
NCN has independent controls: τ is chosen to approximate binary
outputs while avoiding exploding gradients, whereas the magnitude
of noise injection is separately controlled by σtrain. This flexibility
enables us to effectively implement the NNA algorithm’s annealing
procedure using NCN.
From the standpoint of mathematical rigor, in contrast to the Gaussian
RVs used in NCN, the Bernoulli RVs used in SBN do not have
the “location-scale” form required for using the reparameterization
trick [16]. Therefore, it is mathematically invalid for
STE to exchange the order of taking the expectation and taking the
derivative for Bernoulli RVs (Equation 2). Increasing the slope s can
alleviate the discrepancy between the forward and backward passes
of SBN (making the math less wrong, which explains the higher
accuracy with s=3 in Figure 3c), but due to the lack of separate controls,
changing s inevitably changes both the sampling randomness
and the sigmoid’s gradient, and s cannot be made too large without causing
exploding gradients.
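For a location-scale RV the exchange of expectation and derivative is easy to verify. The sketch below is an illustration under assumptions, not part of the paper: it writes a Gaussian sample as z = μ + σε with ε ~ N(0, 1), uses the test function f(z) = z² (chosen because E[f] = μ² + σ² has closed-form gradients 2μ and 2σ), and uses hypothetical values for μ and σ.

```python
import numpy as np

def reparam_grads(mu, sigma, n_samples=200000, seed=0):
    """Monte Carlo gradients of E[z**2] w.r.t. mu and sigma, with z = mu + sigma * eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps               # location-scale form of the Gaussian sample
    df_dz = 2.0 * z                    # derivative of f(z) = z**2
    grad_mu = np.mean(df_dz)           # chain rule with dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)  # chain rule with dz/dsigma = eps
    return grad_mu, grad_sigma

mu, sigma = 0.7, 1.6                   # hypothetical parameter values
grad_mu, grad_sigma = reparam_grads(mu, sigma)
print(f"grad wrt mu:    estimate {grad_mu:.3f}, closed form {2 * mu:.3f}")
print(f"grad wrt sigma: estimate {grad_sigma:.3f}, closed form {2 * sigma:.3f}")
```

A Bernoulli sample cannot be written as a differentiable function of its parameter plus parameter-free noise in this way, which is the gap the paragraph above describes.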
Validation with the hardware noise model. We model the hardware
performance when deploying our NNA-trained BA-MLW GRU
on PIM circuits (Figure 2b) with PCM, ReRAM and CMOS-MLC,
using the noise model in Figure 2c. We evaluate the accuracy of
NNA(σL→σS)-BA-W7, with each weight implemented with a pair of 2-bit
(4-level) MLCs. With 100-point Monte Carlo simulations, Figure
4 shows the resulting accuracy distributions using violin plots. PCM,
ReRAM and CMOS-MLC achieve accuracy ranges of mean ± STD
= (91.60 ± 0.34)%, (91.71 ± 0.27)%, and (91.60 ± 0.29)%, respectively,
validating that PIM circuits with all three eNVM technologies can
achieve performance comparable to the algorithm accuracy (Figure
3) even in the presence of device and circuit non-idealities.
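The mean ± STD figures above come from repeating a noisy evaluation many times. The sketch below shows only the shape of such a Monte Carlo loop on a toy problem: the device noise model of Figure 2c is replaced by plain additive Gaussian noise on the pre-activations, the "score" is the fraction of binary activations that match the noise-free result rather than task accuracy, and the data, layer sizes and function names are all hypothetical.

```python
import numpy as np

def noisy_binary_activations(x, w, sigma_eval, rng):
    """Binary activations of one layer with Gaussian noise added to the pre-activations."""
    pre = x @ w
    pre = pre + rng.normal(0.0, sigma_eval, size=pre.shape)
    return pre > 0

def monte_carlo_agreement(x, w, sigma_eval, n_runs=100, seed=0):
    """Mean and STD, over n_runs noisy evaluations, of the fraction of
    binary activations that agree with the noise-free layer."""
    rng = np.random.default_rng(seed)
    clean = (x @ w) > 0
    scores = np.array([
        np.mean(noisy_binary_activations(x, w, sigma_eval, rng) == clean)
        for _ in range(n_runs)
    ])
    return scores.mean(), scores.std()

# Toy inputs and weights, for illustration only.
data_rng = np.random.default_rng(42)
x = data_rng.normal(size=(500, 64))
w = data_rng.normal(size=(64, 32))

for sigma_eval in (0.0, 1.0, 4.0):
    mean, std = monte_carlo_agreement(x, w, sigma_eval)
    print(f"sigma_eval={sigma_eval}: agreement = {100 * mean:.2f} ± {100 * std:.2f} %")
```

Replacing the toy score with a real accuracy function and the Gaussian term with a device-specific noise sampler would give the kind of distribution summarized in Figure 4.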
5 TRAINING FEEDFORWARD BA-MLW NN:
LENET5 FOR MNIST
To demonstrate the effectiveness of our NNA algorithm on feedforward
networks, we use the MNIST dataset and train BA-MLW networks
using the LeNet5 architecture that comprises 2 CNN layers followed
by 3 FC layers [13]. We compare the accuracy and noise resilience
of the FP baseline with our NNA algorithm and STE, and the re-
sults are shown in Figure 5. Both STE and NNA are resilient to
binarizing activations, and achieve peak accuracies comparable
to the FP network and tolerate a wide range of σeval , with STE
slightly outperforming NNA (Figure 5a). Both STE and NNA are
also resilient to weight quantization (Figure 5b and 5c), with no
loss of accuracy when quantizing weights down to 15 levels (a
pair of 8-level cells), but slight accuracy degradation with 7-level
weights. Compared with the GRU in Section 4, LeNet5 needs more
quantization levels due to its wider weight distribution ranges and
smaller numbers of parameters in the CNN layers (especially the
first CNN layer). Although not elaborated in this paper, layer-wise
customized choices of quantization levels could further optimize the
performance. These results verify that although STE works poorly
for training BAs in RNNs, it is indeed effective for feedforward
networks. Our NNA algorithm, on the other hand, works well
for both RNNs and feedforward networks.
6 RELATED WORK
Most previous quantization studies have focused on feedforward
networks, whereas quantizing RNNs turns out to be more
challenging: consistent with our results, quantization techniques
that work well for feedforward NNs (e.g., STE) have been found
to work poorly for RNNs [9]. Existing RNN quantization works
find that to maintain accuracy, more bits are required for RNNs
than for feedforward networks, especially for the activations: pre-
vious works either use FP activations [19], or need multiple bits
per activation [8][9], which would require costly DACs and ADCs
for PIM implementations. In contrast, our work not only quantizes
the weights but also binarizes activations of RNNs, enabling the
optimal BA-MLW RNN structure for efficient PIM implementations.
Our NNA algorithm is largely inspired by the reparameterization
trick [22] and the Gumbel-softmax trick [16][10]. Introduced in the
context of variational inference, the reparameterization trick reformulates
the sampling process of certain probability distributions
(e.g., those having a “location-scale” form), which allows the expected
gradient w.r.t. the parameters of these distributions to propagate.
Gumbel-softmax uses Gumbel RVs to attain an equivalent
sampling process from categorical distributions, and furthermore, it
uses a continuous relaxation trick to solve the gradient propagation
problem of sampling from discrete distributions. Prior works [17][11][1][21]
study the generalization effects of noise injection to NNs’ inputs,
weights, or activations. Additive Gaussian noise has also been used
for learning binary encodings of documents with a multi-layer
feedforward autoencoder [23]. Our paper differs from these
works in that we apply these techniques (noise injection and methods
of propagating gradients through stochastic sampling nodes)
to training BA-MLW RNNs for the optimal PIM circuit implementations.
Moreover, we discover the critical noise annealing procedure
in our NNA algorithm, and use the regularization penalty effects
of noise injection to explain why our new algorithm enables high
resilience to quantization and noise.
ACKNOWLEDGMENTS
This work was partially supported by the Applications Driving
Architectures (ADA) Research Center, a JUMP Center co-sponsored
by SRC and DARPA.
REFERENCES
[1] G. An. 1996. The effects of adding noise during backpropagation training on a
generalization performance. Neural computation 8, 3 (1996), 643–674.
[2] F. Bedeschi et al. 2008. A bipolar-selected phase change memory featuring
multi-level cell storage. IEEE JSSC 44, 1 (2008), 217–227.
[3] Y. Bengio et al. 2013. Estimating or propagating gradients through stochastic
neurons for conditional computation. arXiv:1308.3432 (2013).
[4] K. Cho et al. 2014. Learning phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv:1406.1078 (2014).
[5] J. Chung et al. 2016. Hierarchical multiscale recurrent neural networks.
arXiv:1609.01704 (2016).
[6] GF. Close et al. 2010. Device, circuit and system-level analysis of noise in multi-bit
phase-change memory. In IEDM. 29–5.
[7] L. Fick et al. 2017. Analog in-memory subthreshold deep neural network acceler-
ator. In IEEE CICC. 1–4.
[8] Q. He et al. 2016. Effective quantization methods for recurrent neural networks.
arXiv:1611.10176 (2016).
[9] I. Hubara et al. 2017. Quantized neural networks: Training neural networks with
low precision weights and activations. The Journal of Machine Learning Research
18, 1 (2017), 6869–6898.
[10] E. Jang et al. 2016. Categorical reparameterization with gumbel-softmax.
arXiv:1611.01144 (2016).
[11] K.-C. Jim et al. 1996. An analysis of noise in recurrent neural networks: con-
vergence and generalization. IEEE Transactions on neural networks 7, 6 (1996),
1424–1438.
[12] W.-S. Khwa et al. 2018. A 65nm 4Kb algorithm-dependent computing-in-memory
SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum
operation for binary DNN edge processors. In IEEE ISSCC. 496–498.
[13] Y. LeCun et al. 1998. Gradient-based learning applied to document recognition.
Proc. IEEE 86, 11 (1998), 2278–2324.
[14] F. Li et al. 2016. Ternary weight networks. arXiv:1605.04711 (2016).
[15] S. Ma et al. 2019. Fully-CMOS Multi-Level Embedded Non-Volatile Memory De-
vices With Reliable Long-Term Retention for Efficient Storage of Neural Network
Weights. IEEE Electron Device Letters 40, 9 (2019), 1403–1406.
[16] C. J. Maddison et al. 2016. The concrete distribution: A continuous relaxation of
discrete random variables. arXiv:1611.00712 (2016).
[17] A. Murray et al. 1994. Enhanced MLP performance and fault tolerance resulting
from synaptic weight noise during training. IEEE Transactions on neural networks
5, 5 (1994), 792–802.
[18] A. Ng et al. 2018. Recurrent neural network: gated recurrent unit (GRU).
https://www.youtube.com/watch?v=xSCy3q2ts44
[19] J. Ott et al. 2016. Recurrent neural networks with limited numerical precision.
arXiv:1608.06902 (2016).
[20] M. Rastegari et al. 2016. Xnor-net: Imagenet classification using binary convo-
lutional neural networks. In European Conference on Computer Vision. Springer,
525–542.
[21] R. Reed et al. 1995. Similarities of error regularization, sigmoid gain scaling,
target smoothing, and training with jitter. IEEE Transactions on Neural Networks
6, 3 (1995), 529–538.
[22] D. Rezende et al. 2014. Stochastic backpropagation and approximate inference in
deep generative models. arXiv:1401.4082 (2014).
[23] R. Salakhutdinov et al. 2009. Semantic hashing. International Journal of Approxi-
mate Reasoning 50, 7 (2009), 969–978.
[24] A. Shafiee et al. 2016. ISAAC: A convolutional neural network accelerator with
in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture
News 44, 3 (2016), 14–26.
[25] S.-S. Sheu et al. 2011. A 4Mb embedded SLC resistive-RAM macro with 7.2 ns
read-write random-access time and 160ns MLC-access capability. In IEEE ISSCC.
200–202.
[26] L. Song et al. 2017. Pipelayer: A pipelined reram-based accelerator for deep
learning. In IEEE HPCA. 541–552.
[27] P. Warden. 2018. Speech commands: A dataset for limited-vocabulary speech
recognition. arXiv:1804.03209 (2018).
[28] S.-Y. Wu et al. 2014. An enhanced 16nm CMOS technology featuring 2nd
generation FinFET transistors and advanced Cu/low-k interconnect for low
power and high performance applications. In IEEE IEDM. IEEE, 3–1.
[29] J. Zhang et al. 2016. A machine-learning classifier implemented in a standard 6T
SRAM array. In IEEE VLSI-Circuits. 1–2.
