We present a new algorithm for training neural networks with binary activations and multi-level weights, which enables efficient processing-in-memory (PIM) circuits with eNVM. Binary activations obviate costly DACs and ADCs. Multi-level weights leverage multi-level eNVM cells. Compared with previous quantization algorithms, our method not only works for feed-forward networks, including fully-connected and convolutional networks, but also achieves higher accuracy and noise resilience for recurrent networks. In particular, we present an RNN trigger-word detection PIM accelerator whose modeling results demonstrate high performance using our new training algorithm.
INTRODUCTION
Processing-in-memory (PIM) architectures for hardware neural network (NN) inference have gained increasing traction as they solve the memory bottleneck of traditional von Neumann architectures [24]. While PIM architectures apply to a variety of memories, including SRAM and eDRAM, it is especially advantageous to use emerging embedded non-volatile memories (eNVM), due to their higher storage density and multi-level-cell (MLC) capability [25][2][15]. Moreover, their non-volatility and power efficiency are especially well suited for inference tasks that require relatively fixed NN parameters. PIM with eNVMs not only avoids high-energy, long-latency, off-chip DRAM accesses by densely storing all NN parameters on-chip, but it also minimizes inefficient on-chip data movement and intermediate data generation by embedding the critical multiplication and accumulation (MAC) computations within the memory arrays.
The resulting highly efficient MAC computations within the memory arrays are, however, analog in nature, which poses a few challenges. First, analog computations are vulnerable to noise from the memory devices and circuits, mandating sufficient noise resilience of the NN models. Second and more importantly, typical NNs use high-precision neuron activations that require DACs and ADCs (for feeding inputs to and resolving MAC outputs from the memory arrays, respectively), which introduce significant area, power, and latency overhead. The die photo in [29] shows that the 5-bit input DACs consume more than one third of the area of the SRAM PIM array, which translates to even greater relative overhead if the SRAM were replaced by a much denser eNVM. A thorough design space exploration of a ReRAM PIM architecture in [24] shows that the optimal configuration shares one 8-bit ADC among all 128 columns of a 128x128 memristor array; yet this single ADC still occupies 48 times the area and consumes 6.7 times the energy of the entire memristor array. In fact, due to the prohibitively high cost of ADCs and DACs, many PIM designs resort to feeding and/or resolving activations 1 bit at a time in a sequential fashion, which incurs huge latency penalties [24][26]. Such high overhead from ADCs and DACs completely defeats the original goal of attaining speed, power, and area efficiency with PIM. In addition, the weights of conventional NNs usually require higher resolution (e.g., 8 bits) than is available in typical MLCs (e.g., 2 or 3 bits), which requires each weight to be implemented by combining many memory cells and compromises area efficiency [24].
Reducing the bit precision of activations and weights can mitigate DAC and ADC overhead and reduce the number of cells needed for each weight. Hence, some PIM designs implement binary neural networks (BNN), with both binary weights and binary activations, obviating DACs and ADCs entirely and needing only one single-level cell (SLC) per weight [12]. However, BNNs are not optimal for two reasons. First, the most popular BNN training algorithms use the straight-through estimator (STE) to get around the non-differentiability of BNNs during backpropagation [3][9][20]. However, as we will show in Sections 4 and 5, STE for binarizing activations is only effective for training feedforward NNs, such as fully-connected (FC) and convolutional NNs (CNN), but works poorly for recurrent NNs (RNN). Second, binary weights in BNNs are too stringent to maintain high inference accuracy, and they do not take full advantage of the MLC capabilities of eNVMs.
Our major contributions are:
• To enable the ideal DAC/ADC-free PIM architecture with dense MLC eNVMs, we design a new algorithm to train NNs with binary activations (BA), which avoid DACs and ADCs, and multi-level weights (MLW), which take full advantage of dense MLCs. This algorithm achieves higher inference accuracy than using STE for training BA-MLW RNNs.
• We design a DAC/ADC-free trigger-word detection PIM accelerator with MLC eNVM, using a BA-MLW gated recurrent unit (GRU), which demonstrates our algorithm's high performance on RNNs. We build a noise model of the accelerator using eNVM measurement data and circuit simulations, and verify the superior accuracy and noise resilience of our BA-MLW GRU compared with alternative algorithms.
• We also apply our algorithm to LeNet5 for MNIST, which shows the effectiveness of our BA-MLW algorithm on feedforward networks.
BACKGROUND
MLC eNVM for PIM. Before discussing the PIM architecture, it is necessary to introduce the technologies available for its critical building block: the memory array. Although the memory array can be built with conventional SRAM, it is more advantageous to use eNVM, including traditional embedded Flash (eFlash) [7], emerging resistive RAM (ReRAM) [25] and phase change memory (PCM) [2], or, more recently, the purely-CMOS MLC eNVM (CMOS-MLC) [15].
Figure 1: (a) The conventional PIM architecture, with DACs and ADCs for high-precision activations and a combination of many cells for each high-resolution weight. (b) A DAC/ADC-free PIM architecture implementing BNNs, using SLCs for binary weights. (c) The optimal DAC/ADC-free PIM architecture implementing our BA-MLW NNs, using a pair of MLCs for each MLW.
Compared with SRAM, which is inherently binary (single-level cells, SLC), eNVM's SLCs offer much higher area efficiency, and eNVM is often analog in nature, which enables MLC capability for even higher storage density. Programming eNVM typically involves a continuous change in the conductivity of the memory devices, enabling them to store multiple levels of transistor channel current in the cases of eFlash and CMOS-MLC, or multiple levels of resistor conductivity in the cases of ReRAM and PCM. The programming speed of eNVM is much slower than SRAM, but the NN parameters are written only once and held constant during inference, rendering programming speed non-critical for inference-only applications. In fact, eNVMs' non-volatility offers energy savings and avoids having to reload the weights at power-up.
PIM for NN inference. The conventional PIM architecture for NN inference is shown in Figure 1a [24]. A weight matrix of the NN is directly mapped into the memory array, and since the weights of typical NNs usually require high resolution (e.g., ≥8 bits), multiple lower-resolution memory cells are combined to represent one weight, at the expense of area density [24]. This PIM structure can perform a matrix-vector multiplication in one step: all the input activations are fed into the WLs simultaneously as voltage signals from the WL-DACs, generating a current through each memory cell that is proportional to the product of its input and conductance, and the MAC results are accumulated along the corresponding BLs and resolved by column ADCs before being sent to the nonlinearity function unit. For implementing conventional NNs with high-precision activations, the DACs and ADCs are necessary for translating input activations into analog voltage signals and digitizing BL currents into MAC outputs. However, as discussed in Section 1, these DACs and ADCs result in huge area and power overhead.
Quantization and BNN. NN quantization algorithms have been proposed to reduce the bit widths of weights and/or activations while maximizing accuracy [9][20][14], with the original goal of saving storage and computation. For PIM implementations, quantization algorithms can also relieve AD/DA resolution requirements for activations; in particular, BNNs with 1-bit activations and weights translate into much simpler PIM circuits (Figure 1b) than the conventional architecture (Figure 1a): since the activations are binary, the WL-DACs and BL-ADCs in Figure 1a can be replaced by simple WL-drivers and sense-amp comparators, respectively, both of which are compact peripheral components in standard memories. However, PIM implementations of BNNs have two major problems. First, as shown in Figure 1b, the binary weights can only use eNVM cells as 1-bit SLCs, wasting the MLC capability. Second, the most popular existing algorithm for training binary activations uses STE [3][9] (Section 3.1.1), which is only effective for feedforward NNs but performs poorly on RNNs.
TRAINING BA-MLW NNS FOR OPTIMAL PIM IMPLEMENTATION
To avoid DACs and ADCs and fully leverage MLCs, we identify the BA-MLW NN structure as optimal for PIM implementation, shown in Figure 1c. For simplicity, we show the memory devices as resistor cells (omitting access transistors), corresponding to ReRAM or PCM, with their conductivities encoding the weights; they could equally be transistor cells, corresponding to eFlash or CMOS-MLC, whose channel currents encode the weights. The current differential of a pair of MLCs represents one weight, which can be either negative or positive. BAs only require WL-drivers and sense-amps instead of DACs and ADCs, and each MLW uses only one pair of MLCs for maximal storage density. To effectively train BA-MLW NNs, we develop a new algorithm that achieves high accuracy and resilience to quantization and noise not only for feedforward NNs, but also for RNNs.
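As an illustration of this structure (our sketch, not circuitry from the paper), the following Python model captures one step of the BA-MLW array in Figure 1c: binary inputs drive the word lines, each signed weight is the differential of two non-negative cell conductances, and a per-column sense-amp comparator replaces the ADC; bias terms and device non-idealities are omitted.

```python
import torch

def ba_mlw_mac(x_bin, g_pos, g_neg):
    """Idealized one-step model of the BA-MLW PIM array (illustrative sketch).
    x_bin : (N,) binary word-line inputs (0/1), so no DACs are needed.
    g_pos, g_neg : (N, M) non-negative conductances of the MLC pairs; each
    signed weight W[i, j] is encoded as g_pos[i, j] - g_neg[i, j]."""
    i_pos = x_bin @ g_pos              # current summed on the positive bit lines
    i_neg = x_bin @ g_neg              # current summed on the negative bit lines
    return (i_pos > i_neg).float()     # differential sense-amp replaces the ADC
```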
Training binary activations (BA)
Binarizing the activations while maintaining high performance is challenging, because it not only restricts the expressive capacity of the neurons, but also introduces discrete computation nodes that preclude gradient propagation during training. We first review the previous STE algorithm, and then present our new and more effective BA training algorithm.
3.1.1 Reviewing STE. STE applies to the following stochastic binary neuron (SBN) [3]. During forward propagation of training, each neuron generates a binary output from a Bernoulli sample:
$$x_{SBN} = \mathrm{Bernoulli}\big(\mathrm{sigmoid}(s \cdot x)\big) \qquad (1)$$
in which x is a pre-activation from the linear MAC, and the logistic sigmoid function has a tunable slope s [5]. The SBN function is discrete with random sampling, and thus does not have a well-defined gradient, but STE simply passes through the gradient of the continuous sigmoid function during backpropagation:
$$\frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial x_{SBN}} \cdot \frac{\partial\,\mathrm{sigmoid}(s \cdot x)}{\partial x} \qquad (2)$$
in which L is the loss function. In other words, STE ignores the random discrete sampling process and pretends that the forward propagation was done with a sigmoid function. The issue with STE is that propagating gradients w.r.t. the sample-independent mean ($\bar{x}_{SBN} = \mathrm{sigmoid}(s \cdot x)$) while ignoring the random sampling outcome can cause discrepancies between the forward and backward pass [10]. In fact, STE is a biased estimator of the expected gradient, and cannot even guarantee the correct sign when back-propagating through multiple hidden layers [3]. Nonetheless, STE has been found to work better in practice than other, more complicated gradient estimators for feedforward NNs [3], as we also verify in Section 5. However, as shown in Section 4, STE performs poorly when training RNNs with BAs.
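For concreteness, a minimal PyTorch sketch of the SBN with STE (our illustration; the class and variable names are ours) samples from the Bernoulli distribution in the forward pass and substitutes the sigmoid gradient in the backward pass:

```python
import torch

class StochasticBinaryNeuron(torch.autograd.Function):
    """SBN with straight-through estimator, following Equations 1 and 2 (sketch)."""

    @staticmethod
    def forward(ctx, x, s):
        p = torch.sigmoid(s * x)        # Bernoulli mean with tunable slope s
        ctx.save_for_backward(p)
        ctx.s = s
        return torch.bernoulli(p)       # random binary output in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        p, = ctx.saved_tensors
        # Straight-through: pretend the forward pass was the smooth sigmoid,
        # ignoring the discrete sampling step.
        return grad_out * ctx.s * p * (1.0 - p), None

# usage: y = StochasticBinaryNeuron.apply(pre_activation, 3.0)
```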
3.1.2 Our noisy neuron annealing (NNA) algorithm. We use the following noisy continuous neuron (NCN) function during the forward pass of training:
$$x_{NCN} = \mathrm{sigmoid}\!\left(\frac{x + n_{train}}{\tau}\right) \qquad (3)$$
in which we add an iid zero-mean Gaussian random variable (RV) n_train ∼ N(0, σ²_train) to each pre-activation before passing it into a continuous sigmoid function with temperature τ. Equation 3 can be broken down into two steps: the noise injection step x̃ = x + n_train, and a continuous relaxation step x_NCN = sigmoid(x̃/τ). The idea behind the noise injection step is that binarizing activations and quantizing weights introduces quantization errors that flow forward through the MAC and turn into random noise added to the pre-activations. Therefore, if we train the NN with noise explicitly added to the pre-activations, the NN develops resilience to these quantization errors. The continuous relaxation step is inspired by the Gumbel-softmax trick [10][16], which uses a sharpened sigmoid to approximate the binary step function while still allowing smooth gradients to flow. As we demonstrate with more experimental detail in Section 4, it is important to start with a large value of the hyperparameter σ_train to train with large noise, and later anneal it down to a small value, hence the name of our training strategy: the "noisy neuron annealing" (NNA) algorithm.
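A minimal sketch of the NCN forward pass and the annealing idea (our illustration; the function name and the schedule in the comment are ours, with σ values taken from Section 4):

```python
import torch

def ncn(x, sigma_train, tau=0.3):
    """Noisy continuous neuron used in the forward pass of NNA training (Eq. 3):
    inject iid Gaussian noise into the pre-activation, then apply a
    temperature-sharpened sigmoid so gradients still flow."""
    x_noisy = x + sigma_train * torch.randn_like(x)   # noise injection step
    return torch.sigmoid(x_noisy / tau)               # continuous relaxation step

# Annealing schedule (sketch): retrain from the FP baseline with a large
# sigma_train (sigma_L, e.g. 1.6 for the GRU in Section 4), then fine-tune
# with a small sigma_train (sigma_S, e.g. 0~0.5):
#   sigma_train = sigma_L if epoch < anneal_epoch else sigma_S
```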
Related to the additive noise in variational autoencoders, the Gaussian noise distribution complies with the "mean and variance" form required by the reparameterization trick [16], and it has a convenient Gaussian gradient identity [22] that allows the order of taking the expectation and taking the derivative to be exchanged, such that gradients can flow without encountering random sampling nodes. Combined with the continuous relaxation step, backpropagation through the entire NCN function can be done legitimately without encountering any discrete or sampling node:
$$\frac{\partial L}{\partial x} = \mathbb{E}_{n_{train}}\!\left[\frac{\partial L}{\partial x_{NCN}} \cdot \frac{1}{\tau}\,\mathrm{sigmoid}'\!\left(\frac{x + n_{train}}{\tau}\right)\right] \qquad (4)$$
Table 1: An example of mapping 7-level weights into the current magnitudes of a pair of 4-level cells. I_fs is the full-scale current that corresponds to α, the weight clipping scale factor. I⁻_cell and I⁺_cell are the negative and positive cell currents.
We use the following noisy binary neuron (NBN) for inference:
$$x_{NBN} = \mathrm{step}(x + n_{eval}) \qquad (5)$$
which also has an additive iid Gaussian RV n_eval ∼ N(0, σ²_eval), but replaces the continuous sigmoid in NCN with a discrete step function. In the experiments of Sections 4 and 5, we evaluate the noise resilience of trained NNs by sweeping σ_eval.
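A matching sketch of the NBN used at evaluation time, together with the σ_eval sweep (illustrative; the sweep values and the `evaluate` helper are placeholders, not the ones used in the experiments):

```python
import torch

def nbn(x, sigma_eval):
    """Noisy binary neuron used at inference (Eq. 5): the same additive Gaussian
    noise, but the sigmoid is replaced by a hard step, as in the PIM sense-amp."""
    x_noisy = x + sigma_eval * torch.randn_like(x)
    return (x_noisy > 0).float()

# Noise-resilience sweep (sketch; `evaluate` is a hypothetical helper):
# for sigma_eval in [0.0, 0.4, 0.8, 1.6, 3.2]:
#     accuracy = evaluate(model, activation=lambda x: nbn(x, sigma_eval))
```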
Previous work has studied the regularization effect of noise injection in terms of its impact on NNs' generalization and noise resilience [17][11][1][21]. Using a Taylor expansion of the loss function, these works derive that the effect of adding Gaussian noise is to add an extra regularization penalty term to the original loss function L, such that the effective loss becomes:
$$L_{eff} = L + P, \qquad P = \frac{\sigma_{train}^2}{2}\sum_i \mathbb{E}\!\left[\left(\frac{\partial L}{\partial x_i}\right)^{2}\right] \qquad (6)$$
in which x_i denotes a noise-injected node; in our case, the x_i include all the pre-activations. The regularization term P penalizes large gradients of L w.r.t. the noise-injected nodes, encouraging these nodes to find "flatter regions" of the solution space that are less sensitive to noise perturbations, and the hyperparameter σ_train controls the tradeoff between reducing the raw error L and enhancing noise resilience. Specifically, for NCN activations to counteract the noise, they tend to give up the highly expressive but noise-prone transition region of the sigmoid and develop a bimodal pre-activation distribution that pushes them into the saturated regions close to 1 or 0, which are highly immune to noise [23]. Equation 6 also provides a quantitative metric to estimate the noise resilience an NN acquires during training, and we derive a detailed form of this penalty term in Section 4 for comparing amongst NNs.
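As a way to monitor this penalty during training, one can accumulate squared gradients of the loss w.r.t. the noise-injected pre-activations; the sketch below is our construction (the scaling constant follows our statement of Equation 6, and `pre_activations` is assumed to be a list of graph tensors collected during the forward pass):

```python
import torch

def penalty_estimate(loss, pre_activations, sigma_train):
    """Mini-batch estimate of the regularization penalty P in Eq. 6:
    (sigma_train^2 / 2) * sum over noise-injected pre-activations of (dL/dx_i)^2."""
    grads = torch.autograd.grad(loss, pre_activations, retain_graph=True)
    squared = sum((g ** 2).sum() for g in grads)
    return 0.5 * sigma_train ** 2 * squared
```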
Training multi-level weights (MLW)
Our NNA algorithm not only endows the NN with high resilience to binarizing activations, but also enables MLWs that leverage dense MLCs with high inference accuracy. Each weight can be quantized down to a small number of levels that can be encoded with one pair of MLCs (Figure 1c), as opposed to combining more memory cells for high-resolution weights (Figure 1a). To quantize MLWs from full-precision (FP) weights, we first determine a suitable clipping range [−α, α] for each weight matrix based on the weight distribution statistics from pre-training with FP weights. Then, during fine-tuning with weight quantization, we first clip each weight matrix into [−α, α] and then quantize the weights into evenly spaced levels within this range. We follow the same practice as [9] for training: we use the quantized weights in the forward pass, but keep the FP weights and accumulate gradients onto them in the backward pass. After training is done, the FP weights can be discarded, and only the quantized weights are used in inference. For the special cases of 3- and 2-level weights, we use the training algorithm in [14] for 3-level (ternary) and [20] for 2-level (binary) weights.
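A minimal sketch of this fine-tuning step (our illustration; the detach-based trick in the comment is one common way to route gradients onto the kept FP weights, not necessarily the exact implementation used here):

```python
import torch

def quantize_weights(w_fp, alpha, num_levels=7):
    """Clip full-precision weights to [-alpha, alpha] and quantize them to
    `num_levels` evenly spaced levels (forward-pass weights during fine-tuning)."""
    w_clipped = torch.clamp(w_fp, -alpha, alpha)
    step = 2 * alpha / (num_levels - 1)          # spacing between adjacent levels
    return torch.round(w_clipped / step) * step

# Forward pass uses the quantized weights while gradients accumulate onto the
# full-precision copy, e.g.:
#   w_q = w_fp + (quantize_weights(w_fp, alpha) - w_fp).detach()
```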
Note that the matrix-wise scale factors α are only relevant during training, since they affect the gradient values; when a binary step activation function is used for inference, only the relative values of the weight levels matter. So, when mapping the weight matrices into the PIM arrays' current magnitudes, we can re-scale the weights into suitable MLC current magnitudes as long as their relative ratios are maintained. Table 1 shows an example of using the current differential of a pair of 4-level (2-bit) MLCs to represent a 7-level weight. Following the same principle, a 15-level weight can be encoded with a pair of 8-level (3-bit) MLCs, and a 3-level weight can use a pair of binary cells (1-bit, SLC).
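For illustration, one plausible assignment consistent with the Table 1 caption (the exact table entries are our assumption: non-negative levels go on the positive cell, negative levels on the negative cell, with cell currents spaced by I_fs/3):

```python
def weight_to_cell_currents(level, i_fs):
    """Map a 7-level weight in {-3,...,+3} onto (I+_cell, I-_cell) of a pair of
    4-level cells with current levels {0, i_fs/3, 2*i_fs/3, i_fs} (sketch).
    The signed weight is represented by the differential I+_cell - I-_cell."""
    assert -3 <= level <= 3
    if level >= 0:
        return level * i_fs / 3.0, 0.0
    return 0.0, -level * i_fs / 3.0
```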
A TRIGGER WORD DETECTION PIM ACCELERATOR USING BA-MLW GRU WITH MLC ENVM
Trigger word detection is an important always-on task in speech-activated edge devices, for which power and cost efficiency are paramount. We use the Speech Commands dataset [27], which consists of over 105,000 audio clips of various words uttered by thousands of different people, with 12 classification categories: 10 designated keywords, silence, and unknown words.
The NN model structure. An RNN is well suited to this speech recognition task; in particular, we use a 2-layer gated recurrent unit (GRU, [4]) with BA and MLW, shown in Figure 2a. We first perform an FFT (window=16ms, stride=8ms) on the raw audio signals, and then use MFCC to extract 40 coefficients per 8ms timestep. Each MFCC vector passes through an input FC layer that encodes it into a 128-dimensional binary vector as the input to the first layer of a 2-layer stacked GRU (both layers use 128-dimensional vectors). After the GRU processes inputs from all the timesteps, the final-timestep output of the top layer is fed into an output FC layer followed by a 12-way softmax to get a classification result. We use the following version of the GRU equations [18]:
$$\tilde{G}^{l\langle t\rangle} = W_G^{l}\,[H^{l-1\langle t\rangle};\, H^{l\langle t-1\rangle}] + b_G^{l} \qquad (7)$$
$$G^{l\langle t\rangle} = f(\tilde{G}^{l\langle t\rangle}) \qquad (8)$$
$$\tilde{C}^{l\langle t\rangle} = W_C^{l}\,[H^{l-1\langle t\rangle};\, H^{l\langle t-1\rangle}] + b_C^{l} \qquad (9)$$
$$C^{l\langle t\rangle} = f(\tilde{C}^{l\langle t\rangle}) \qquad (10)$$
$$H^{l\langle t\rangle} = G^{l\langle t\rangle}\odot H^{l\langle t-1\rangle} + (1 - G^{l\langle t\rangle})\odot C^{l\langle t\rangle} \qquad (11)$$
in which t denotes the timestep, l is the layer number, and the gate G^l<t> (l=1, 2), candidate C^l<t> (l=1, 2), hidden state H^l<t> (l=1, 2), and input encoding H^0<t> are all 128-dimensional activation vectors, trained using the NNA algorithm. The activation function f refers to NCN (Equation 3) during training and NBN (Equation 5) for evaluation, and is simply the binary step function for PIM deployment (Figures 2a and 2b). Compared with the original form of the GRU equations [4], we remove the reset gate, since we find it has minimal effect on the accuracy of this task, and this version of the GRU equations greatly simplifies the circuit design (Figure 2b).
PIM circuit design for the GRU. The BA-MLW GRU equations can be mapped onto the elegant PIM circuits shown in Figure 2b. After the MLC array finishes the MAC and resolves the binarized G^l<t> and C^l<t> in the sense-amps, G^l<t> serves as the multiplexer selection signal (since 1 − G^l<t> = !G^l<t> in Equation 11 for binary signals) to choose either to keep the binary hidden state saved from the previous timestep or to update it with the current binary candidate state. There is no need for an ADC, a DAC, or any explicit analog signal processing for the GRU computations, since all the analog MAC signals are encapsulated inside the MLC array with all input/output interface signals being binary, and the GRU logic outside the array is entirely digital.
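A functional sketch of this reset-gate-free cell (our code; layer and variable names are ours, and `f` stands for NCN during training or the hard step/NBN at evaluation):

```python
import torch
import torch.nn as nn

class BinaryGRUCell(nn.Module):
    """Reset-gate-free GRU cell following the update rule of Eq. 11 (sketch)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)   # gate MAC
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)   # candidate MAC

    def forward(self, x, h_prev, f):
        xh = torch.cat([x, h_prev], dim=-1)
        g = f(self.gate(xh))                 # binarized gate G
        c = f(self.cand(xh))                 # binarized candidate C
        return g * h_prev + (1.0 - g) * c    # keep the old state or update it
```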
Hardware noise model. Figure 2c shows how we model the hardware noise from the eNVM devices and the circuits. For the eNVM devices, we consider ReRAM, PCM, and CMOS-MLC, using the measured data from [25], [2], and [15], respectively. Non-idealities of the eNVM devices include the static distributions of the different MLC levels after programming, drift during retention, and random telegraph noise (RTN) [6], all of which have much slower frequency ranges than the intended circuit clock rate (>10∼100MHz), so we lump these noise sources into a static error model of the MLC devices. For modeling the circuit noise, we simulate with TSMC's 16nm FinFET process [28], and consider thermal and shot noise from the memory array and load devices, which are wide-band white noise sources manifesting as kT/C noise sampled at the sense-amp's input. We use the offset-canceling sense-amp from [12], and the total sampling capacitance is estimated to be C_IN + C_BL ≈ 2fF, contributing kT/C noise with STD=2mV sampled at the differential input of the sense-amp. Based on the NN algorithm's statistics, the pre-activation distributions have maximum values between −30∼+30, and we allocate a dynamic voltage range of −600mV∼+600mV for the sense-amp's differential input to accommodate this algorithmic range. Therefore, the 2mV kT/C noise translates to dynamic algorithmic noise of magnitude σ_kT/C = 0.1 added to the pre-activations. We consider both the static errors from the eNVM devices and the dynamic noise from the circuits when we verify the hardware performance using our noise model.
Noise-induced loss penalty terms for GRU. Using the GRU equations, we can derive the noise-induced regularization penalty term P in Equation 6 by taking the derivatives of the loss L w.r.t. the pre-activations of the candidates (C̃^l<t>) and gates (G̃^l<t>):
$$P = \frac{\sigma_{train}^2}{2}\sum_{l,i,t}\left[\left(\frac{\partial L}{\partial \tilde{g}_i^{l\langle t\rangle}}\right)^{2} + \left(\frac{\partial L}{\partial \tilde{c}_i^{l\langle t\rangle}}\right)^{2}\right] = P_g + P_c \qquad (12)$$
$$P_g = \frac{\sigma_{train}^2}{2}\sum_{l,i,t}\left[\frac{\partial L}{\partial h_i^{l\langle t\rangle}}\left(h_i^{l\langle t-1\rangle} - c_i^{l\langle t\rangle}\right)\mathrm{sigmoid}'\!\left(\tilde{g}_i^{l\langle t\rangle}\right)\right]^{2} \qquad (13)$$
$$P_c = \frac{\sigma_{train}^2}{2}\sum_{l,i,t}\left[\frac{\partial L}{\partial h_i^{l\langle t\rangle}}\left(1 - g_i^{l\langle t\rangle}\right)\mathrm{sigmoid}'\!\left(\tilde{c}_i^{l\langle t\rangle}\right)\right]^{2} \qquad (14)$$
To minimize P_g, one way is to reduce the derivative of the gate (sigmoid′(g̃_i^l<t>)) by pushing g̃_i^l<t> away from zero (the steep-slope region of the sigmoid), so that the gate is firmly on or firmly off; alternatively, training can make the candidate of the current timestep c_i^l<t> equal to the hidden state of the previous timestep h_i^l<t−1>, such that the new hidden state h_i^l<t> is the same no matter whether g_i^l<t> is 1 or 0. Either way, minimizing P_g makes the computed h_i^l<t> immune to noise injected into the gate pre-activation. Similarly, minimizing P_c will either reduce the gradient of the candidate (sigmoid′(c̃_i^l<t>)) by pushing c̃_i^l<t> into the saturated flat regions of the sigmoid, or it will turn on the gate g_i^l<t> to preserve the hidden state from the previous timestep h_i^l<t−1> without caring about the new candidate c_i^l<t>. Either way, it desensitizes h_i^l<t> to noise injected into the candidate pre-activation. Equations 12, 13, and 14 provide a quantitative metric to assess the resilience to quantization and noise of a trained GRU, and we use them to compare training schemes in the next section.
Training scheme comparisons. As a full-precision (FP) baseline, we first train the NN with FP sigmoid activations and FP weights, without noise injection or quantization. During inference, we add noise n_eval ∼ N(0, σ²_eval) to its pre-activations and evaluate it with both FP sigmoid activations (baseline-A_FP-W_FP, red) and binarized activations (baseline-BA-W_FP, yellow), shown in Figure 3a.
Trained without noise injection, baseline-A_FP-W_FP cannot maintain its high accuracy at large σ_eval, making it vulnerable to quantization errors and leading to the poor performance of baseline-BA-W_FP.
To endow the NN with resilience to quantization errors, we use our NNA algorithm and retrain from the FP sigmoid baseline (which is a good initialization that speeds up retraining). Initially, we use a large training noise σ_train = σ_L = 1.6, and plot the inference accuracy with BAs and 7-level weights (NNA(σ_L)-BA-W_7) in Figure 3a. It indeed gains a much wider tolerance range of σ_eval, and thus stronger resilience to quantization errors and higher accuracy than baseline-BA-W_FP. However, the accuracy peaks at a large σ_eval around σ_L and drops at small σ_eval, because the network is trained to minimize its loss in the presence of this large additive noise. This noise-resilience profile might suit certain severely noisy circuit environments, for example large power supply noise. However, to also achieve high accuracy at small σ_eval, we anneal σ_train down to a small σ_S and retrain further. As shown in Figure 3a (NNA(σ_L→σ_S)-BA-W_7), peak accuracy improves further and is achieved at smaller σ_eval as well. NNA's annealing procedure from σ_L to σ_S is critical: as shown in Figure 3a (NNA(σ_S)-BA-W_7), directly retraining with σ_S from the baseline, without the intermediate σ_L stage, results in much worse accuracy and noise tolerance than NNA(σ_L→σ_S)-BA-W_7.
Figure 4: Accuracy distributions of the BA-MLW GRU deployed on PIM circuits with PCM, ReRAM, and CMOS-MLC, evaluated with the hardware noise model (Figure 2c). We run each experiment 100 times with different samples of static MLC device errors and dynamic circuit noise, and show the accuracy distributions using violin plots.
To understand why the annealing procedure of NNA is critical, we compare the loss penalty terms (P_g + P_c from Equations 13 and 14) of the 4 networks in Figure 3a. As shown in Table 2, we normalize them by the penalty value of baseline-A_FP-W_FP, since only the relative values matter for comparison. After retraining with σ_L, the penalty term of NNA(σ_L)-BA-W_7 drops to half of the baseline penalty, explaining its higher resilience to noise and quantization errors. After further retraining with the annealed σ_S, NNA(σ_L→σ_S)-BA-W_7 maintains this small penalty value, even though σ_S provides less regularization effect: the smaller multiplier σ²_S in Equation 12 (compared to the previous σ²_L) makes the retraining prioritize reducing the raw error (and thus achieve higher peak accuracy) over enhancing noise resilience. This means the network can still "memorize" the regularization effect of its earlier large-noise training even after annealing to fine-tune with smaller noise. In contrast, without the intermediate "experience" of large-noise training, NNA(σ_S)-BA-W_7 has much less regularization effect to reduce its loss penalty. It should also be pointed out that, if evaluated with FP activations and zero noise, all these networks achieve similarly high accuracy to the purely FP sigmoid network, but their accuracy and noise resilience differ dramatically once binary activations are used. This implies that, trained through different stages, the 4 networks in Figure 3a find distinct regions in the global solution space: retraining with σ_L finds a promising solution region that is insensitive to quantization errors, while the further fine-tuning with σ_S only does a local search to optimize accuracy in the small-noise range; in contrast, training without noise injection, or only with small noise, does not discover the solution region that is resilient to noise and quantization, due to the lack of a regularization penalty.
As also mentioned in Section 3.1.2, the reason that tolerance to additive noise leads to resilience to quantization is that binarizing activations and quantizing weights introduces quantization errors into the NN computations. By the central limit theorem, during the MAC the summation of these quantization errors (which are iid RVs) forms a Gaussian noise distribution added to the pre-activations. Therefore, if the NN is resilient to additive Gaussian noise, it also tends to be resilient to quantization errors. Theoretically, after the MAC, the effective standard deviations (STD) of the noise due to the summation of activation binarization errors and weight quantization errors can be expressed as $\sigma_a \sqrt{\sum_i W_i^{2}}$ and $\sigma_w \sqrt{\sum_i X_i^{2}}$,
respectively, in which W and X are the weights and inputs, and σ_a and σ_w are the average quantization error STDs of a single activation element and weight element, respectively. During training with the NNA algorithm, the statistics of W, X, σ_a, and σ_w change dramatically to achieve noise resilience, and in practice we find the choice of the hyperparameter σ_train quite flexible. For the initial large noise, we set σ_L to about 20% of the STD of the inherent pre-activation distribution, corresponding to σ_L = 1.6 for this GRU, though a wide range of values works similarly well; for the annealed noise, any σ_S in 0∼0.5 achieves optimal final results. For the temperature τ in NCN, we find 0.3 to be optimal: too small a value causes the RNN's exploding gradient problem, while too large a value cannot approximate BA well enough.
Weight quantization levels. Figure 3b shows the GRU's performance using different numbers of weight quantization levels, which confirms the high resilience to weight quantization achieved with our NNA algorithm. The performance of 7-level (implemented with a pair of 4-level cells) and 5-level (a pair of 3-level cells) quantization is almost the same as with FP weights, while 7-level weights achieve the highest peak accuracy at the lower end of the σ_eval range. Ternary weights (a pair of 2-level cells) show degradation, especially in the large σ_eval range, but still maintain decent accuracy close to FP weights in the small σ_eval range. Binary weights (which can be implemented with an SRAM cell), however, suffer more serious accuracy degradation in both the low and high σ_eval ranges. Our NNA algorithm's high resilience to weight quantization makes it possible to use a pair of 4- or 3-level MLCs to achieve FP-weight performance, or even 2-level cells if the hardware noise is within the small σ_eval range. This avoids the need to combine multiple cells to achieve high-resolution weights (Figure 1a, [24]), which greatly saves area and simplifies the memory array and logic design.
Our NNA algorithm vs STE. We also experiment with the SBN trained with STE, and compare its performance with our NNA algorithm in Figure 3c. We try 3 different settings for the slope s: s=1, s=3, and annealing s from 1 to 3 (s=1→3, [5]). However, all the GRUs trained with STE perform significantly worse than the one trained with NNA (NNA(σ_L→σ_S)-BA-W_7), which shows STE's ineffectiveness for binarizing activations in recurrent nets, and thus the necessity and advantage of our new NNA algorithm. We should also point out an important distinction between the forms of the SBN (Equation 1) and our NCN (Equation 3): the SBN has only one parameter s, whereas the NCN has two degrees of freedom, τ and σ_train. On one hand, s controls the sigmoid slope (corresponding to 1/τ in NCN), and similar to τ, we find s needs to be no greater than about 3 in order not to run into the exploding gradient problem. On the other hand, s also controls the stochasticity of the Bernoulli sampling: a smaller s introduces more randomness and thus a wider noise-resilience range, as can be seen in Figure 3c by comparing s=1, s=3, and s=1→3. However, the SBN cannot separately control the sigmoid slope and the stochasticity. In contrast, our NCN has independent controls: τ is chosen to approximate binary outputs while avoiding exploding gradients, whereas the magnitude of noise injection is separately controlled by σ_train. This flexibility enables us to effectively implement the NNA algorithm's annealing procedure using the NCN.
From a mathematical rigor point of view, in contrast to the Gaussian RVs used in NCN, the Bernoulli RVs used in SBN do not comply with the "location-scale" distribution form required by the reparameterization trick [16]. Therefore, it is not mathematically valid for STE to exchange the order of taking the expectation and taking the derivative for Bernoulli RVs (Equation 2). Increasing the slope s can alleviate the discrepancy between the forward and backward pass of the SBN (making the math less wrong, which explains the higher accuracy with s=3 in Figure 3c), but due to the lack of separate controls, changing s inevitably changes both the sampling randomness and the sigmoid's gradient, and s cannot be too large, which would cause exploding gradients.
Validation with the hardware noise model. We model the hardware performance when deploying our NNA-trained BA-MLW GRU on the PIM circuits (Figure 2b) with PCM, ReRAM, and CMOS-MLC, using the noise model in Figure 2c. We evaluate the accuracy of NNA(σ_L→σ_S)-BA-W_7, with each weight implemented by a pair of 2-bit (4-level) MLCs. With 100-point Monte Carlo simulations, Figure 4 shows the resulting accuracy distributions using violin plots. PCM, ReRAM, and CMOS-MLC achieve accuracy ranges of mean ± STD = (91.60 ± 0.34)%, (91.71 ± 0.27)%, and (91.60 ± 0.29)%, respectively, validating that PIM circuits with all three eNVM technologies can achieve performance comparable to the algorithmic accuracy (Figure 3) even in the presence of device and circuit non-idealities.
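A sketch of this Monte Carlo procedure (our construction; `evaluate_fn` and `cell_error_std` are illustrative placeholders, not the measured device statistics from [25][2][15]):

```python
import copy
import torch

def monte_carlo_accuracy(model, evaluate_fn, n_runs=100,
                         cell_error_std=0.05, sigma_ktc=0.1):
    """Each run draws one static sample of MLC programming error per weight,
    then `evaluate_fn` measures accuracy with the dynamic kT/C noise (about
    0.1 in algorithmic units, per the text) added to every pre-activation."""
    accuracies = []
    for _ in range(n_runs):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(cell_error_std * torch.randn_like(p))   # static device error
        accuracies.append(evaluate_fn(noisy, sigma_ktc))        # dynamic circuit noise
    return accuracies
```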
TRAINING FEEDFORWARD BA-MLW NN: LENET5 FOR MNIST
To demonstrate the effectiveness of our NNA algorithm on feedforward networks, we use the MNIST dataset and train BA-MLW networks using the LeNet5 architecture, which comprises 2 CNN layers followed by 3 FC layers [13]. We compare the accuracy and noise resilience of the FP baseline with our NNA algorithm and STE, and the results are shown in Figure 5. Both STE and NNA are resilient to binarizing activations, achieving peak accuracies comparable to the FP network and tolerating a wide range of σ_eval, with STE slightly outperforming NNA (Figure 5a). Both STE and NNA are also resilient to weight quantization (Figures 5b and 5c), with no loss of accuracy when quantizing weights down to 15 levels (a pair of 8-level cells), but slight accuracy degradation with 7-level weights. Compared with the GRU in Section 4, LeNet5 needs more quantization levels due to its wider weight distribution ranges and the smaller numbers of parameters in the CNN layers (especially the first CNN layer). Although not elaborated in this paper, layer-wise customized choices of quantization levels could further optimize the performance. These results verify that although STE works poorly for training BAs in RNNs, it is indeed effective for feedforward networks. Our NNA algorithm, on the other hand, works well for both recurrent and feedforward networks.
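For reference, a minimal sketch of the LeNet5-style model referred to above (our code; the layer hyperparameters follow the classic LeNet5 and may differ in detail from the trained model, and `f` is the NCN during training or the NBN/step function at evaluation):

```python
import torch.nn as nn
import torch.nn.functional as F

class LeNet5BA(nn.Module):
    """LeNet5-style network (2 conv + 3 FC layers) trained with binary
    activations via NNA (sketch)."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, padding=2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x, f):
        x = F.max_pool2d(f(self.conv1(x)), 2)   # binary feature maps after f
        x = F.max_pool2d(f(self.conv2(x)), 2)
        x = x.flatten(1)
        x = f(self.fc1(x))
        x = f(self.fc2(x))
        return self.fc3(x)                      # logits for the 10-way softmax
```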
RELATED WORK
Most previous quantization studies have focused on feedforward networks, whereas quantizing RNNs turns out to be more challenging: consistent with our results, quantization techniques that work well for feedforward NNs (e.g., STE) have been found to work poorly for RNNs [9]. Existing RNN quantization works find that, to maintain accuracy, more bits are required for RNNs than for feedforward networks, especially for the activations: previous works either use FP activations [19] or need multiple bits per activation [8][9], which would require costly DACs and ADCs for PIM implementations. In contrast, our work not only quantizes the weights but also binarizes the activations of RNNs, enabling the optimal BA-MLW RNN structure for efficient PIM implementations.
Our NNA algorithm is largely inspired by the reparameterization trick [22] and the Gumbel-softmax trick [16][10]. Introduced in the context of variational inference, the reparameterization trick reformulates the sampling process of certain probability distributions (e.g., those having a "location-scale" form), which allows the expected gradient w.r.t. the parameters of these distributions to propagate. Gumbel-softmax uses Gumbel RVs to obtain an equivalent sampling process for categorical distributions and, furthermore, uses a continuous relaxation trick to solve the gradient propagation problem of sampling from discrete distributions. [17][11][1][21] study the generalization effects of injecting noise into NNs' inputs, weights, or activations. Additive Gaussian noise has also been used for learning binary encodings of documents with a multi-layer feedforward autoencoder [23]. Our work differs from these studies in that we apply these techniques (noise injection and methods of propagating gradients through stochastic sampling nodes) to training BA-MLW RNNs for optimal PIM circuit implementations. Moreover, we discover the critical noise annealing procedure in our NNA algorithm, and we use the regularization penalty effect of noise injection to explain why our new algorithm enables high resilience to quantization and noise.
