We trained three Binarized Convolutional Neural Network architectures (LeNet-4, Network-in-Network, AlexNet) on a variety of datasets (MNIST, CIFAR-10, CIFAR-100, extended SVHN, ImageNet) using error-prone activations and tested them without errors to study the resilience of the training process. With the exception of the AlexNet when trained on the ImageNet dataset, we found that Bit Error Rates (BERs) of a few percent during training do not degrade the test accuracy. Furthermore, by training the AlexNet on progressively smaller subsets of ImageNet classes, we observed increasing tolerance to activation errors. The ability to operate with high BERs is critical for reducing power consumption in existing hardware and for facilitating emerging memory technologies. We discuss how operating at moderate BER can enable Magnetoresistive RAM with higher endurance, speed and density.
I. INTRODUCTION
Artificial Neural Networks (ANNs) are biology-inspired concepts that have in recent years revolutionized many areas of research and industry and even much of everyday life. Managing their power consumption has been one of the key challenges accompanying their emergence, especially since the advent of Deep Neural Networks (DNNs). When considering the analogy with biological intelligence we find that biology needs 4 to 5 orders of magnitude less power, primarily due to the energy efficiency of its synaptic operation, at the "expense" of a synaptic error rate of nearly 75% [23]. In this paper we explore how the presence of errors during training can impact the classification accuracy and discuss how operating at moderate Bit Error Rate (BER) facilitates Magnetoresistive RAM (MRAM) technology for ANN applications.
In all-perpendicular Spin Transfer Torque MRAM (STT-MRAM) [10, 16], a bit is stored in a Magnetic Tunnel Junction (MTJ) comprising two ferromagnetic layers separated by a thin insulating barrier. The magnetization vectors of the two ferromagnets are perpendicular to the plane of the layers and may be in a parallel (P) or antiparallel (AP) configuration. When electrical current passes through one of the ferromagnets it gets spin-filtered and the spin-polarized electrons impart spin torque [1, 24] on the other. In order to read a bit of information we must supply enough current to identify whether the MTJ is in the P (low-resistance) or AP (high-resistance) state, but not so much that the spin torque disturbs the magnetization of either of the layers. Writing a bit requires higher current than reading because one must produce enough spin torque to flip the magnetization of one of the ferromagnets; yet too high a voltage across the MTJ stresses the insulator material and degrades its endurance. The switching process is inherently stochastic and the switching probability can be calculated analytically given the MTJ parameters and the read/write pulse amplitude and duration [26]. MRAM exhibits many advantages compared to conventional memories, including non-volatility, high endurance and high density, but having to contend with its stochasticity remains a major obstacle to widespread adoption. Therefore, architectures and applications that are resilient to errors are the best candidates for MRAM.
* mtzoufras@physics.ucla.edu
Approximate computing [17, 18, 27] has been proposed as a way to trade classification accuracy for energy efficiency in inference tasks. The accuracy-power trade-off was first studied in silicon by Yang and Murmann [28], using SRAM with reduced voltage supply to train and test a three-layer Convolutional Neural Net (ConvNet) on the low-complexity MNIST handwritten digit dataset. The presence of BER due to sub-threshold voltage during training produced an increase in the classification accuracy when the SRAM was operated similarly for testing. In Ref. [29], it was shown that a deeper ConvNet trained on a moderate-complexity dataset, the CIFAR-10, is also resilient to bit errors during inference, albeit less than the three-layer ConvNet trained on MNIST. In 2018, a framework was developed to study DNN resilience during inference [22] and potential sources of errors were identified in SRAM, DRAM and flash memory.
Apart from hardware errors, the common practice of limiting the number representation and employing fixed-point arithmetic in neural network applications introduces quantization noise. This approach reduces both memory and compute requirements and has been studied extensively since the 1990s [7, 8, 11]. Recently, Gupta et al. [6] demonstrated that stochastic rounding yields superior performance when using low-precision fixed-point computations compared to the standard round-to-nearest method. Stochastic rounding is also seen as the preferred approach for the extreme case of binary representation, which has been garnering increasing interest for inference applications. In Refs. [4, 9, 20], several training algorithms were developed that enable Binarized Neural Networks (BNNs) to achieve, along with a drastic reduction in power consumption, classification accuracy comparable to non-binarized networks. Moreover, binarization of the convolution in ConvNets turns it into an XNOR operation, which leads to further enhancement in speed and energy efficiency. Accordingly, XNOR-Nets are excellent candidates for edge applications, where density and power are most constrained.
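The reduction of a binarized dot product to an XNOR operation can be illustrated with a minimal sketch. The bit-packing convention (bit 1 for +1, bit 0 for −1) and the helper names `pack_bits` and `xnor_dot` are assumptions made for this illustration and are not taken from Refs. [4, 9, 20].

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i is 1 for +1, 0 for -1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_dot(a_word, b_word, n):
    """Dot product of two length-n +/-1 vectors given as packed bit words."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask   # XNOR: 1 wherever the two bits match
    matches = bin(agree).count("1")     # popcount of the agreeing positions
    return 2 * matches - n              # (#matches) - (#mismatches)


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
reference = sum(x * y for x, y in zip(a, b))      # ordinary multiply-accumulate
result = xnor_dot(pack_bits(a), pack_bits(b), len(a))
```

Both `reference` and `result` evaluate the same dot product; the XNOR form replaces the multiplications with a bitwise operation and a population count.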
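Stochastic rounding for a BNN takes the form shown in Ref. [4]:

$$x^{b} = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{1}$$

where σ is the "hard sigmoid" function:

$$\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right), \tag{2}$$

a piecewise-linear function that performs stochastic rounding in the same manner as suggested in Ref. [6].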
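As a minimal sketch of stochastic rounding with the hard sigmoid, assuming the standard form σ(x) = max(0, min(1, (x + 1)/2)) and using Python's `random` module as a stand-in for a hardware random number source, one may write the following. The function names are illustrative only.

```python
import random


def hard_sigmoid(x):
    """Piecewise-linear 'hard sigmoid': clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def stochastic_binarize(x, rng):
    """Return +1 with probability hard_sigmoid(x), otherwise -1."""
    return 1 if rng.random() < hard_sigmoid(x) else -1


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [stochastic_binarize(0.5, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

For |x| ≤ 1 the expected value of the binarized output is 2σ(x) − 1 = x, so `sample_mean` should approach 0.5 here; this unbiasedness is the property that distinguishes stochastic rounding from round-to-nearest. Note that one random number is consumed per binarized value.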
However, generating the plethora of random numbers needed for stochastic rounding is not practical in most systems and round-to-nearest is usually chosen. Due to its stochastic nature, an MTJ can be used as an alternative tunable true random number generator for stochastic rounding but this also introduces unwieldy complexity in the circuit, namely a digital-to-analog converter to provide the current that corresponds to the desired switching probability. Instead, we examine what happens when the MRAM is operated at a constant reduced voltage, i.e. at fixed BER. This involves no additional complexity compared to standard MRAM. We may write the rounding function due to the MTJ stochasticity as:
$$x^{P} = \begin{cases} -1 & \text{with probability } 1 - p_{+1}, \\ +1 & \text{with probability } p_{+1}, \end{cases} \tag{3}$$
where $p_{-1}$ and $p_{+1}$ indicate successful write of the AP and P states respectively. Below we assume for simplicity that $p_{-1} = p_{+1} = p$. Aside from their importance for edge applications, XNOR-Nets are suitable for isolating the effect of BER in neural networks because one does not need to worry about protecting the most significant bits or exploring various fault mitigation strategies [21]. For the XNOR-Nets studied in this article, when a bit error occurs we ignore it and make no attempt at detecting, mitigating or correcting it.
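The fixed-BER rounding described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the intended bit is taken to be sign(x) with sign(0) = +1, the BER is taken to be 1 − p, and errors are applied independently to each activation. The names `sign_binarize`, `mtj_binarize` and `binarize_layer` are hypothetical.

```python
import random


def sign_binarize(x):
    """Deterministic binarization (assumption: sign(0) maps to +1)."""
    return 1 if x >= 0 else -1


def mtj_binarize(x, p, rng):
    """Write the intended bit successfully with probability p, else flip it."""
    intended = sign_binarize(x)
    return intended if rng.random() < p else -intended


def binarize_layer(activations, ber, rng):
    """Binarize a list of activations at a fixed bit error rate ber = 1 - p."""
    p = 1.0 - ber
    return [mtj_binarize(x, p, rng) for x in activations]


rng = random.Random(1)
pre_activations = [rng.uniform(-1.0, 1.0) for _ in range(50000)]
clean = [sign_binarize(x) for x in pre_activations]
noisy = binarize_layer(pre_activations, ber=0.04, rng=rng)
observed_ber = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
```

Here `observed_ber` should be close to the requested 4%. Unlike stochastic rounding, the flip probability does not depend on the magnitude of x, which is why no tunable current source is needed.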
Network weights and activations are known to have different tolerance to errors and the same holds true for the individual network layers and training epochs. Specifically, weights are expected to be less resilient to BER than activations such that the effect of weight errors would dominate the outcome if the same BER was present across all of the network variables. Here we only allow for bit errors in the binary activations during training with constant BER across all binary layers and epochs. Our guiding principle is to study the effect of BER in ANNs in the most transparent conditions. Future work will explore the effect of BER on weights as well as the combined effect of weight and activation errors.
II. TRAINING XNOR-NETS WITH BER IN THE BINARY ACTIVATIONS
We conducted experiments on three binarized ConvNets and several datasets of increasing complexity: namely the binarized LeNet-4 on the MNIST dataset (section II A), the binarized Network-in-Network on the CIFAR-10, the CIFAR-100 and the extended SVHN datasets (section II B), and finally the binarized AlexNet trained on the ImageNet dataset as well as several ImageNet subsets (section II C).
A. LeNet-4
We first present the classic combination of a LeNet Convolutional Neural Net architecture [14], one of the simplest ConvNets, trained on the low-complexity MNIST dataset. We binarized a modified LeNet-4 architecture comprising: (I) a regular convolution layer, with batch normalization and ReLU activation followed by a max-pooling layer, (II) a binary convolution block that comprises batch normalization, binary activation, and binary convolution followed by a max-pooling layer, (III) a binary fully connected layer, and (IV) a softmax classification layer. This network was trained on the 60,000 training images of the MNIST handwritten digit dataset and tested on its 10,000 test images. During training the filter weights were left error-free while the binary activations exhibited a fixed BER. We examined BERs between 0% and 16% and repeated the training process 10 times for each value of BER.
FIG. 1. Test accuracy vs training epochs for a binarized LeNet-4 network trained on the MNIST dataset.
The test accuracy is shown in Figure 1 for all of the above experiments; the average for each BER value as well as the individual traces are displayed to give a sense of the spread between runs. The accuracy gradually improved when raising the BER from BER = 0% (no errors) to BER = 4%, and plateaued between BER = 4% and BER = 8%. Increasing the BER beyond this point reduced the test accuracy. Interestingly, at BER = 16% the test accuracy was still higher than in the case where no errors were included during training, highlighting the robustness of the training process to the presence of activation errors. In Ref. [28] it was found that matching error rate distributions between training and testing can improve classification accuracy. In contrast, here we find that even without errors during testing the classification accuracy is higher than in the error-free case when BER ≤ 16%. (We note however that in Ref. [28] the entire memory, SRAM, was operated at low voltage, while we only studied activation errors.)
B. Network-in-Network
To validate these findings in a more elaborate architecture combined with datasets of higher complexity, we studied the effect of activation errors in the binarized Network-in-Network (NiN) [15], a classic architecture that inspired the Inception Networks [25], using the CIFAR-10, CIFAR-100 and extended Street View House Numbers (SVHN) datasets. The binarized NiN comprises three stages, each stage having three convolution layers followed by a pooling layer (max-pooling, average-pooling and average-pooling for the three stages respectively). All convolution layers were binarized except the first and last ones, where ReLU activations were used. The activations of the binarized layers were subject to BER.
The CIFAR-10 dataset is of moderate complexity and contains 50,000 training and 10,000 test images in RGB with size 32 × 32 that belong to 10 classes. Figure 2a shows the test accuracy when using BERs between 0% and 8% for the binary activations. The results plotted are averages over 10 experiments and the standard deviation is also included. We find that for low BER values, up to BER = 2%, there is no noticeable degradation in test accuracy, but at BER = 4% there is a visible drop and at 8% the drop is very significant.
The CIFAR-100 dataset comprises 100 classes with 500 training and 100 test images per class in RGB with size 32 × 32. Due to the higher number of classes and smaller number of examples per class compared to the CIFAR-10 dataset we achieved lower test accuracy when training the binarized NiN on the CIFAR-100. The results (averages over 10 experiments) are shown in Figure 2b. Similarly to the two previous cases we observe an initial rise of the test accuracy combined with a drop below the maximum for BER = 4%. Additionally we note that optimal performance was reached for BER = 2% and that for higher BERs, e.g. BER = 8%, the standard deviation was visibly reduced.
The same NiN architecture was trained on the extended SVHN dataset, which contains 531,131 training and 26,032 test images, size 32 × 32, RGB, belonging to 10 classes, one for each digit. This is a more complex dataset than MNIST and it contains a much larger number of training images. The findings of this experiment are akin to those of the previous experiments and are displayed in Figure 2c. We find a slight improvement in test accuracy with increasing BER up to 4%, followed by a drop when further raising the BER.
FIG. 2. Test accuracy vs training epochs for a binarized Network-In-Network trained on (a) the CIFAR-10, (b) the CIFAR-100, and (c) the extended SVHN datasets. For the latter dataset we present the raw data from the experiment.
C. AlexNet
We now turn to the ImageNet Large-Scale Visual Recognition Challenge which contains a train set of more than 1.2M images and a test set of 60,000. This dataset includes 1000 categories of about 1000 images each, with size 224 × 224. We trained a binarized AlexNet architecture [13, 20] which incorporates 5 convolutional layers, the first of which is the only one that is not binarized. Max-pooling layers are used after the first, second and fifth convolutional layers. This implementation achieved a Top1 classification accuracy of 44.07%, virtually identical to the one reported in Ref. [20] .
In contrast to our experiments in sections II A-II B, there was no discernible increase in test accuracy when raising the binary activation BER up to 2% during training. At BER = 4% there was a noticeable drop, and beyond 4% the performance continued to degrade rapidly. Results of this training process are shown in Figure 3a.
In order to isolate the influence of the network architecture from the complexity of the dataset we selected a random 100-class subset of the 1000-class dataset and repeated the training for various BER values. Each experiment was run 10 times and the average along with the typical dispersion are shown in Figure 3b . The test accuracy exceeded 60% because there were fewer classes and therefore fewer semantic neighbors. Unlike the 1000-class dataset, the 100-class subset showed no significant decline in test accuracy up to BER = 8%. A second 100-class subset was randomly chosen (not shown) and the experiment qualitatively replicated the behavior seen in Figure 3b from the first 100-class subset.
In a subsequent experiment we used a randomly selected 10-class subset of ImageNet, further increasing the semantic distance between classes. In Figure 3c, the training process shows enhanced resilience to BER compared to the 100-class subset. No degradation in accuracy was seen up to BER = 16%. A second experiment (not shown) using a separate randomly selected 10-class subset of ImageNet replicated this behavior.
Finally we examined the extreme case of a 2-class subset of ImageNet. For each BER value we repeated the experiment 10 times and the mean along with the typical dispersion are shown in Figure 4a. We then randomly selected four additional 2-class subsets and followed the process described above to study the variability of the results. In Figure 4b we show the average Top1Max value for each of the five 2-class subsets and for each BER. Remarkably there was no degradation in test accuracy up to a BER of ∼32%, with 50% being the value that corresponds to complete randomness in the binary activations, at which point the test accuracy falls to ∼50%. Overall we observe increasing resilience of the training process to BER in the binary activations as we progressively reduce the number of classes in the system.
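The class-subset protocol used in this section (random 100-, 10- and 2-class subsets of the 1000 ImageNet classes) can be sketched as below. The seeds, the relabeling to contiguous indices and the helper names are assumptions made for illustration; the text does not specify how the subsets were drawn beyond their being random.

```python
import random


def sample_class_subset(num_classes, subset_size, seed):
    """Draw a reproducible random subset of class indices without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_classes), subset_size))


def filter_and_relabel(samples, kept_classes):
    """Keep (item, label) pairs whose label is in the subset; relabel to 0..k-1."""
    new_label = {c: i for i, c in enumerate(kept_classes)}
    return [(x, new_label[y]) for x, y in samples if y in new_label]


# Hypothetical usage: one subset per size, each with its own (arbitrary) seed.
subsets = {k: sample_class_subset(1000, k, seed=k) for k in (100, 10, 2)}

# Toy stand-in for a dataset: item names paired with original class indices.
toy_samples = [("img_%d" % i, i % 1000) for i in range(3000)]
two_class_data = filter_and_relabel(toy_samples, subsets[2])
```

Drawing additional subsets of the same size, as done for the second 100-class and 10-class subsets and the five 2-class subsets, amounts to repeating the call with a different seed.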
III. OPERATING MRAM AT MODERATE WRITE ERROR RATE
A. Stochastic errors in MRAM
In reading or writing an MRAM bit, i.e. an MTJ, there are upper and lower bounds to the voltage amplitude and pulse-length. Specifically:
• When reading, the voltage must be high enough and applied long enough to facilitate detection of the MTJ state but not so high/long that it would accidentally switch the MTJ.
• When writing, the voltage must be high enough and applied long enough to ensure the information is written correctly but not so high/long that it would excessively stress (or break) the MTJ.
In minimizing the error rates we must consider the trade-off between errors and MRAM properties such as speed, density and endurance. For example using long low-amplitude pulses widens the operation windows for both read and write at the cost of speed; increasing the MTJ device diameter makes the device more stable at the cost of lower memory density. The main categories of errors in MRAM bits are the following:
(a) Write errors occur at a low rate when the voltage amplitude is high and/or the pulse is long enough that the associated spin-polarized current has a high probability of switching the MTJ state. For small devices, where macrospin theory applies, we can determine the switching probability from the voltage pulse and the MTJ parameters using formulas (11)-(12) in Ref. [26].
(b) Breakdown occurs when the voltage amplitude is so high (or the pulse so long) that the MTJ thin insulator material is stressed excessively. Semi-empirical models [3, 12] have been developed to describe the device endurance, which is generally found to increase dramatically with the reduction of the voltage amplitude, e.g. using 20% lower write voltage we can raise the number of cycles ($N_c$) by up to 6 orders of magnitude [2].
(c) Retention errors occur when the MTJ is idle because of spontaneous thermal activation. Small-diameter and/or low-magnetic-anisotropy devices exhibit poor retention. We can calculate the retention error by applying the same formulas as for the write error with zero current. Alternatively we can use the Néel-Arrhenius model [19].
(d) Read errors occur when the voltage amplitude is not high enough (or the read pulse is not long enough) to allow the sense amplifier to detect the resistance state of the MTJ. These errors are not due to the inherent MTJ stochasticity.
(e) Read disturb errors occur when the read voltage is so high (or the read pulse so long) that there is a probability of accidentally switching the MTJ while attempting to read it. Read disturb is an inadvertent write and for small devices the read disturb error rate can be calculated with the same formulas as the write error rate.
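The Néel-Arrhenius estimate mentioned in item (c) can be sketched as follows, assuming the textbook form in which the mean time between thermally activated flips is τ₀·exp(Δ) and the probability of at least one flip during an idle time t is 1 − exp(−t/(τ₀·exp(Δ))). The attempt time τ₀ = 1 ns used as the default and the function name are assumptions for this illustration; the formulas of Ref. [26] are not reproduced here.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def retention_error(delta, idle_time_s, attempt_time_s=1e-9):
    """Probability that an idle bit flips thermally within idle_time_s.

    delta is the normalized energy barrier E_b / (k_B T).
    """
    expected_flips = (idle_time_s / attempt_time_s) * math.exp(-delta)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-expected_flips)


# Hypothetical sweep: one-year retention error for a few energy barriers.
one_year_errors = {d: retention_error(d, SECONDS_PER_YEAR) for d in (50, 60, 70)}
```

Because the barrier enters through exp(−Δ), each increase of Δ by about 2.3 lowers the retention error by roughly one order of magnitude as long as the error is small.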
The operation window for the read process is determined by (d)-(e) and for writing by (a)-(b). One of the key advantages of MRAM compared to other non-volatile memory technologies is its potential to achieve almost unlimited endurance because the number of MTJ write cycles increases rapidly as the ratio $V_{\text{write}}/V_{bd}$ decreases, where $V_{bd}$ is the "breakdown voltage", the value beyond which the MTJ breaks. On the other hand, the Write Error Rate (WER) of the device is a monotonically decreasing function of $V_{\text{write}}/V_{c0}$, where $V_{c0}$ is a characteristic "switching voltage", so that the ratio $V_{\text{write}}/V_{c0}$ must be large enough for the WER to attain a specified value. $V_{bd}$ and $V_{c0}$ are both functions of the MTJ parameters.
Special circuit techniques exist to reach WER 10 and endurance levels $N_c$ $10^{13}$, worthy of SRAM and DRAM replacement [5]. Alternatively, to attain an error rate suitable for applications ($10^{-15}$) the write voltage amplitude must be much higher than $V_{c0}$, or the pulse-length very long ($\tau_{\text{write}} \gg 1\,\mu\text{s}$). This is not practical and Error Correction Codes (ECCs) are employed to lower the WER to acceptable levels. Each additional bit of ECC reduces the error rate by 3-4 orders of magnitude but comes at the cost of speed and memory. We can express the conventional operation window for the write process in MRAM as:
Yet even with several bits of ECC it can be difficult to achieve sufficiently low WER and high endurance. Instead, we suggest that by operating at moderate WER for certain ANN applications we can dispense with ECC and at the same time reduce $V_{\text{write}}$ to raise $N_c$ by many orders of magnitude. We may express the error-resilient operation window for the write process in MRAM as:
Using the low-amplitude voltage values suggested by Eq. (5) can boost the endurance, speed and energy efficiency of MRAM.
B. An example of operating at moderate WER
To demonstrate the benefit of operating at moderate WER we present an example using the formulas from Ref. [26]. We set the normalized energy barrier ∆ = 40, i.e. an approximate 1-year retention error of exp −[(1 year)/(1 ns)] exp(−∆) 6 × $10^{-6}$, the characteristic switching time $\tau_D = 2$ ns, the switching voltage $V_{c0} = 0.3$ V, and the breakdown voltage $V_{bd} = 1.2$ V. In Figure 5 we plot the voltage pulse amplitude and duration required for certain WER targets.
TABLE I. The second, third and fourth rows correspond to the red, green and blue circles in Figure 5 and they show the difference in write voltage between error-free memory ([WER = $10^{-6}$] + ECC ⇒ WER < $10^{-15}$) and error-resilient designs at fixed pulse-length. The associated endurance gain in terms of number of cycles is estimated to exceed $10^{10}$. The bottom two rows show the benefits from error-resilient designs at a 20% reduced voltage (green/blue diamonds in Figure 5).
At a fixed pulse-length, relaxing the WER target significantly reduces the write voltage: at $\tau = 5\tau_D = 10$ ns the voltage ($V_{\text{WER}=10^{-6}} = 2.78\,V_{c0}$, red circle) drops by 36% when the WER target increases from $10^{-6}$ to $10^{-2}$ (green circle) and by 46% when WER = 0.1 (blue circle). Such a reduction in voltage amplitude enables a virtually unlimited number of cycles ($N_c$). Therefore, we can trade back some of the endurance gain for shorter pulse-length. For a constant 20% reduction in voltage, i.e. $0.8\,V_{\text{WER}=10^{-6}}$, we can calculate the pulse-length required from the WER = 0.01 and WER = 0.1 curves. This yields a 32% and 50% reduction in pulse-length for WER = 0.01 (green diamond) and WER = 0.1 (blue diamond) respectively, along with the 20% reduction in voltage amplitude. The comparison against WER = $10^{-6}$ assumes that a standard MRAM product would employ ECC to lower the WER from $10^{-6}$ down to $10^{-15}$. For the proposed error-resilient operation window no ECC will be used. We summarize these results in Table I.
The improvement in energy efficiency when relaxing the WER target can be estimated from the reduction in voltage amplitude and pulse-length seen in Table I . At higher speed, i.e. lower τ , the energy savings from relaxing the WER target increase as the WER slopes in Figure 5 become steeper. This is particularly relevant if MRAM is to compete with and complement fast on-chip SRAM.
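A rough version of this estimate can be sketched under the simplifying assumption that the write energy scales as V²τ, i.e. that the MTJ behaves as a fixed resistance during the pulse. This scaling is an assumption of the sketch, not a result from the text (the MTJ resistance depends on bias and state), and the function name is illustrative. The voltage and pulse-length reductions are the ones quoted above.

```python
def relative_write_energy(voltage_reduction, pulse_reduction=0.0):
    """Write energy relative to the baseline, assuming E ~ V^2 * tau."""
    v = 1.0 - voltage_reduction
    t = 1.0 - pulse_reduction
    return v * v * t


# Baseline: WER = 1e-6 at a 10 ns pulse (ECC overhead not included).
cases = {
    "WER=0.01, fixed pulse": relative_write_energy(0.36),
    "WER=0.1, fixed pulse": relative_write_energy(0.46),
    "WER=0.01, 20% lower V": relative_write_energy(0.20, 0.32),
    "WER=0.1, 20% lower V": relative_write_energy(0.20, 0.50),
}
savings = {name: 1.0 - e for name, e in cases.items()}
```

Under this assumption the four operating points use roughly 41%, 29%, 44% and 32% of the baseline write energy, respectively, before accounting for any savings from removing ECC.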
IV. CONCLUSIONS
Stochasticity is linked in a fundamental and yet not fully understood way to neural networks. At the same time it is an inherent property of MRAM that has hampered it for more than a decade. The convergence between these two technologies presents a unique opportunity for research and for improving the performance of many ANN applications.
To demonstrate this we studied the resilience of three binarized ConvNet architectures to errors in the binary activations during the training process. Several image datasets were examined and the degree of resilience varied significantly across the datasets and the network architectures. For the binarized LeNet-4 and NiN architectures trained on small- and moderate-complexity datasets we found a modest improvement of the error-free test accuracy when the networks were trained with BER of a few percent. The test accuracy gradually dropped when the BER was raised beyond a few percent. For the binarized AlexNet trained on the 1000-class ImageNet dataset we observed a slight degradation in the test accuracy for BER up to 2% followed by a precipitous drop for BER > 4%. However, when using subsets of ImageNet with a reduced number of classes, we observed increased error tolerance of the training process. This suggests that the semantic distance between classes is critical in determining the degree of error resilience. The depth and complexity of the network, as well as the number of training images, had no clearly identifiable effect on error resilience. Remarkably, for 2-class subsets of ImageNet, the binarized AlexNet architecture showed no degradation in test accuracy when the network was trained with BER up to 32%, with BER = 50% corresponding to completely random activations.
For MRAM, relaxing the WER targets enables massive improvement in endurance, along with substantially higher speed and energy efficiency. We concentrated the discussion on relaxing the WER because high MRAM endurance is necessary for training. For inference applications we can exploit read, read-disturb and retention errors to improve memory performance, especially for the weights, by increasing memory density and speed.
A more extensive study will include bit errors elsewhere in the system, most notably in the weights, and will allow different error rates for each type of variable. 
