Abstract: Resistive memories enable dramatic energy reductions for neural algorithms. We propose a general purpose neural architecture that can accelerate many different algorithms, and we determine the device properties needed to run backpropagation on this architecture. To maintain high accuracy, the read noise standard deviation should be less than 5% of the weight range. The write noise standard deviation should be less than 0.4% of the weight range, although it can be as large as 300% of a characteristic update (for the datasets tested). Asymmetric nonlinearities in the change in conductance vs pulse cause weight decay and significantly reduce the accuracy, while moderate symmetric nonlinearities do not have an effect. To allow for parallel reads and writes, the write current should also be less than 100 nA.
I. INTRODUCTION
Resistive memory crossbars can dramatically reduce the energy required to perform computations in neural algorithms by at least six orders of magnitude when compared to a conventional CPU [2]. For data intensive applications, the computational energy is dominated by moving data between the processor, SRAM (static random access memory), and DRAM (dynamic random access memory) [3]. New approaches based on memristor or resistive memory [4] [5] [6] [7] crossbars can enable the processing of large amounts of data by significantly reducing data movement, taking advantage of analog operations [8] [9] [10] [11] [12], and fitting more memory on a single chip.
Resistive memories are essentially programmable two terminal resistors. If a write voltage is applied to the device, the resistance will increase or decrease based on the sign of the voltage, allowing the resistance to be programmed. At lower voltages, the state does not change. Consequently, these devices can be used to model a neural synapse wherein the resistance acts like a weight that modulates the voltage applied to it. This has resulted in a large interest in developing neuromorphic systems based on such devices [8] [9] [10] [11] . Ideally, the resistive memories would have a perfectly linear and controllable response allowing them to be programmed to any arbitrary analog value. Unfortunately, realistic devices have three key non-idealities: 1) read noise which causes the value read from the resistive memory to be different than the true value, 2) write noise which causes the value written to be different from the intended value, and 3) write nonlinearities which means that the change in conductance due to a write pulse will be different depending on the device's current state. This is true for most resistive memory devices, including metal oxide ReRAM, CBRAM, PCM and others. In this paper, we model the resilience of neural algorithms based on machine learning in the presence of device based noise and variability.
The key contribution of this paper is to determine an acceptable range of device operating parameters in order to guide the development of new resistive memory device technologies. Consequently, we only focus on the three aforementioned non-idealities introduced by individual resistive memory devices themselves, excluding other nonidealities that would be present in a full system. Other studies have examined some device non-idealities, but none have systematically analyzed all three effects [13] [14] [15] . Analyzing all the non-idealities in a full neuromorphic system will be the subject of future work.
In the next sections, we describe a general purpose neural architecture that can be used to accelerate many neural algorithms. Backpropagation was chosen for this study as it is a computation-intensive algorithm that underlies the successful application of neural networks in software and hardware, and has been thoroughly benchmarked in previous work [16, 17]. We use a numerical simulation, written in Python, to model how backpropagation performance is impacted by the three hardware non-idealities for different data sets, and determine device properties required to maintain high learning accuracy.
Fig. 1 (caption, partial): This allows all the read operations, multiplication operations and sum operations to occur in a single step. A conventional architecture must perform these operations sequentially for each weight, resulting in a higher energy and delay. A matrix-vector multiply can also be performed by driving the columns and reading the currents on the rows. (b) A parallel write is illustrated. Weight W_ij is updated by x_i × y_j. In order to achieve a multiplicative effect the x_i are encoded in time while the y_j are encoded in the height of a voltage pulse. The resistive memory will only train when x_i is nonzero. The height of y_j determines the strength of training when x_i is nonzero. The column inputs y_j can also be encoded in time as in [...]
II. GENERAL PURPOSE NEURAL ARCHITECTURE
The key design considerations for an effective neural algorithm accelerator are that it should reduce the computation energy by orders of magnitude, and that it should be flexible enough to run many different neural algorithms. In [12] it is shown that a resistive memory crossbar can accelerate two key operations: 1) a parallel read, or vector matrix multiply, and 2) a parallel write, or rank one outer product update, as illustrated in Fig. 1. Many neural algorithms such as sparse coding, restricted Boltzmann machines, and backpropagation rely heavily on these two operations. These algorithms differ in how the inputs and outputs of a crossbar are processed. An N×N crossbar accelerates O(N^2) operations, while it has only O(N) inputs or outputs. This means that the energy to process an input or output can be O(N) times larger than the energy to read or write a single resistive memory element without significantly increasing the system energy. This key insight allows us to optimize the tradeoff between energy efficiency and system flexibility. A crossbar based neural core should be used to perform the parallel vector matrix multiply and outer product update, while a more general purpose digital core can be used to process the inputs and outputs of the crossbar. This is illustrated in Fig. 2. The neural core is illustrated in Fig. 3.
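The two crossbar operations described above can be sketched numerically as follows. This is a minimal illustration in NumPy, not the authors' simulator; the function names, the array sizes, and the learning rate argument are assumptions made for the example.

```python
import numpy as np

def parallel_read(weights, x):
    """Vector-matrix multiply: every weight is read, multiplied and summed in one step."""
    return x @ weights

def parallel_write(weights, x, y, learning_rate=0.1):
    """Rank-one outer-product update: W_ij changes by learning_rate * x_i * y_j."""
    return weights + learning_rate * np.outer(x, y)

rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(4, 3))   # 4 row inputs, 3 column outputs (illustrative size)
x = np.array([1.0, 0.0, 0.5, -1.0])       # row inputs
y = np.array([0.2, -0.1, 0.0])            # column inputs

out = parallel_read(W, x)                 # N*N multiply-accumulates, only N outputs
W_new = parallel_write(W, x, y)           # rows with x_i == 0 are left unchanged
```

Both operations touch all N×N weights while exchanging only O(N) values with the rest of the system, which is the property the architecture relies on.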
To represent both positive and negative matrix values with a resistive device, a reference weight is subtracted in analog following [18]. Alternatively, it is possible to take the difference between two resistive memory elements, but this requires twice as many devices [13]. The inputs and outputs to the neural cores are digital, so analog to digital (A/D) and digital to analog (D/A) converters will be needed. These converters are energetically expensive, but they perform only O(N) operations, so their energy can be of the same order of magnitude as the energy needed to drive the crossbar [12]. If desired, additional energy efficiency can be traded off against algorithmic flexibility by performing analog neuron operations in the neural core, as in [10]. The implementation of backpropagation on the general purpose neural architecture is illustrated in Fig. 4.
Data communication between cores should be built on an address event representation (AER) based spiking communication model [19, 20]. This ensures that data is only sent when needed and allows a shared communication bus to be used. Using a routing network also allows arbitrary connections between the cores. We note the brain has an extremely dense connectivity where individual neurons in the cerebral cortex can receive roughly 10,000 input synapses from other neurons [21]. To achieve this, the brain takes full advantage of its 3D structure to minimize connection lengths. Currently, high performance CMOS is 2D or at most 2.5D and so it is impossible to hardwire the same number of connections. Consequently, a shared bus is needed to emulate the same connection density.
Overall, to obtain maximum energy efficiency, this type of a system assumes that a neural network has dense local connections that can be mapped to a crossbar and fewer global connections that need to be sent over the routing network. Computation is localized to the maximum extent possible, which minimizes the amount of high energy cost, longer range communications that are required. A dense local and sparse global connectivity is similar to how the brain is organized. If this is not the case for a given algorithm, a single column of a crossbar can be used in a specific read or write step to allow for maximum flexibility at higher energy cost.
III. IDEAL WEIGHT RANGES AND LEARNING RATE
To efficiently map a backpropagation network to hardware several algorithmic parameters need to be set. For a given dataset, we need the learning rate, number of epochs to train for, random weight initialization, sigmoid slope, and network size. Physical resistive memories also have a min and a max conductance. This means that the network has a min and max weight it can store. Choosing the weight range correctly is important to maximize the usage of the resistive memory's dynamic range.
In this paper we analyze three different data sets as summarized in Table I. First we consider a small image version (8x8 pixels) of handwritten digits from the "Optical Recognition of Handwritten Digits" dataset from [22]. Next we use MNIST, a large image version (28x28 pixels) of handwritten digits [23]. Finally, we use a Sandia file classification dataset from [24] (256 byte-pair statistical attributes to classify 9 file types). We use a simple two-layer network (one hidden layer) for each dataset with the network size indicated in Table I. For simplicity, we consider a single network configuration; more generally there may be an optimizable tradeoff between network size and noise.
We arbitrarily chose to use sigmoid neurons with a unity slope. Through the learning process the weights will be scaled up or down to match the sigmoid slope. The initial random weight range and learning rate need to be correctly chosen to enable optimal training. Following [25] , the weights should be initialized to a uniform distribution of Uniform [-r , r] where:
The fan-in is the number of inputs to a layer and fan-out is the number of outputs (For the first layer of MNIST the fan-in=784 and fan-out=300 from Table I ).
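The uniform initialization can be sketched as below. The expression for r is not reproduced in this text, so the sketch assumes the commonly used bound r = sqrt(6 / (fan-in + fan-out)); the expression used in the paper may differ, for example by a constant factor for sigmoid units, which is why a hypothetical `scale` argument is exposed. The function name is also an assumption.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng, scale=1.0):
    """Draw weights from Uniform[-r, r].

    ASSUMPTION: r = scale * sqrt(6 / (fan_in + fan_out)); the paper's exact
    expression for r is not available here and may use a different prefactor.
    """
    r = scale * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out)), r

rng = np.random.default_rng(0)
# First MNIST layer from Table I: fan-in = 784, fan-out = 300.
W1, r1 = init_weights(784, 300, rng)
print(f"r = {r1:.4f}")
```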
This sets an initial scale for the weights. Next we determine the learning rate and number of epochs needed to train. These could be dataset dependent, so we test several different learning rates for each dataset as illustrated in Fig. 5. (All the training plots in this paper are based on accuracy for the test data set, after training on the training data set.) For all three datasets, 100 epochs and a learning rate of 0.1 work well. These values are used in the rest of the paper.
Next, we determine how best to map the weights to the device's conductance state. To do this we need to understand what range the weights ideally take in a noise-free network; the statistics of the unclipped weights are given in Table II. Since the physical devices have a limited conductance range, a limited algorithmic weight range is required. We want the smallest possible range that will not decrease the accuracy. A smaller weight range allows more of the resistive memory's dynamic range to be used, minimizing the impact of noise. The impact of training a neural network with a limited weight range is plotted in Fig. 7, where the clipping range for each layer is normalized to the standard deviation of the unclipped weights. This plot indicates that the weights can be clipped to a normalized range of around 1.5 without losing significant final accuracy. As seen in Fig. 8, clipping the weights causes the larger weights to saturate at the limits. The more aggressive clip range of 1.5 allows some weights to saturate, and maximizes the use of the numerical dynamic range. This minimizes the impact of noise caused by a real device and maximizes the information stored in a particular device. When the ideal weight range is not known a priori, a larger range may be used, at the cost of increased impact from noise.
With a reasonable algorithmic weight range, we can map the ranges from Table II to physical conductance. Consider a normalized conductance scale where the maximum conductance is 1. The minimum normalized conductance will be given by 1/(on-off ratio) = G_OFF/G_ON. If G_ON/G_OFF = 10, the minimum normalized conductance will be 0.1. This normalized 0.1 to 1 range needs to be mapped to the weight range required by the algorithm. Physically, a fixed bias of 0.55 (the midpoint of the range) is subtracted from each weight as illustrated in Fig. 3. This gives weights in the range of [-0.45, 0.45]. Next, this is scaled up or down to match the algorithmic weight scale. In the digital cores in Fig. 2, the digitized results from the neural core can be multiplied by an appropriate scale factor. In the rest of this paper, when possible, we define the device models relative to the total weight range so that all the results are independent of the scaling. For example, we define noise parameters as a percentage of the total weight range.
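The mapping between algorithmic weights and normalized conductance can be sketched as follows. The function names and the symmetric weight limit `w_max` are illustrative assumptions; only the 0.1 to 1 conductance range and the 0.55 bias come from the text.

```python
import numpy as np

G_MAX = 1.0                        # normalized maximum conductance
ON_OFF_RATIO = 10.0                # G_ON / G_OFF
G_MIN = G_MAX / ON_OFF_RATIO       # 0.1
G_BIAS = 0.5 * (G_MAX + G_MIN)     # 0.55, the reference subtracted in analog

def weight_to_conductance(w, w_max):
    """Clip weights to [-w_max, w_max] and map them linearly onto [G_MIN, G_MAX]."""
    scale = (G_MAX - G_BIAS) / w_max
    return np.clip(w, -w_max, w_max) * scale + G_BIAS

def conductance_to_weight(g, w_max):
    """Subtract the bias, then rescale (the rescaling can be done in the digital core)."""
    return (g - G_BIAS) * (w_max / (G_MAX - G_BIAS))

w = np.array([-2.0, 0.0, 2.0, 3.0])          # the last value exceeds w_max and saturates
g = weight_to_conductance(w, w_max=2.0)
w_back = conductance_to_weight(g, w_max=2.0)
```

Because the scale factor is applied digitally, noise parameters expressed as a fraction of the weight range carry over unchanged to fractions of the conductance range.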
IV. READ NOISE
When reading a resistive memory there are three kinds of read noise that can change the current: thermal noise, 1/f noise and random telegraph noise (RTN) [26] [27] [28] [29] [30] [31] [32]. RTN, the noise from a single trap filling/emptying, is typically the dominant form of noise. It can depend on a few particular traps causing the current to oscillate between two states. Nevertheless, between write cycles, the distribution of the relevant traps will change and the conductance change due to those traps follows a Gaussian [26]. Consequently, we consider two models: 1) an RTN model where the conductance randomly toggles between one of two states, and 2) a simple Gaussian noise distribution. The Gaussian model is a more generic model that can also approximate the effects of thermal noise and 1/f noise. For both models, the impact of the noise, averaged over an entire read cycle, can simply be modelled as a perturbation to the device's conductance during each read. For the RTN model with a standard deviation, σ, we randomly add +σ or -σ to the conductance, where σ is defined relative to the total conductance range:
$$G = G_o \pm \sigma, \qquad \sigma = \sigma_{RN}\, G_{range} \tag{2}$$
where G is the conductance after applying read noise, G_o is the actual conductance stored in the device, σ is the standard deviation, and σ_RN is the dimensionless standard deviation of the noise normalized to the range of the conductance, G_range.
For the Gaussian model, the noise is defined by:
$$G = G_o + N(\sigma), \qquad \sigma = \sigma_{RN}\, G_{range} \tag{3}$$
where N(σ) is a normal distribution with zero mean and standard deviation σ.
Fig. 7: The final accuracy after 100 epochs vs the normalized clipping range for the three different data sets. The clipping range for each layer is normalized to the standard deviation of the unclipped weights given in Table II.
Fig. 8 (axes): Normalized Weights; Weight Probability Density.
The classification accuracy of the two models is compared in Fig. 9. The two models give nearly identical results. This is because the noise from many resistive memories is added together during a vector matrix multiply. By the central limit theorem, the noise sources will add together to form identical Gaussian distributions so long as the variance of the noise is the same. Consequently, for the following simulations it is sufficient to model RTN with Eq. (3), the Gaussian noise distribution.
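The central limit argument can be checked with a small Monte Carlo experiment. The sketch below is an illustration, not the authors' simulation code: it perturbs one crossbar column with two-state (RTN-like) noise and with Gaussian noise of the same variance, then compares the spread of the summed output. The column size, input distribution, and trial count are arbitrary assumptions.

```python
import numpy as np

def read_noise_rtn(g, sigma_rn, g_range, rng):
    """Two-state model: add +sigma or -sigma with equal probability."""
    signs = rng.choice([-1.0, 1.0], size=g.shape)
    return g + signs * sigma_rn * g_range

def read_noise_gaussian(g, sigma_rn, g_range, rng):
    """Gaussian model with the same standard deviation."""
    return g + rng.normal(0.0, sigma_rn * g_range, size=g.shape)

rng = np.random.default_rng(0)
n_devices, n_trials = 256, 5000
g = rng.uniform(0.1, 1.0, size=n_devices)     # stored conductances of one column
x = rng.uniform(0.0, 1.0, size=n_devices)     # input vector
ideal = x @ g

g_tiled = np.tile(g, (n_trials, 1))           # one noisy read per trial
err_rtn = read_noise_rtn(g_tiled, 0.05, 0.9, rng) @ x - ideal
err_gauss = read_noise_gaussian(g_tiled, 0.05, 0.9, rng) @ x - ideal

print("RTN output std:     ", err_rtn.std())
print("Gaussian output std:", err_gauss.std())
```

The per-device distributions are very different, but the summed output errors have matching spread, which is why a single Gaussian model suffices once the variance is calibrated.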
In some situations σ can be a function of the device's current state, G_o. For instance in [26], σ is a constant for currents >1 μA and then decreases as the current decreases below 1 μA. Therefore, we also consider a proportional Gaussian model where σ is also directly proportional to the current state, G_o:
This model includes a normalization constant. If we choose the normalization constant so that both models have the same variance, they give nearly identical results, as illustrated in Figs. 10 & 11. This means that Eq. (3) is sufficient. In general, we propose Eq. (3) can be used to approximate any read noise distribution if the standard deviation in Eq. (3) is calibrated to give the same variance as a more complicated noise model.
To find the normalization constant we can approximate the clipped weight distribution as a uniform distribution over the weight range, and compare the variance between Eq. (3) and (4):
Solving for the normalization constant gives:
As seen from Fig. 8, a uniform weight distribution is a reasonable approximation to first order. Using G_max = 1 and G_min = 0.1 gives a normalization constant of 1.8.
In Fig. 10, we show how the read noise affects classification accuracy using a set of pre-trained weights (noise was not present during training). The accuracy starts to drop off after σ_RN = 5% of the weight range. In Fig. 11(a), we show a similar accuracy if we also apply the read noise during training. Now each data point corresponds to the final test set accuracy after a full training run of 100 epochs. In this case, the noise causes a significant variability in the final accuracy between runs which vary only in the initial random number seed. Training a network three times and taking the best result reduces the variability as shown in Fig. 11(b). Using a fixed random number seed for both the weight initialization and the noise also eliminates the variability.
V. WRITE NOISE
Write noise occurs every time the state of a resistive memory changes [13, 33, 34]. The relative magnitude of this write noise is typically greater than that of read noise. Writing a resistive memory involves moving atoms around, which is inherently a stochastic process. The exact nature of the noise will be strongly dependent on the type of resistive memory and whether it is filamentary or non-filamentary. In general, the statistics of this type of noise are not well characterized. One of the most relevant measurements is reported in [13], where a write pulse is applied to a phase change memory at a given conductance and the change in conductance, ΔG, is measured multiple times to obtain a distribution. This work indicates that the noise increases both as ΔG increases and as the initial state, G_o, increases.
Since write noise is not well characterized experimentally, we consider the impact of two different models. First consider a write noise that is independent of the intended state change, ΔG. Since the noise is independent of ΔG, long and short write pulses will have the same noise distribution. Consequently, noise of this type would still have the same distribution after multiple write pulses. This means that the noise only affects the value that is read, and not the internal state. If the internal state had changed, the noise would compound at each step and would depend on ΔG. Effectively, this can be modelled as a read noise and a separate write noise model is not needed. Since backpropagation alternates read and write cycles and the noise is summed over many devices during a vector matrix multiply, the read noise model in the previous section is sufficient to model this type of noise.
Next, consider noise that depends on the size of the update. Here, we make an important simplifying assumption that if we use one pulse or two pulses to change the conductance from one state to another, the noise will be the same as long as the average initial and final states are the same. This implies that one longer or higher voltage pulse will cause the same physical change as two sequential pulses that end at the same average conductance. Implicitly, we are assuming that the resistive memory can be modeled with a single internal state variable so that it does not matter how we end up at a given conductance state, which is typical for these devices [35] [36] [37] . Certain devices under particular operating conditions require two internal state variables to be modelled; this is beyond the scope of this work [38] .
Since the noise is the same for a given ΔG regardless of the number of pulses required to reach that ΔG, ΔG must be proportional to the variance of the noise, σ^2. After multiple pulses, the variance of the noise in each pulse is additive. Therefore, σ ∝ sqrt(ΔG). We also assume that the write noise follows a Gaussian distribution, as it is the result of the collective motion of many atoms, so that the state after an update is given by:
$$G = G_o + \Delta G + N(\sigma) \tag{7}$$
As with the read noise, it is possible that σ depends on the initial state, G_o, and/or the final state. For simplicity, we consider small updates, so that only the initial-state (or only the final-state) dependence needs to be considered. To understand the impact of the initial state dependence, we consider three possible models. First, σ is independent of G_o:
$$\sigma = \sqrt{\Delta G \, G_{range}}\; \sigma_{WN} \tag{8}$$
Here we add the conductance range, G_range, so that a dimensionless standard deviation, σ_WN, can be defined. Next we consider a σ that is proportional to G_o:
As with the read noise, the normalization constant is chosen so that both models have the same variance. Therefore, the normalization constant is given by Eq. (6) and is 1.8 for a uniform weight distribution with G_max = 1 and G_min = 0.1. Lastly, we consider σ inversely proportional to G_o:
Once again the normalization constant is chosen so that all models have the same variance. For a uniform weight distribution with G_max = 1 and G_min = 0.1, the normalization constant is 0.35.
The effect of the three different models of write noise is plotted in Fig. 12. Again the final accuracy after training for 100 epochs is plotted. We see that the particular G_o dependence does not have a significant effect on the write noise for the small and large images, but matters somewhat for the cyber dataset. To first order, the simplest model, independent of G_o (Eq. 8), gives a reasonable intuition of how the system responds when the exact G_o dependence is not known. The key is that the noise model should have the correct noise variance.
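The state-independent write noise model can be sketched as below. It assumes that the standard deviation scales as σ_WN times the square root of |ΔG| × G_range, which is the form implied by the requirement that noise variance adds over pulses; the function name and the numerical values are illustrative assumptions. The check at the end confirms that one pulse of size ΔG and two pulses of size ΔG/2 leave the same noise on the state.

```python
import numpy as np

def noisy_write(g, delta_g, sigma_wn, g_range, rng):
    """Apply an update delta_g with Gaussian write noise.

    ASSUMED form: sigma = sigma_wn * sqrt(|delta_g| * g_range), independent of g.
    Clipping to the device conductance limits is omitted for brevity.
    """
    sigma = sigma_wn * np.sqrt(np.abs(delta_g) * g_range)
    return g + delta_g + rng.normal(0.0, 1.0, size=np.shape(g)) * sigma

rng = np.random.default_rng(0)
g0 = np.full(100_000, 0.5)            # many identical devices, to estimate statistics

one_pulse = noisy_write(g0, 0.01, 0.1, 0.9, rng)
half = noisy_write(g0, 0.005, 0.1, 0.9, rng)
two_pulses = noisy_write(half, 0.005, 0.1, 0.9, rng)

print("std after one pulse: ", one_pulse.std())
print("std after two pulses:", two_pulses.std())
```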
In order to compare the magnitude of updates with the corresponding write noise, we plot the probability density of the updates without noise and the corresponding noise sigma in Fig. 13 for small images. We choose a safe noise sigma, σ_WN, of 0.1 that does not affect the accuracy, as shown in Fig. 12. (Only the second layer of the network is shown for simplicity, but results are nearly identical for the first layer.) The updates are on the order of 0.01% to 0.5% of the weight range. We can define a characteristic update size as a weighted average of the update size, weighted by the update size itself. This captures the fact that it requires ten 0.1% updates to train as much as a single 1% update. The characteristic update sizes are 0.07%, 0.18%, and 0.17% for the small image, large image and file type datasets respectively. For the characteristic update and σ_WN = 0.1, the noise is 0.3% to 0.4% of the total range. Surprisingly, this safe noise level is 2.4X to 3.8X larger than the characteristic update itself! For smaller updates, the noise can be more than 20X the size of the update, indicating that smaller updates likely do not contribute as much to the overall learning.
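The arithmetic behind these figures can be reproduced with a few lines. The sketch assumes the square-root noise scaling discussed above, with all quantities expressed as fractions of the total range, and interprets the self-weighted average as sum(u^2)/sum(u); the function names are illustrative assumptions.

```python
import numpy as np

def characteristic_update(updates):
    """Average update size, weighted by the update size itself: sum(u^2) / sum(u)."""
    u = np.abs(np.asarray(updates, dtype=float))
    return np.sum(u * u) / np.sum(u)

def noise_for_update(update, sigma_wn=0.1):
    """Write noise sigma as a fraction of the range (ASSUMED form: sigma_wn * sqrt(update))."""
    return sigma_wn * np.sqrt(update)

# Characteristic update sizes quoted in the text, as fractions of the weight range.
datasets = {"small image": 0.0007, "large image": 0.0018, "file type": 0.0017}
ratios = {}
for name, upd in datasets.items():
    sigma = noise_for_update(upd)
    ratios[name] = sigma / upd
    print(f"{name}: noise = {100 * sigma:.2f}% of range, {ratios[name]:.1f}x the update")

# Ten 0.1% updates plus one 1% update: the larger update dominates the weighted average.
mixed = characteristic_update([0.001] * 10 + [0.01])
```

Under these assumptions the noise comes out at roughly 0.3% to 0.4% of the range and about 2.4 to 3.8 times the characteristic update, consistent with the values quoted above.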
VI. WRITE NONLINEARITY

A. Asymmetric Nonlinearity
In addition to write noise between cycles, the physics of resistance change in resistive memories typically causes the conductance change to depend on the resistive memory's current state [14] . Often this nonlinearity is asymmetric with regard to the direction of the pulse. For example, near the maximum conductance a given pulse will not significantly increase the conductance, but it can significantly decrease the conductance. This is particularly true for filamentary devices, due to a thermal-runaway effect. In order to maximize efficiency, a parallel open loop write scheme must be used, and therefore we do not know each individual resistive memory's current state between training examples. This means that the same sized pulse must be applied regardless of the device's state and the nonlinear response thus introduces an additional "error" in the write. Following [14] , the conductance, G, as a function of the normalized pulse number, p, for increasing pulses is modeled by:
G_min is the minimum conductance, G_max is the maximum conductance, and the remaining parameter characterizes the nonlinearity. When the nonlinearity parameter is 0, the response is perfectly linear. Experimental devices have been demonstrated with nonlinearity parameters of 2 to 5 [14]. If we have a target update, ΔG_target, using Eq. (11) we can solve for the actual update:
where the normalized target update is given by:
For decreasing pulses, the conductance is given by Eq. (15), and the actual update given the target update is:
The asymmetric nonlinearity models, Eqs. (11) and (15), are plotted in Fig. 14. A strong asymmetric nonlinearity causes the conductance to decay towards a center value after alternating pulses, as illustrated in Fig. 15. A small amount of weight decay can be beneficial to prevent overfitting, but typically the decay will be too large, degrading the ability of devices to "learn" the weights needed for the backpropagation algorithm and reducing the final accuracy. This is illustrated in Fig. 16, where the accuracy (after 100 epochs) vs nonlinearity is plotted for the three different data sets.
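The decay towards a center value can be illustrated with a small simulation. The paper's Eqs. (11) and (15) are not reproduced in this text, so the sketch below assumes a generic exponentially saturating response with a hypothetical nonlinearity parameter `nu`; it is meant only to show the mechanism, not the exact model or parameter scale used in the paper.

```python
import numpy as np

G_MIN, G_MAX = 0.1, 1.0
G_RANGE = G_MAX - G_MIN

# ASSUMED model: exponential saturation in the normalized pulse number p in [0, 1].
def g_up(p, nu):
    return G_MIN + G_RANGE * (1 - np.exp(-nu * p)) / (1 - np.exp(-nu))

def p_up(g, nu):      # inverse of g_up
    return -np.log(1 - (g - G_MIN) / G_RANGE * (1 - np.exp(-nu))) / nu

def g_down(p, nu):    # mirror image of g_up, followed when decreasing
    return G_MAX - G_RANGE * (1 - np.exp(-nu * (1 - p))) / (1 - np.exp(-nu))

def p_down(g, nu):    # inverse of g_down
    return 1 + np.log(1 - (G_MAX - g) / G_RANGE * (1 - np.exp(-nu))) / nu

def apply_pulse(g, dp, nu):
    """Open-loop write: the same normalized pulse dp is applied whatever the state."""
    if dp >= 0:
        return g_up(min(p_up(g, nu) + dp, 1.0), nu)
    return g_down(max(p_down(g, nu) + dp, 0.0), nu)

def alternate(g, nu, n_pairs=200, dp=0.01):
    """Apply identical +dp / -dp pulse pairs; an ideal device would not move."""
    for _ in range(n_pairs):
        g = apply_pulse(g, +dp, nu)
        g = apply_pulse(g, -dp, nu)
    return g

final_nonlinear = alternate(0.9, nu=5.0)      # strongly nonlinear device
final_linear = alternate(0.9, nu=0.001)       # nearly linear device
print(final_nonlinear, final_linear)
```

With a strong nonlinearity the state drifts from 0.9 to near the center of the range, while the nearly linear device stays close to its starting value.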
B. Symmetric Nonlinearity
Fig. 14: Asymmetric write nonlinearity is illustrated. As the nonlinearity increases, the amount written depends strongly on the current state. The conductance rapidly changes at low conductance and then saturates at higher conductance for positive pulses; the converse occurs for negative pulses. The x axis is normalized by the number of pulses needed to go from the minimum to the maximum conductance.
Fig. 15: Applying identical alternating positive and negative pulses causes the weight to decay towards a center value when it should remain constant. When the weight is near the maximum, a positive pulse does not change the weight much, but a negative pulse significantly decreases it. The opposite holds for weights near the minimum weight.
Fig. 16: The impact of the asymmetric write nonlinearity on learning is illustrated for all three datasets.
Some devices exhibit a symmetric nonlinearity. This is demonstrated in [39] for a Ag/GeSe/Pt CBRAM cell, and the data is replotted on a conductance axis in Fig. 17. Resistive memories that have a non-filamentary switching mechanism are also expected to have a symmetric switching response, although this has not yet been explicitly demonstrated in the literature. The symmetry is more likely because the conductance modulation depends on the motion of many atoms, rather than a few critical atoms in a filament. To understand the impact of a symmetric nonlinearity, we consider a simple sigmoid based model, illustrated in Fig. 18. We assume that after a sufficient number of pulses the device will saturate at a maximum or minimum conductance. (The pulsing measurement in Fig. 17 was likely stopped before the conductance started to saturate.) The conductance, G, as a function of the normalized pulse number, p, is given by a sigmoid in which G_min is the minimum conductance, G_max is the maximum conductance, and a single parameter characterizes the nonlinearity.
The model is parameterized such that the symmetric and asymmetric models have the same slope at the center conductance, (G_min + G_max)/2. The actual update given the target update is:
where the normalized target update is defined by Eq. (14). A symmetric nonlinearity model does not suffer from the same weight decay problem as the asymmetric nonlinearity. Consequently, a much larger nonlinearity can be tolerated without decreasing the accuracy, as illustrated in Fig. 19 as compared to Fig. 16.
VII. COMBINED NON-IDEALITIES
Finally, we compare the impact of all the non-idealities operating at the same time. In Fig. 20 we show the effect of read and write noise with different nonlinearities for the small images. Figs. 21 and 22 show the same for large images and file types respectively. Each colored "pixel" in each sub-figure represents a final accuracy after training for 100 epochs. The largest data set (MNIST) required 2-3 days of CPU time on a single core to train with all three non-idealities enabled for a single set of parameters. Because pixels represent independent runs, we used up to 1024 cores of a parallel cluster to scan multiple parameters and produce the data for these figures. For the read noise we used a Gaussian noise model with a fixed sigma, Eq. (3). For the write noise, we used the simplest model, with noise independent of the current state, Eq. (8). As seen from the figures, adding an asymmetric nonlinear response rapidly reduces the overall accuracy. Moderate symmetric nonlinearities do not impact the accuracy. For small images, Fig. 20d, the symmetric nonlinearity actually increases the accuracy at higher levels of noise. We believe this is because the weights are nudged towards the max or min values, reducing the impact of noise.
VIII. DEVICE RESISTANCE
The last key device requirement to consider is the resistance required for use in a crossbar. Scaled wires at a 10 nm half pitch can only handle 10 μA before electromigration becomes an issue [40]. Higher currents also cause unacceptable parasitic voltage drops [41]. In order to support a 1000x1000 crossbar with a fully parallel read or write, each device can have a maximum switching current of no more than 10 nA. If we only read/write a smaller 100x100 crossbar in parallel, each device can have a switching current of 100 nA. At 1 V that corresponds to a resistance of 10 MΩ.
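The current budget above is simple arithmetic, sketched here for both crossbar sizes. The function names are illustrative assumptions; the 10 μA wire limit and the 1 V operating voltage are taken from the text.

```python
def max_device_current(wire_limit_a, devices_per_wire):
    """Per-device current if all devices on a wire conduct at once."""
    return wire_limit_a / devices_per_wire

def min_device_resistance(voltage_v, device_current_a):
    """Ohm's law: minimum on-state resistance that respects the current limit."""
    return voltage_v / device_current_a

WIRE_LIMIT_A = 10e-6     # electromigration limit quoted for a 10 nm half pitch wire
VOLTAGE_V = 1.0

results = {}
for n in (1000, 100):
    i_dev = max_device_current(WIRE_LIMIT_A, n)
    r_min = min_device_resistance(VOLTAGE_V, i_dev)
    results[n] = (i_dev, r_min)
    print(f"{n}x{n} crossbar: {i_dev * 1e9:.0f} nA per device, {r_min / 1e6:.0f} MOhm at 1 V")
```

By the same arithmetic, a fully parallel 1000x1000 crossbar would need devices of about 100 MΩ at 1 V.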
High resistance devices have been demonstrated [42, 43], but devices have not yet been demonstrated with both a high resistance and low-variability, symmetric analog switching characteristics. The need for high on-state resistance and good analog characteristics means that filamentary resistive memories may not work as well as non-filamentary devices. A resistance higher than that corresponding to a quantum of conductance, 13 kΩ, requires current to tunnel through a barrier. This presents a fundamental problem for a filamentary device: a single atom [...]
Fig. 17 (caption, partial): [...] [39]. This device shows a nearly symmetric response when switching from positive to negative pulses.
IX. CONCLUSION
We have introduced a general purpose neural architecture that can solve many different problems. This architecture can be used effectively to implement backpropagation with resistive memory crossbars that have the properties summarized in Table III. Our numerical modeling of 2D crossbars (matrices) of devices on three different datasets has shown that training or classifying with resistive memories with a read noise sigma up to 5% of the total conductance range does not significantly degrade the accuracy (~1%). Neural networks are also robust to write noise that is up to 0.4% of the total range and 300% of a characteristic update. This will vary slightly depending on the dataset and the neural network architecture. Both read and write noise can be modelled with reasonable accuracy using a simple Gaussian noise model. The simpler noise models generally match more complex models so long as the models have the same noise variance averaged over the weight distribution. The read and write noise models are physically inspired, but refined to empirically fit the available data. Asymmetric nonlinearities with a nonlinearity parameter >0.1 degrade the classification accuracy because they cause weight decay, while moderate symmetric nonlinearities with a nonlinearity parameter up to 5 do not harm the classification accuracy. The asymmetric nonlinearity model is empirically derived from device data, while the symmetric nonlinearity model is more speculative. To work in an energy efficient crossbar, resistive memories must also have a high on-state resistance of 10 MΩ or higher. Promising devices have been demonstrated experimentally, but more resistive memory development is needed to create a device that meets all of these requirements simultaneously.
