Abstract--Some next generation computing devices may consist of resistive memory arranged as a crossbar. Currently, the dominant approach is to use crossbars as the weight matrix of a neural network, and to use learning algorithms that require small incremental weight updates, such as gradient descent (for example Backpropagation). Using real-world measurements, we demonstrate that resistive memory devices are unlikely to support such learning methods. As an alternative, we offer a random search algorithm tailored to the measured characteristics of our devices.
I. INTRODUCTION
Future applications can no longer rely on the various types of CMOS scaling to gain more computing power [1] . The industry is actively seeking other paths to go "Beyond Moore". One such path is neuromorphic computing, where the hardware is specialized to run neural networks of some variety.
An attractive approach is to arrange memristors in a cross bar to do analog matrix-vector multiplication. The DARPA SyNAPSE project invested in such an approach [2] , [3] , though several of their performers moved away from analog compu tation. The IBM TrueNorth chip is in fact entirely digital [4] , [5] . Hewlett-Packard focused on a software architecture that runs on conventional CMOS [6] . This suggests how difficult it is to implement a neural algorithm using memristors. We show below that some of this difficulty stems from the limited precision of analog computation.
The emphasis in the field has been on devices that perform efficiently, as opposed to devices that learn. For example, TrueNorth packs 10 6 integrate-and-fire units on a chip in part because it does not include the hardware support for online learning. All training must be done in a conventional com puter. Resistive memory devices ("memristors") are attractive because of their potential for low-cost online learning.
Highly adaptive behaviors-that which we call "intelligence" in humans-require online learning by definition. New appli cations which communicate more naturally with humans and help them solve problems will need to be adaptive rather than static [7] . Certainly conventional computers can do this, but to scale up will require new generations of efficient hardware.
At present no such hardware is in hand, and there should be a clear advantage before developing it. We explored whether neural learning methods would indeed benefit from memristor crossbars, and which algorithms would work best. Below we detail the journey. We first characterized the devices (Section II), then developed a simulation to train them using conventional methods (Section III). When this proved im-978-1-4799-1959-8/15/$31.00 @2015 IEEE possible (Section IV) we developed a random-walk approach (Section V).
II. MEMRISTOR MODELING
A simulation of neural algorithm running on a memristor crossbar consists of three levels: the device, the crossbar, and the algorithm. We detail the latter two in Section III. Here we focus on modeling the memristor itself.
Analytic models of memristors based on first principles are an active area of research [8] , [9] . This difficult problem is not yet fully solved. In particular, analytic models tend to underestimate the variability of a given device ' s behavior. Thus we opted for a data-driven approach: collect enough measurements in a table and use nearest-neighbor lookup and interpolation to emulate the physical device.
We have several tantalum oxide (TaOx) test chips produced by HP and Sandia. HP also developed a programmable test board, which they kindly provided to us. (Only a small number of these boards have been produced, primarily for internal use.) The board consists of a PIC microcontroller, digital-to-analog and analog-to-digital converters, and an array of electronic switches to select row and column on the chip.
Barring any hidden state, a memristor can be viewed as a function f(Ro, V) ---7 R1, where Ro is pre-pulse resistance, Rl is post-pulse resistance, and V is the pulse voltage. All pulses in our tests were 2JL3 wide. The goal of the data collection was to sample this function.
A. Controlled start values
The first challenge was to move the device into a known initial resistance Ro. Memristors are quite noisy, both in to read and to write. It is generally not possible to set one within a small enough tolerance in a single pulse. Closed loop control is necessary. We programmed the HP board to apply a series of pulses, measuring the resistance after each one and adjusting the sign and magnitude to track toward the target Ro. Our naive implementation used four cycles with a maximum of 30 pulses each. All the voltage pulses in a given cycle used the same sign, which was determined by the sign of the difference between measured resistance and Ro. At the start of the cycle the magnitude of the voltage was small, then gradually increased until the measured resistance passed Ro, or 30 pulses were completed. The standard deviation of resistance error with this procedure was 6.2Sl.
The test consisted of a resistance read, a specific voltage pulse, and a final read. We wrapped this in a pair of loops that systematically explored a region of (Ro, V) space.
B. Random-pulse sampling
The above procedure is inefficient. Setting the pre-pulse resistance involves a large number of reads and pulses in itself. Those reads and pulses are also perfectly valid data. We developed a simplified procedure that interleaved pulses and � reads: R V R V R V R ... The resistance readings immediately ':l3 before and after a voltage pulse V provide the Ro and Rl values, respectively.
In order to cover a large portion of the parameter space (Ro, V), the test program drew voltages from a uniform distri bution over [-7,4.3] . However, the domain of the distribution changed when the resistance dipped too low or rose too high. This was necessary to protect the device from damage and to concentrate the samples within a useful ranges of resistances.
Using this method we collected several million samples from a small handful of devices. The collect time per device was under 24 hours. A 10 megasample collect from one device forms the basis for the simulations below. The other devices gave qualitatively similar results, thus we consider this data representative. Figure 1 shows a subset of the collected data points, decimated by a factor of 100 for ease of viewing.
C. Table construction
The primary use of the data is to emulate a memristor in a non-parametric manner. One option is to find the nearest neighbor in (ROl V) space and use the third member of the tuple as the resulting resistance. That approach would be slow and overly sensitive to input values. A better approach is to extract statistics from the data and store them in a table that allows direct indexing.
We constructed a table that ranged from -10V to 10V in 0.1 V increments. Resistance ranged from on to 20kn in lOOn increments. We distributed each sample to its four nearest bins in (ROl V) space with bilinear weighting. For each sample we calculated 6. = Rl -Ro, and for each bin summed weight, weight· 6. and weight· 6.2. From these we calculated the mean and standard deviation of 6. for each bin.
The resulting table still required a number of post processing steps to be useful for simulation. First we de termined the write voltage threshold to be ±1.6V by ex amination. Vo ltages near zero showed a small amount of resistance increase, even though in theory they should produce no change at all. The cause of this is undetermined, but possibilities include an actual increase in resistance due to the read protocol of the HP board. The lower-level software samples the resistance many times, and we have observed some cumulative effect from sub-threshold pulses. We forced these entries to be zero, and subtracted their average from other entries in the same row (associated with a lOOn-wide bin), effectively removing the read drift. Figure 2 shows the resulting table.
We applied a similar procedure to the standard deviation values. There were nonzero entries in cells below the write threshold, which we interpreted as read noise. We subtracted the average read noise on a per-row basis, and recorded it in another table for use during simulation. We interpreted the remaining standard deviation values as write noise. Figure 3 shows the resulting table.
Note that these two tables (JL6., a 6.) do not fully capture the behavior of the device. We subjected the data to several analyses outside the scope of this paper (see [9] for related work). We observed that Rl is not strictly a function of Ro and V, but also the history of previous voltage pulses. That is, the memristor contains hidden state. The table-based approach treats the hidden state as noise, which is sufficient for this work.
D. Results
The 6. (resistance change) values showed some noteworthy patterns. Unlike the typical model of a memristor as a sym metric pair of exponentials, we found significant asymmetry. to posItIve voltages. We expect this from a circuit-based argument. When a memristor is subject to a voltage sufficient to cause a decrease in resistance, the current will increase according to Ohm ' s law, creating a feedback loop that drives resistance down even faster.
Read and write noise varied across the space. Focusing on the regions that were useful for our simulations, read noise (one cycle) was about 1.5% of the resistance value. Write noise for positive pulses (decreasing resistance) was about 10% of Ro. Write noise for negative pulses was about 0.5% of Ro.
With the finished table, we simulated a pulse to the mem ristor by retrieving the four cells closest to (Ro, V) and interpolating between them. The new resistance resulting from the pulse was Rl = Ro +� +G(ow)· Nw, where G(·) is the Gaussian distribution with zero mean, ow is the interpolated standard deviation, and Nw is an optional scaling for write noise. Similarly, we simulated a noisy read as RO+G(aR)·NR, where N R is an optional scaling for read noise.
This model approximates the function f(Ro, V) -+ Rl with a piecewise linear (ruled) surface. In rapidly changing regions such as near the "cliff ' , the model introduces systematic error due to large differences between the actual slope and straight line fit. Sometimes this difference exceeds 1000. One solution would be to use finer binning.
III. TRAINING A MULTI-LAY ER PERCEPTRON WITH BACKPROPAGATION
For clarity, we give a very brief summary of multi-layer per ceptrons (MLPs) and the Backpropagation algorithm (Back prop). One stage of an MLP has the form of a matrix-vector multiply followed by a differentiable squashing function: 
where T is the "truth" or "target" vector for the network output, Ei is the error feedback to a given layer, and 0 is elementwise multiplication. Each layer is updated using a step size ex:
The step size is typically a positive constant much smaller than 1. Alternately, a line search can find the step size that gives the largest improvement in each cycle. We used a constant ex = 0.01 in all the tests reported below.
A. Crossbar implementation
A melmistor crossbar (abbreviated "xbar") could be part of an electronic module that does all the work of one perceptron layer ( Figure 5 ). These modules could be chained together to form an MLP ( Figure 6) .
A module may be implemented in several different ways. Figure 7 shows a highly abbreviated xbar. In addition to a dense 2D array of mernristors, a certain amount of peripheral circuity would be needed to drive the devices and perform the functions described in Figure 5 . In the design shown here, each perceptron weight is represented by one memristor device. In order to have signed weights, there is an additional row of conventional (non-programmable, low noise) bias resistors. These produce a bias current which is fed to each row circuit. The rows and columns also have driver circuits which pulse the array to program the mernristors.
Omitting the electronic details, an output circuit contains two key resistors that determine the operating range of the memristor devices. Rbias is the fixed conventional resistor that determines the value of a "zero" weight. Rout is an output resistor that scales current to voltage. A weight w translates to a conductance (inverse of resistance) in this system, which has the following relationship with its associated memristor value R:
Ultimately, R contributes to a voltage that represents the linear (non-squashed) output of the perceptron. This passes through some electronic implementation of the sigmoid func tion. In our simulations Rbias = 4kSl, as this is roughly in the center of the cleanest data in the table. The scale factor was Rout = 40kSl, which gives us weights in -3.3 to 10 for resistances in 6kSl to 2kSl respectively. We observed that conventional MLPs trained on the problems presented here typically had weights in [-3,3] ' so this configuration was sufficient.
B. Learning on an xbar
Each training cycle produces a set of weight changes (Equa tion 1). Learning would be most efficient if there existed a set of row and column voltages that could fully update the xbar in a single pulse. Unfortunately, the asymmetry in mernristor response ( Figure 4 ) prevents a single pulse that correctly delivers both positive and negative changes. Furthermore, the shape of the response curve does not allow the set of column voltages suitable for one row to work with a different row. Thus, the best we can do is pulse the array row-by-row. Alternately, it may be possible to pulse the whole array using voltages right at the programming threshold (in our case ±1.6V) as described in [10] . The idea is to rely on a large number of small pulses that add up in the correct direction for each weight. Unfortunately, this approach failed to converge in our test setup.
As a consequence of Backprop itself, both pulsing methods are forced to change the resistances by amounts that are orders of magnitude smaller than the observed noise in memristor devices. We explore this in Section IV.
We developed software that implemented weights in a modular way, so that the table-lookup memristor emulation could easily replace conventional floating-point weights. This allowed us to validate network structure and performance before testing mernristor behavior. The software was not intended to be a full electrical simulation, simply a means to explore the feasibility of these algorithms on potential future hardware.
C. Datasets and Results
We tested using the following datasets: 1) The MNIST handwritten digit set, as preprocessed by LeCun et al. [11] . This set has 28 x 28 pixel input cells. The pre-selected training set has 60,000 items, and the test set has 10,000 items. 2) The "Optical Recognition of Handwritten Digits" dataset from the UCI repository [12] . This set has 8 x 8 pixel input cells. The pre-selected training set has 3, 823 items, and the test set has 1,797 items. 3) A Boolean function with 4 bits of input and one output bit. The Boolean data consisted of all 16 possible inputs plus their ground-truth output. We used the full set for both training and testing, and required that the resulting network achieve 100% accuracy. Training had three possible stop conditions. Define an "epoch" as one pass through all the training data. We could set a limit Tepoch on the number of epochs. We could limit the maximum error on any element of the output vector, that is, IIE l loo ::; Tmax. Finally, we could limit the average error, 11�l l ::; Tavg, where n is the number of elements in E. Note that E is different than the actual classification, and that classification accuracy is measured on the test set, which may be different than the train set. Thus, Tavg and T max are indirect predictors of classification accuracy.
We trained conventional (float) MLPs with 300 hidden units on dataset 1. It took 695 epochs to satisfy Tavg = 0.1, resulting in an accuracy of 96.5%. See Table I for a full listing of results and their statistics. Note that when a (J column is blank, the column to its left is exact. We did not attempt to fine-tune the network for better accuracy, because the goal was to test the feasibility of a memristor-based network, not to set a new classification record. We converted this network to resistance values using Equation 2. Without any noise, the resulting memristor simu lation achieved the same level of accuracy as the float version. As noise increased, the accuracy dropped off (Figure 8) . At a realistic level (NR = 1) it fell to 37.8%.
We trained conventional MLPs with 36 hidden units on dataset 2. The only stop condition was Tepoch = 1,000. The resulting network achieved 96.6% accuracy. A mernristor simulation with converted weights and no noise achieved the same level of accuracy as the float version. With N R = 1 the accuracy of the converted network dropped to 90% (Figure 9) .
Because of the higher cost of simulation (about lOx to 20x the cost of the float version, depending on the noise model), we only attempted xbar training on the smaller datasets. We trained on dataset 2 using the row-by-row pulsing method, NR = 1 and Nw = 1. The network achieved 80.9% accuracy after 1,000 epochs.
Dataset 3 was more difficult. An MLP with one hidden layer of sufficient size should be able to learn any Boolean function perfectly. One way to enforce this is to train until all output errors fall below the binary decision threshold (which in this case is 0.5). We trained a conventional network with Tmax = 0.4. It converged in 11,405 epochs, with 100% accuracy. We converted the learned network to an xbar. Without noise it attained 100% accuracy. With N R = 1 it only fell to 99.6%
( Figure 10 ). We attempted to train the network directly on an xbar with zero noise, and it repeatedly failed to reach the required 100% accuracy. When trained with the observed level of noise (NR = 1 and Nw = 1) it did about 20% better. This is consistent with the observation that some noise in neural network training can improve the result (for example [13] ).
IV. THE PROBLEM OF PRECISION
Our tests showed that the emulated memristor crossbar with a realistic level of noise trained to a lower level of performance than the equivalent conventional ("float") network. In partic ular, they were incapable of reaching 100% accuracy on the simple Boolean function of dataset 3.
It is well known that noise places a lower bound on preci sion. We argue here that the precision necessary for Backprop to converge is finer than that allowed by the observed noise of the device. First we examine the precision used by the conventional network.
An IEEE 754 single-precision float has a 23-bit mantissa, the equivalent of about 6 decimal digits. A cursory analysis of Backprop shows that this is just enough. Specifically, suppose a = 0.01, 1= 1, and WI = 0 so 1'(WI) = 1'(0) = 0.25. Since the typical range of network weights is [-3,3] ' we assume no more than 5 significant digits after the decimal point. This implies that e < 0.001 will be lost in the addition operation. Late in training, observed values of e are on the order of 0.1 to 0.01, so single-precision floats are sufficient for our purposes. See [14] for a more detailed analysis of precision required by Backprop. However, the xbar emulation brings more loss of precision. Weights are represented by resistances. From Equation 2 it is easy to derive the relationship between weight change and resistance change: 10 4 , so such updates are right at the limit of single-precision floats. Furthermore, they evaluate to voltage pulses that are only a few microvolts above threshold, again straining the limits of single-precision. We could improve the simulation by increasing to double-precision, but all this would accomplish is to more accurately compute the noise that overwhelms the algorithm. It is extremely unlikely that a practical electronic imple mentation of an xbar will support microvolt pulses or sub Ohm resistance changes. Even under closed-loop control, the smallest step is on the order of lOn, two orders of magnitude larger than the needed value. The fact that Backprop leads naturally to such absurd scales suggests that it is not the appropriate algorithm.
Substituting these into Equation 1 gives:
Consider that we observed ow ;::: :; 400n in the "cliff' area, four orders of magnitude larger than the needed resistance step size. If the goal were simply to train the network offline and then burn it to an xbar, then this would be acceptable, as we could use closed-loop control to set the resistances. The only barrier to practical use would be read noise. The observed O"R ;::: :; 60n degraded the performance in our simulations by various degrees. It may be possible to mitigate this with more redundant network structures. This is the subject of ongoing work.
V. TRAINING WITH LOTTO However, the goal is to use an xbar to learn directly and efficiently. We may rely on the CPU for some tasks, but the sum of the work done by the xbar+CPU should be substantially less than the equivalent algorithm done entirely on the CPU. Otherwise there is no value in hardware-based learning.
We look at the characteristics of the system, and try to take advantage of them. We should minimize expensive operations, and make cheap operations do as much work as possible. The xbar operations, in descending order of time cost, are:
• Load data from CPU. -Assume the use of direct memory access (DMA) to cycle through blocks of training or test data in conventional memory. This will be slightly more efficient than fetching them to the CPU core, but similar enough that we should ignore the difference. The key constraint is that the xbar-based learning algorithm not use substantially more fetches than a CPU-based algorithm.
• Set a resistance with closed-loop control. -The number of iterations, and thus the cost, scales inversely with the accuracy bound E.
• Matrix-Vector multiply.
• Write pulse to entire xbar.
Memristors have an asymmetric response to voltage pulses. Raising the resistance involves low noise and low slope, and thus is easy to control. Lowering resistance involves high noise and high slope. It makes sense to think of the device as controllable in one direction, with reset in the other [15] .
A. Algorithm
The learning method presented here (Algorithm 1) is whim sically named "Lazy Optimization Through Targeted Over sampling" (LOTTO). It is a variant of random search [16] specifically adapted to the characteristics of memristors. The key idea is to incorporate write noise as a central part of the algorithm. The "targeted" aspect refers to the fact that sometimes it is necessary to use closed-loop control to set weights, but if we relax E we reduce the cost and also get random sampling around the desired weight.
The algorithm alternates exploration and exploitation steps until convergence. An exploration step picks a new random point in the weight space. An exploitation step goes back to the best point seen so far and samples a random location near it. In either case, LOTTO then iterates through the training data (a full epoch) to measure the goodness of the new point. If it is an improvement, LOTTO accepts the new point, then takes a small random step in the direction of increasing resistance. As long as each new point is an improvement, LOTTO continues the random walk (Figure 11) .
A random set of weights (exploration step) can be created in a manner similar to the random-pulse sampling method described in Section II-B. The small steps of the random walk are made by pulsing the entire xbar with a negative voltage at the write threshold. Due to the inherent randomness, all resistances will increase by small but varying amounts. Apply small negative pulse to whole xbar 
B. Results
We implemented LOTTO within the same framework de scribed earlier and tested it on dataset 3. We set Tepoch = 100,000 and Tmax = 0.4. Table I separates runs that con verged due T max from those that used up the full allotment of epochs. The latter are labeled "not converged".
LOTTO is sensitive to read noise, as it needs to accurately measure the quality of points in weight-space. This becomes more problematic with small numbers of data, such as the 16 items in dataset 3. We trained using full write noise (Nw = 1), but relaxed the read noise in two different ways.
First, we trained with N R = O. The average number of epochs (14,162) was competitive with conventional Backprop (11, 310). Second, we tested with N R = 1 but overs amp led by 100 x. This approach was substantially more expensive and less accurate. However, even in the cases where it did not converge, its average accuracy (76. 6% ) was competitive with Backprop running on an xbar with full noise (79.7%). Note that the reported number of epochs for the "not converged" case is slightly greater than 100,000 due implementation details.
C. Discussion
Clearly, LOTTO is not a satisfactory algorithm. A random search provides no guarantee of time bound. In fact, it provides no guarantee of convergence, even to a local minimum. Its chances are improved with lower-dimensional or highly degenerate parameter spaces. Thus, LOTTO is less likely to be useful for larger networks. A simple improvement in the current scheme would be to run Backprop until it stalls, then switch to LOTTO to finish converging.
The ideal memristor learning algorithm would work with resistance changes that are no smaller than the device noise levels, and it would measure its fitness online (that is, using only one or a few current data). There is plenty of space to explore for better algorithms. Categorically these are likely to be non-gradient-based and non-supervised. Examples include self-organizing maps [17] , winner-take-all networks [18] and sparse-code learners [19] .
VI. CONCLUSION
We explored the feasibility of direct traInIng of shallow MLP networks using memristor crossbars. We characterized several real memristors and used a non-parametric model of one as an emulator. This demonstrated some promise for hardware-accelerated neural networks. However, noise and asymmetry in the device's response to programming make direct training problematic. We offered a random search approach that, while not ideal in many respects, shows the possibility that hardware accelerators may eventually compete with CPU-based learning.
