We describe an approximation to backpropagation algorithm for training deep neural networks, which is designed to work with synapses implemented with memristors. The key idea is to represent the values of both the input signal and the backpropagated delta value with a series of pulses that trigger multiple positive or negative updates of the synaptic weight, and to use the min operation instead of the product of the two signals. In computational simulations, we show that the proposed approximation to backpropagation is well converged and may be suitable for memristor implementations of multilayer neural networks.
Introduction
Recent advances in machine learning have provided the solutions to many problems, which seemed insurmountable in the past. New approaches based on deep and recurrent artificial neural networks (ANN) are very efficient in dealing with pattern recognition, visual objects detection, speech recognition, signal restoration and prediction, etc. [1] . Deep feed forward networks excel at these tasks because their design can be general enough to solve many natural classification tasks, but their main feature is a funnel-like error landscape, which opens a possibility to apply computationally efficient learning methods, such as stochastic gradient descent.
Nowadays, the computations with ANNs are performed mostly using conventional computation architectures by simulation on general purpose processors or graphics cards which are quite efficient at matrix algebra being the core of feed forward deep networks. However, due to the limited level of parallelism this approach is quite inefficient in terms of the ratio between the computational speed and the power consumption. Consequently, it hinders the application of these methods in areas, where the available power is limited, such as mobile devices or autonomous robotic platforms. A further step in applying deep learning neural networks to real-life problems would depend on their implementation in hardware which would use accelerators specifically built to perform neural computations and employ the most possible parallelism. The optional solution here is to create a system which mimics the structure of the network under consideration by implementing neurons and their interconnections directly in hardware. In this case, the computational fabric of the device can be made of blocks representing neurons, which are interconnected with synapses with weights implemented as distributed memory, similar to the biological neural tissue.
One of the major problems hindering the design of artificial neural network hardware is the storage of synaptic weights. The distributed approach to the storage makes use of dynamic RAM (DRAM) very inconvenient due to the requirement of constant refresh operations arising from the leakage of charge [2] .
Among the approaches existing today, the one coming close to take full advantage of huge parallelism of ANNs is an FPGA-based implementation, such as described in [3] . Such implementations rely on the array of specialized processing units, tailored to compute outputs of neural network layer. Being quite efficient, this approach, however, suffers from two drawbacks. Firstly, the 2 distributed RAM on modern FPGA is implemented using static RAM arrays, which have low density, making the available memory limited. Hence, in this case, the implementation of large-scale deep network might require constant exchange with the external DRAM, which is inconvenient. Secondly, despite being quite universal, FPGAs suffer from low connectivity and limited availability of multiply-and-accumulate modules, and therefore they must be shared among different neurons. Moreover, the incorporation of on-line learning in such devices generally requires supervision from the external CPU.
Over the last several years, a variety of memristive devices has been discovered.
These devices are the resistors with the conductance controlled by the current or voltage previously applied to them. In this way, they represent the class of non-volatile memory devices with effective continuum of memory states, which can be scaled down to a ten nanometers. These properties of memristors have opened an opportunity to implement a hardware neural network as an array of cores, which consist of a bank of presynaptic and postsynaptic neurons, made with the use of conventional CMOS technology, and interconnected with an array of memristive devices. Such arrays can be arranged as cross-bar arrays, thus providing an efficient storage medium [4] , [5] , [6] , [7] , [8] , useful for the effective modeling of physiological processes [9] and the implementation of wide-spread types of neural networks, such as cellular neural networks (CNN) [10] . The comprehensive survey of memristors and memristor-based techniques with the necessary references to previous works are given in the seminal paper of the field's pioneer L. Chua [11] .
For the hardware implementation of hybrid CMOS/memristor neural networks, one of the biggest challenges is training [12] . Even simplest training algorithms, such as error backpropagation [13] , are difficult to implement at the circuit level. It is much easier to implement these algorithms off the chip and then transfer the training results [14] . Another approach is to train only the output layer of the network, which can be accomplished using a simple least-mean-squares learning algorithm [15] , [12] , [16] , [17] .
In [2] , an alternative Random Weight Change learning algorithm is used.
3
Although its convergence is theoretically proven, it appears to be very slow.
For each synaptic weight it uses four memristors in a bridge configuration.
Besides, there is an independent random number generator in each model neuron.
The scheme in [18] uses standard backpropagation scheme and two memristors for each synapse. Unfortunately, the learning is implemented there using an external training unit, and this part of the network learning essentially uses MATLAB, because it apparently does not have a simple implementation in circuitry, while the other part of the modeling is performed via SPICE simulator.
A significant advantage of the work [19] is the clearly formulated problem of non-local calculation of the signal and error product at the opposite contacts of a memristor. However, the particular solution proposed by the authors critically depends on the linearity of the memristor resistance as a function of both the amplitude and the duration of the passed current. This condition does not yet hold for the majority of the current memristor implementations.
Meanwhile, there has been a large number of works emulating a so-called synaptic plasticity in memristor devices (e.g., [20] , [8] ). They are aimed at demonstrating effects similar to the plasticity of biological synapses, such as short and long term potentiation/depression and spike-timing-dependent synaptic plasticity. Despite the fact that these models are biologicaly inspired, there is no clear idea of how to use them for the implementation of a learning procedure which would be adequate for solving machine learning problems.
In this paper, we aim to bridge this gap and propose a modification of backpropagation algorithm which allows to circumvent the described difficulties using the learning rule similar to spike-timing-dependent plasticity in synapses.
Strictly speaking, we do not give a hardware neural network implementation scheme with a memristor crossbar, but we thoroughly describe the method of pulse representation of signals and the absmin operation used instead of standard product, and also perform the modeling experiments in MATLAB environment.
The remainder of the paper is organized as follows: in Section 2.1, we give the problem statement. In Section 2.2 the backpropagation algorithm is briefly described. In Section 2.3 we propose our method of signal representation and the 4 operation used instead of standard product. Section 3 describes the test problem of handwritten digits recognition and parameters of modeling experiments. In Section 4 the results are given. Finally, the discussion of the proposed methods and the obtained results is given in Section 5.
Material and methods

Problem statement
In this work, we do not consider any specific scheme of multilayer neural network implementation on a memristor crossbar. However, we assume that a neuron is a device which sums the currents incoming from other neurons. In Section 2.3, we will show how in this case the signals x i and δ j should be represented in neurons in order to learn the network.
Backpropagation
In this section, we give a short sketch of the backpropagation technique (the readers familiar with the backpropagation learning rule can skip it and immediately move to the next section).
The task of a multilayer perceptron is to get the desired output to the inputs of certain types. With this goal in mind, the learning of perceptron is performed.
The pairs of inputs and the desired outputs are loaded to the scheme and the error of the response is determined. The parameters of the scheme (the weights of inter-neuronal connections) are changed with each load, so as to diminish the difference between the desired and the real output ( Figure 1 ). Figure 1 . The scheme of a multi-layer perceptron Let us denote the desired output of the neurons as t j , and its actual output as y j . We will further use two measures of network error -the mean-square-error M SE and the cross-entropy CE :
6 Let w ij be the weight of the connection between j-th neuron of the last layer and i-th neuron of the previous layer ( Figure 1 ). The derivative of error with respect to this variable can be written as:
where
Let the output neurons have a sigmoidal activation function f (z) = 1 1+exp(−z) , and we use equation (1) for the error calculation. Then, for the equation (3) we get:
Let us assume that the layers of neurons are numbered from 0 to L, and the weights of connections from the layer i − 1 to the layer i have the upper index i.
Then, for all intermediate layers we can obtain expressions similar to (4).
where δ
i are input signals and
so that error terms δ
are propagated backwards using transposes of weight matrices and the derivative of the activation function. Thus, the back-propagation algorithm is the pair of equations (5) and (6) with an additional rule for the weight update. Figure 2 shows how signal propagates forward and backward in the network. Alternatively, the error can backpropagate using another weight matrix
which is different from W (k) and is kept constant during the learning process.
Such method was devised in the work [21] , where elements of matrix B were chosen randomly. Formula (6) in this case can be rewritten as:
Signal representation and multiplication using memristors
Despite the fact that the following procedure is quite general, we are aiming to use it for the implementation of neuromorphic systems employing metal oxide based memristors. This class of memristors has a very non-linear behavior, and their characteristic feature is the presence of a voltage "dead zone", i.e. up to the certain voltage levels, the memristor resistance state is kept unchanged.
Moreover, the gradual change of the state in such device can be achieved by applying a voltage pulse train -the property which has been demonstrated in the works reporting on synaptic plasticity effects [20] , [8] . This property paves the way for the implementation of a learning procedure in memristor matrices using only local operations as follows.
In order to use the formula (5) in optimization algorithms using gradient descent, the weight update should be proportional to the product of two variables:
The values x i and δ j are obtained in different neurons (Figure 1 According to this procedure, the weight change ∆w ij is proportional to the minimum between x i and δ j ∆w ij ∝ min(x i , δ j ).
In case x i and δ j have opposite signs, ∆w ij should decrease. This can be achieved by changing the pulse polarity. In the general case, when x i and δ j can be either positive or negative, (9) should be modified to:
In fact, the expression (10) approximates (8) quite satisfactory for the network learning purposes (see Section 3 below), as it yields correct direction for a gradient descent process. The reason for this can be seen in Figure 5 . Therefore, this update rule will be used below in computational experiments.
In principle, the learning cycle should be divided into four subcycles for The first subcycle for the positive value of sign(x i · δ j ) has been already described above. We further consider u − to be the voltage below the lower writing threshold V of f , i.e. by applying it on the memristor one decreases its conductivity, and V of f < u − /2. In order to implement the second phase of the learning cycle (when ∆w ij < 0), let us express the signals x i and δ j in the form of pulses with an amplitude |u − |/2, as shown in Figure 4 (b). As a result, the overall bias on the memristor is negative when pulses coincide, and the weight of memristor will decrease.
In summary, the learning procedure consists of two phases, which go successively one after another and are illustrated in Figure 4 (a) and (b). During the first phase, the i-th neuron sends a pulse train of positive polarity and amplitudes |u + |/2 to the first electrode of the memristor, with the number of pulses proportional to the value of x i . Simultaneously, in case of δ j > 0 the j-th neuron sends a pulse train of negative polarity and amplitudes of |u + |/2 to the second electrode of the memristor, with the number of pulses proportional to the value of delta j , and the j-th neuron is inactive in case of δ j < 0.
In the second phase, the i-th neuron sends a pulse train of negative polarity and amplitudes |u − |/2 to the first electrode of the memristor, with the number of pulses proportional to the value of x i . Simultaneously, in case of δ j < 0 the j-th neuron sends to the second electrode of the memristor a pulse train of positive polarity and amplitudes of |u − |/2, with the number of pulses proportional to the value of δ j , and the j-th neuron is inactive in case of δ j > 0.
These two phases go sequentially one by one and form one full cycle of learning.
It should be emphasized that both phases are always performed, irrespective of whether they have an effect or not. It is necessary that the neurons, located at the opposite ends of the memristor, work independently and rely only on the common schedule.
Experiments
All modelling experiments were performed in MATLAB. In experiments, a standard two-layer network with [784-110-10] architecture was trained on the MNIST data set [23] . The MNIST dataset consists of 70,000 handwritten digits out of which 60,000 are used for the training process and 10,000 for the testing process.
During the learning phase, we used minibatches of size 100 to speed up calculations. The initial weights were selected randomly from the uniform distribution, so that w At the beginning, we used the neuron sigmoidal activation function of the following type:
In the further experiments, we used another neuron activation function called relu (rectified linear unit):
The reason for the use of such activation function lies in the simplicity of its implementation in hardware. Despite the simplicity and the discontinuity of the derivative of (12) at x=0, this function is widely used in machine learning and yields excellent results. For the weights update we used the equation (6) .
Three parameters, such as the method of multiplication, the matrix of error propagation, and the "continuality" or "discreteness" of variables in the implementation of equation (8), were varied. Each of three described parameters has two possible options. In particular:
1. As described in Section 2.3, two options of the multiplication method were examined. First of all, we used a regular mathematical multiplication procedure in accordance with (8) . We refer to this procedure as the method times. The second option was to use the approximation (9), the method called absmin.
2. As described in Section 2.2, two options for the error propagation to the connection matrices were considered: (a) when the error is "backpropagated" across the same neuron connection weights, which works for the forward signal propagation (we called such standard method as transposed), and (b) when the error is conferred to the connection weights via the randomly chosen constant connection matrix, as has been specified in the equation (7) and in [21] , which we called as const method.
3. Also, two cases in the implementation of (9) were considered: in the first one the values of x i and δ j are continuous, while in the second one these variables are made discrete before multiplication, as in the case illustrated in Figure 4 .
We assume that, when implemented in chip, the learning rate might depend on the frequency of pulses (see Figure 4 ) and therefore it can be easily varied by changing the pulse generator frequency in the whole network. Subsequently, the 13 operation of changing the learning rate can be easily achieved. In the modelling experiments, starting from the value of 10 −4 , we dynamically change the learning rate depending on the training set error: if the error decreases, the epoch learning rate increases by 10%; if the error increases, the learning rate decreases by 30%.
It is known that such dynamic change of the learning rate can speed up the learning process. We have found that in the experiment the learning rate almost always stays in the range from 10 −4 to 10 −3 . As was described above, the values of x i and δ j are represented in the form of pulses, as shown in Figure 4 . However, in the modelling experiments, we have All the results were averaged by 10 trials.
Results
The results are shown in Figure 6 and summarized in Table 4 . In the simulation experiments, we used relu (12) activation function and dynamical adjustment of the learning rate.
In the experiment, the test error smoothly decreases to about 1.8% for standard backpropagation and to 2.4% following the replacement of the times with absmin (black solid lines in Figure 6 ).
In fact, the results obtained with the method times can be considered as standard and comparable with the state-of-the-art methods for this task.
Significant improvements can be achieved only by increasing the number of hidden layer neurons and by increasing the training data set size with the use of elastic deformations of training examples ( [24] ).
The results obtained for x i and δ j discrete values are shown by dashed and dotted lines representing ≤ 100 and ≤ 20 pulses, respectively. One can see that given enough discrete levels (100 discrete values in our case) the performance is statistically close to that in a continuous case. However, the aggresive discretization leads to the rapid drop in the performance.
Discussion
The key issue addressed in our work is how to use the plasticity effects in synapses represented by memristors with multiple resistive states to locally implement the learning rule. The main distinction between our results and the related studies [25] is that we have implemented the mechanism, which is able to propagate error backwards and is needed for multi-layered networks.
This is important for the deep learning schemes. Note that in [25] The issues arising due to the thermal noise and the variability in memristors are still not resolved, however, the proposed scheme appears to be quite robust with respect to the introduction of switching errors below certain threshold, because the learning procedure does not require setting exact values on each iteration of the process -the main requirement here is to move in the general direction of the error gradient. Hence, the reduction of the learning rate during such procedure should enable the system to eventually settle down in the desired minimum. The verification of this behavior can be performed using MonteCarlo simulations, but they require at least sufficient statistical data on element performance or a good physical model, which is not available at the present moment. At the same time, the initial studies of the effect of noise as well as parameter variability on the operation of memristor-based schemes demonstrate their significant tolerance to these factors [25] . This fact speaks in support of our arguments presented above.
The important issue arising during the implementation of the backpropagation learning rule is the routing of the error signal δ in backward direction.
In this paper, we have considered two possible approaches: the propagation via the same connections (same memristors), which are used in the forward direction, or, alternatively, the propagation via a bypass connection matrix which adopts random values and remains constant during the learning (method const).
Despite the simplicity of the implementation on chip of the latter approach, our experiments have shown unsatisfactory results using such method. Nevertheless, due to the symmetry of the memristor crossbar, the error backpropagation can be done in the same way as a forward propagation during the inference step by using current summation.
The proposed replacement of the times operation with absmin turns out to give good results. The modified learning rule generally does point into the similar direction (preserving the sign) as the original gradient and yields relatively smooth performance curves (Figure 6b ). However, it also have some shortcomings. Firstly, the gradient calculation becomes inexact, which results in slightly larger error (2.4% instead of 1.8%, see. Table 4 ). Secondly, the large price to pay is the necessity of conversion of x i and δ j into discrete pulses which seems laborious and decreases the accuracy for the small number of gradations.
On the other hand, in such pulse representation, the memristor conductance changes by small steps proportionally to the number of pulses. Our approach is very close to that proposed by [19] , however, in our work we do not assume the linear dependence of the memristor conductance on the amplitude of the applied voltage. All pulses are assumed to be of the same amplitude, irrespective of the values of x i and δ j .
Finally, it should be emphasized that in this study we only demonstrate the operability of the proposed mechanisms at the proof of concept level. That is why we have restricted our work to the most commonly used benchmark-the MNIST database, and have applied the number of hidden units, which is sufficient to show that the proposed simplifications still provide learning operation quality, surpassing that of "shallow" perceptron. The detailed studies of models of memristor-based deep learning systems and their physical implementation will follow. We believe that the approaches proposed in our work can expand the capabilities of memristor technologies in the application for self-trained multilayer neural networks realization and decrease their power consumption by reducing the required number of memristors down to one-on-one connection weight. It is also worth noting that the techniques described in this paper are highly scalable.
Disclosure/Conflict-of-Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
