Deep 'Analog Artificial Neural Networks' (ANNs) solve complex classification problems with remarkably high accuracy. However, they require a very large amount of power to perform the calculations, which offsets the accuracy benefits. The biological brain, on the other hand, is significantly more powerful than such networks and consumes orders of magnitude less power, which points to a conceptual mismatch. Given that biological neurons communicate using energy-efficient trains of spikes, and that their behavior is non-deterministic, incorporating these effects in deep neural networks may bring us a few steps closer to a more realistic neuron. In this work, we propose how the inherent stochasticity of nano-scale resistive devices can be harnessed to emulate the functionality of a spiking neuron that can be incorporated in a deep Spiking Neural Network (SNN). At the algorithmic level, we propose how the training can be modified to convert an ANN to an SNN while supporting the stochastic activation function offered by these devices. We devise circuit architectures that combine stochastic memristive neurons with memristive crossbars, which perform the functionality of the synaptic weights. We tested the proposed All-Memristor deep SNN on image classification and observed only about 1% degradation in accuracy relative to the deep analog ANN baseline after incorporating the circuit and device related non-idealities. We observed that the network is robust to variations and consumes ∼11× less energy than its CMOS counterpart.
I. INTRODUCTION
Even though the exact mechanisms of communication between biological neurons remain unknown, it has been shown experimentally that neurons use spikes for communication and that the firing of neurons (spike generation) is non-deterministic [1], [2], [3]. By conserving energy via spike-based operation [4], the brain has evolved to achieve its prodigious signal-processing capabilities using orders of magnitude less power than state-of-the-art supercomputers. Therefore, with the intention of paving the way to low-power neuromorphic computing, much consideration has been given to more realistic artificial brain modeling [5]. The Spiking Neural Network (SNN) concept emerged as a consequence. It has recently become an active area of research owing to its resemblance to the "actual human brain" [6].
In a spiking neural network, communication between neurons takes place by means of spikes. The information is typically encoded in the rate of occurrence of spikes. Different learning schemes have been proposed in the past, and Spike Time Dependent Plasticity (STDP) based learning is widely used because the concept is consistent with experimental statistics [7]. However, STDP learning is typically limited to a network with a single layer of excitatory neurons and a single layer of inhibitory neurons [8]. The capability of such a single-layer, fully connected spiking neural network is limited when compared with the high recognition performance offered by deep ANNs. To date, deep ANNs have given the best performance with respect to classification accuracy. For example, SENet, which won the 2017 ILSVRC, is a deep Convolutional Neural Network (CNN) with the lowest reported top-5 error (the fraction of inputs for which the correct class is not among the five classes ranked highest by the network) of 2.251% on the ImageNet data set [9]. However, such networks require a large amount of power and time if a von Neumann computer is used for computation. For instance, SE-ResNet requires ∼3.2 GFLOPs (billions of floating-point operations) for a single forward pass [9].
In an effort to combine the classification accuracy of deep NNs with the spike-based low-power operation of SNNs, numerous research efforts have focused on crafting spiking deep NNs [10]. One interesting mechanism for doing so is to exploit certain optimization techniques to convert a fully trained deep ANN to an SNN [11], [12]. The approach proposed in [11] outperforms all previous SNN architectures on the MNIST database. Although deep SNNs exist at the algorithmic level, minimal consideration has been given to devices and to realizing such algorithms at the hardware level.
With the intention of reducing the energy consumption of powerful deep ANNs while preserving biological plausibility, we propose a hardware architecture for a deep SNN based on a non-deterministic memristive device. Throughout this article, the term 'memristor' follows the broader definition due to L. Chua [13], which covers all resistance-switching memory devices, but excludes Magnetic Tunnel Junctions (MTJs) and Domain Wall Magnets. These magnetic devices are current driven, have different circuit requirements, and have been studied previously in [14], [15]. We propose circuit architectures to support voltage-driven devices only. Memristors have been widely used in the literature as synapses in NNs [16], [17]. Their multi-level storage capability makes the memristor an ideal candidate for the synapses in a neural network. This multi-level behavior also seems appealing for emulating an analog neuron: different voltage write pulses (inputs) result in different memristor resistances (outputs), and the resistance has an upper and a lower bound, which resembles a thresholding function. However, reliability concerns might arise due to the inherent stochasticity of the devices. This stochasticity has been shown experimentally [18], [19], and the statistical measures suggest that the switching probability of these devices can be predicted. For example, the switching times follow a Poisson distribution for Silver/amorphous Silicon/poly-Silicon (Ag/a-Si/p-Si) based devices.
Memristors offer a variety of favorable features, such as a high number of write-erase cycles ($10^{12}$ [20]), high yield, CMOS compatibility, and low area. Despite these benefits, high programming voltages and long pulse durations [21], or other feedback write mechanisms [22], are mandatory to ensure the switching of the devices in applications such as memory that require very low failure rates. Rather than trying to reduce such non-deterministic effects, in this work we embrace the stochasticity in an efficient way, with the ambition of moving towards a more realistic neuron. We propose the memristor as a probabilistic switch to represent the stochastic neuron in a supervised deep spiking neural network, and memristive crossbar arrays to perform the 'inner product' computation between the incoming spikes and the synaptic weights. We call this structure an 'All-Memristor' neural network because the two main functionalities of a neural network are both handled by memristors. We elaborate how deep ANNs can be trained in order to incorporate a stochastic memristor as a neuron. The gradient descent based backward propagation scheme must be modified to account for the probabilistic function, which may differ from the standard sigmoid activation function $1/(1 + e^{-x})$ of a neuron. We further elaborate certain favorable features of memristors that make them suitable for emulating a stochastic neuron. We propose circuit architectures that can be used to realize the proposed All-Memristor network. We then explore the impact of certain variations on the accuracy of the network. Finally, we compare the energy consumption of the All-Memristor network with its CMOS counterpart.
Even though the possibility of harnessing the inherent stochasticity of the memristor for neuromorphic computation has been mentioned previously [23], [24], [25], a complete analysis of it for deep SNNs has not yet been carried out. Further, stochastic integrate-and-fire neurons (with a focus on devices) have been proposed in the literature [26], [27] for unsupervised learning SNNs. They differ from this work, where we have specifically designed the neuron to suit deep supervised SNNs, which are capable of performing complex tasks with better accuracy [11]. We provide a complete analysis of a hardware-based NN with memristors in which four key features of the brain are embedded: high accuracy, low power, spike-based information transfer, and stochasticity. Note that this work is based on [14], which explored the design of such deep stochastic SNNs for magnetic devices. This work extends the concept to memristive devices, which are voltage controlled and therefore require a rethinking of the design.
II. STOCHASTICITY IN MEMRISTOR DEVICES
Nanoscale resistive devices have been extensively studied as leading candidates for non-volatile memory [28], reconfigurable logic [29], and analog circuits [30]. Despite the many favorable features offered by such memristive devices, the stochasticity of their state changes has raised reliability concerns. To make the memristor deterministic enough for the aforementioned applications, significant consideration must be given to the operating region of the devices. As an example, TiO$_2$ based memristive devices have a typical threshold voltage of 1 V to 'SET' the device [31] for memory applications. This is the magnitude of the voltage write pulse that provides a high confidence (e.g., > 0.99) of writing a logic 1. If a write pulse of 0.5 V is applied instead of 1 V, the device may change its state with a probability that is less than 1. It is evident that increasing the reliability comes at the cost of high power consumption. Our work is an effort to utilize the stochastic behavior of nanoscale resistive devices while operating in the low-power, non-deterministic regime.
The typical nanoscale resistive device is based on a metal-insulator-metal (MIM) structure. The resistance change in these devices can be attributed to the formation of a conductive filament inside the insulator (Ag, amorphous Si (a-Si), and Pt based devices), a change in phase due to Joule heating and cooling (chalcogenide based devices), or field-assisted drift/diffusion of ions (TiO$_2$ based devices). These processes have been shown to be random in nature. In this work, we consider metal filament formation based devices (Electrochemical Metallization devices, or ECM devices) [32], [33]. The switching probability of these devices has been shown to be predictable from experimental data [18]. Furthermore, a-Si devices offer high compatibility with CMOS and have a very high off resistance [18], resulting in low power. They also have a very high on/off ratio ($10^7$) and are capable of storing multiple levels. However, we would like to point out that our proposal can be generalized to any voltage-driven memristor whose stochasticity is predictable. Some devices may be better in terms of power consumption, and others in terms of endurance. For example, HfO$_x$ based devices have higher endurance and lower switching time [34]. Ag/AgSiO$_2$ devices reset by themselves after a certain time period, eliminating the need for an explicit reset [35] (refer to section V for more details).
The ECM devices consist of an insulating layer (a-Si, SiO$_2$, Al$_2$O$_3$) sandwiched between an active electrode (Ag) and an inert electrode (Pt, Ti). When the device is in its high resistance ($R_{off}$) state, it is considered to store a logic '0' (in memory). To write a '1' to the device (SET or 'turning on'), a positive voltage must be applied to the active electrode with respect to the inert electrode. At this point, the active electrode dissolves and cations from the active electrode start migrating towards the inert electrode, where they are electrocrystallized, forming a metal filament (termed the 'forming process' [33]). Once a full metal filament has grown between the two electrodes, there is a sudden drop in resistance. This process is illustrated in Fig. 1. The low resistance ($R_{on}$) state of the device is assigned to logic '1'.
The formation of the filament is a highly voltage bias dependent process. The anode metal particle hopping rate is given by [33] 
where $k_B$ is the Boltzmann constant, $T$ is the temperature, $\tau$ is the characteristic switching time, and $\nu$ is the attempt frequency. $E_a(V)$ is the bias-dependent activation energy. The time required for the formation of the metal filament has been shown to follow a Poisson distribution [18]. The probability of switching within the next $\Delta t$ after a time $t$ has elapsed can be defined as
The dependence of the characteristic switching time on the voltage of the write pulse is given by
where $E_a$ is the activation energy at zero voltage bias, $n$ is the number of anode metal particle sites, and $q$ is the charge of an electron.
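For reference, the three relations described above are commonly written in the following form for filament-formation devices [18], [33]. This is a sketch under stated assumptions rather than a quotation: the symbols $\Gamma$ (hopping rate), $P(t)$, $\tau_0$ and $V_0$ are introduced here for illustration, $E_a(0)$ denotes the zero-bias activation energy $E_a$, and the factor of 2 in $V_0$ follows the form usually quoted for this model and should be checked against [18].

```latex
\begin{align*}
\Gamma &= \frac{1}{\tau} = \nu\, e^{-E_a(V)/k_B T}, \\
P(t) &= \frac{\Delta t}{\tau}\, e^{-t/\tau}, \\
\tau(V) &= \tau_0\, e^{-V/V_0}, \qquad
\tau_0 = \frac{1}{\nu}\, e^{E_a(0)/k_B T}, \qquad
V_0 = \frac{2 n k_B T}{q}.
\end{align*}
```

Integrating $P(t)$ over a pulse of width $t_p$ gives a cumulative switching probability of $1 - e^{-t_p/\tau(V)}$, which makes explicit that both the magnitude and the width of the pulse set the outcome.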
If a particular write voltage pulse is applied to the memristor, the above equations show that the switching probability depends on two key factors:
1) The magnitude of the pulse.
2) The width of the pulse.
If a memristor is to be used as a spiking neuron, the width of the spikes must ideally be the same (variations might be present, and their effect is analyzed in the results section). Therefore, the magnitude of the pulses must be controlled to bring the memristor into its stochastic operating regime. Fig. 2 (a) shows how the magnitude of the write pulse affects the switching probability curve.
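The dependence on pulse magnitude and width can be sketched numerically. The snippet below is a minimal illustration, assuming the exponential law $\tau(V) = \tau_0 e^{-V/V_0}$ and a cumulative switching probability $1 - e^{-t_p/\tau(V)}$; the function names and the values of `TAU0` and `V0` are hypothetical placeholders, not parameters taken from this work or from [18].

```python
import math

# Illustrative placeholder parameters (assumptions, not measured values).
TAU0 = 1.0    # s, characteristic switching time at zero bias
V0 = 0.217    # V, voltage scale of the exponential dependence


def characteristic_time(voltage, tau0=TAU0, v0=V0):
    """Characteristic switching time tau(V) = tau0 * exp(-V / V0)."""
    return tau0 * math.exp(-voltage / v0)


def switching_probability(voltage, pulse_width, tau0=TAU0, v0=V0):
    """Probability that the device switches at some point during the pulse.

    Cumulative form of the switching law: 1 - exp(-t_pulse / tau(V)).
    """
    tau = characteristic_time(voltage, tau0, v0)
    return 1.0 - math.exp(-pulse_width / tau)


if __name__ == "__main__":
    # Sweep the pulse magnitude at a fixed 1 us width.
    for v in (2.6, 2.8, 3.0, 3.2, 3.4):
        p = switching_probability(v, pulse_width=1e-6)
        print(f"V = {v:.1f} V, width = 1 us -> P(switch) = {p:.3f}")
```

With a fixed pulse width, sweeping the magnitude moves the device from almost never switching to almost always switching; the stochastic operating regime is the voltage window in between.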
III. STOCHASTIC DEEP SPIKING NEURAL NETWORKS

A. Deep neural networks basics
Deep Neural Networks (DNNs) have multiple hidden layers of neurons between the input and output layers. For example, the deep CNN in Fig. 3 has two convolution layers, two subsampling layers, and one fully connected layer. Each convolution and fully connected layer computes a weighted sum of its inputs and passes the result through an activation function. This output is fed as an input to the next layer. Calculating the set of synaptic weight values is called training; the stochastic gradient descent method is usually used to back-propagate the error at the output and update the weight values. Typical activation functions for deep neural networks include the sigmoid, hyperbolic tangent, and rectified linear functions. The activation function for the stochastic neuron in this work is a probabilistic function, as described in the next section.
B. Stochastic neurons
Let us first consider an analog neuron with an activation function $f$. The input to the neuron is the weighted summation of the set of outputs from the previous layer. The output of the neuron can be given as
$y = f(x \cdot w)$
where $x$ is the output vector from the previous layer and $w$ is the set of synaptic weights. The output varies between 0 and 1 (assuming a sigmoid $f$). Therefore, $x \in [0, 1]^N$, with $N$ being the number of fan-in neurons. In contrast, the neurons in spiking neural networks communicate in terms of Poisson spike trains, i.e., instead of the analog input vector $x$, we have $\tilde{x}(t) \in \{0, 1\}^N$, where 1 represents a spiking event and 0 represents a non-spiking event.
In a standard spiking neuron (known as the integrate and fire neuron or leaky integrate and fire neuron), the activities at the inputs are integrated over time until the accumulated value (membrane potential) reaches a certain threshold value. Once this threshold value is crossed, the neuron will produce an output spike (neuron fires) and reset the membrane potential.
The stochastic spiking neuron discussed in this work does not temporally accumulate the spiking activities until a predefined threshold is reached. Instead, it incorporates a probability function that observes the spiking activities at the input from the pre-layer neurons during a time step, and produces a spike with a probability that depends on the weighted summation of these activities. Throughout this document, the term 'spiking neuron' refers to this type of neuron.
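The behavior of such a neuron over one time step can be sketched as follows. This is a minimal illustration, not the device model: a standard sigmoid stands in for the device probability curve, and the helper names `stochastic_neuron_step` and `poisson_spike_train` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Stand-in activation; the device curve would be used in its place."""
    return 1.0 / (1.0 + math.exp(-z))


def stochastic_neuron_step(spikes, weights, activation=sigmoid, rng=random):
    """One time step: emit a spike (1) with probability activation(w . x)."""
    net = sum(w * s for w, s in zip(weights, spikes))
    return 1 if rng.random() < activation(net) else 0


def poisson_spike_train(rate, steps, rng=random):
    """Rate-encode an analog value in [0, 1] as a binary spike train."""
    return [1 if rng.random() < rate else 0 for _ in range(steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.8, -0.4, 1.2]
    rates = [0.9, 0.2, 0.5]
    steps = 1000
    trains = [poisson_spike_train(r, steps, rng) for r in rates]
    fired = sum(
        stochastic_neuron_step([tr[t] for tr in trains], weights, rng=rng)
        for t in range(steps)
    )
    print(f"output firing rate over {steps} steps: {fired / steps:.3f}")
```

No membrane potential is carried between time steps here; each step is an independent random draw, which is the distinction from the integrate-and-fire neuron described above.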
C. ANN to stochastic SNN conversion
In this section, we elaborate how the ANN to SNN conversion is carried out and the error associated with it. A similar explanation is given in [14] for a standard sigmoidal activation function. We include the evaluation here since the activation function of interest to us is not a standard sigmoid. When converting an ANN to an SNN, the analog input of an ANN can be rate encoded as a Poisson spike train. The expected value of the input spike events can be written as (for N = 1)
where $T$ is a sufficiently large number of time steps. Let us assume a probabilistic activation function similar to the analog activation function $f$ (in other words, the ANN was trained with a function similar to the probability curve $f$ of a device). The neuron then gives an output spike with a probability defined by $f$, depending upon the input events (spiking/not spiking). When there is a spike at time $t$, $\tilde{x}(t) = 1$. The corresponding probability of getting a spike at the output is
Similarly, when there is no spike at time $t$, $\tilde{x}(t) = 0$ and the probability of getting a spike at the output is $\tilde{y}(t) = f(0)$. The expected output can be written as
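The three quantities introduced in this subsection can be written out as follows for a single input ($N = 1$) with rate $x$ and weight $w$. This is a reconstruction from the surrounding definitions rather than a quotation, and $\mathbb{E}[\cdot]$ denotes the expectation that the average over the $T$ time steps approaches.

```latex
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T}\tilde{x}(t) &\approx \mathbb{E}[\tilde{x}(t)] = x, \\
\tilde{y}(t)\,\big|_{\tilde{x}(t)=1} &= f(w), \qquad
\tilde{y}(t)\,\big|_{\tilde{x}(t)=0} = f(0), \\
\mathbb{E}[\tilde{y}(t)] &= x\,f(w) + (1-x)\,f(0).
\end{align*}
```

The last line is linear in $x$, whereas the ANN output $f(x \cdot w)$ is not; that gap is the source of the conversion error.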
Fig. 5. The crossbar structure that represents the inner product functionality between the incoming spikes and the synaptic weights.
As we explained in section II, the probability of switching a memristor with a voltage pulse of constant pulse width and the input ($x \cdot w$) encoded as the magnitude, takes the form of
Therefore the expected value of the output is
For an ideal ANN to SNN mapping (one-to-one mapping), this expected value must be similar to $f(x \cdot w)$. However, as the above equation suggests, the expected value of $\tilde{y}(t)$ is linear in the input $x$. As explained in [14], the difference between $f(x \cdot w)$ and $\tilde{y}(t)$ grows with increasing weight value (Fig. 4 (c)). However, according to the distribution of weights illustrated in Fig. 4 (b) (for a deep ANN trained with the activation function $f$), all the weight values fall within the window $|w| < 2$. We can now estimate the error when mapping the ANN to an SNN, assuming that every spiking rate $x \in [0, 1]$ is equally likely (uniform distribution). Fig. 4 (d) shows the error for weight values in the range $|w| < 2$, distributed according to Fig. 4 (b).
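The size of this mapping error can be checked with a short simulation. The sketch below assumes a single input ($N = 1$) and uses a standard sigmoid as a stand-in for $f$; the function names are hypothetical and the numbers it prints are not results from this work.

```python
import math
import random


def sigmoid(z):
    """Stand-in for the activation f (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def simulated_output_rate(x, w, f=sigmoid, steps=20000, seed=0):
    """Monte Carlo estimate of the SNN output rate for input rate x."""
    rng = random.Random(seed)
    fired = 0
    for _ in range(steps):
        spike_in = 1 if rng.random() < x else 0   # rate-encoded input
        if rng.random() < f(w * spike_in):        # stochastic neuron
            fired += 1
    return fired / steps


def expected_output_rate(x, w, f=sigmoid):
    """Closed form: x * f(w) + (1 - x) * f(0), linear in x."""
    return x * f(w) + (1.0 - x) * f(0.0)


def conversion_error(x, w, f=sigmoid):
    """Gap between the analog output f(x * w) and the expected SNN rate."""
    return abs(f(x * w) - expected_output_rate(x, w, f))


if __name__ == "__main__":
    for w in (0.5, 1.0, 2.0, 4.0):
        print(f"w = {w:.1f}: |f(xw) - E[y]| at x = 0.5 is "
              f"{conversion_error(0.5, w):.4f}")
```

The error vanishes at $x = 0$ and $x = 1$ and grows with $|w|$, which is why a weight distribution confined to $|w| < 2$ keeps the conversion loss small.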
D. Training the ANN before converting to an SNN
As mentioned in the previous section, we use an activation function similar to the switching probability curve of a memristor given by equation (8). The weight update rule should change according to this activation function. The stochastic gradient descent weight update rule is as follows
where $E$ is the cost function that must be minimized for a given input (preferably the squared error at the output), $o_j$ is the output of the $j$-th neuron, $net_j$ is the weighted summation of inputs coming into the $j$-th neuron, and $\eta$ is the learning rate.
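Written out with the symbols defined above, the update takes the usual chain-rule form of gradient descent. The index convention ($w_{ij}$ connecting neuron $i$ to neuron $j$) is an assumption made for this sketch.

```latex
\begin{equation*}
\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}}
= -\eta\,\frac{\partial E}{\partial o_j}\,
\frac{\partial o_j}{\partial net_j}\,
\frac{\partial net_j}{\partial w_{ij}}
\end{equation*}
```

Only the middle factor depends on the choice of activation function.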
The term $\partial o_j / \partial net_j$ changes according to the following equation due to our choice of device-defined activation function.
The bias values in the network are considered constant and are not updated during training. The constant value corresponds to the probability of switching at the output of the neuron during a 'no spike' event. Any output probability during a no-spike event can be selected by properly adjusting this bias value.
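A device-defined activation and its derivative might be sketched as below. This is an illustration under several assumptions: the switching probability is taken as $1 - e^{-t_p/\tau(V)}$ with $\tau(V) = \tau_0 e^{-V/V_0}$, the weighted sum is mapped linearly to a write voltage through a hypothetical `GAIN` and a fixed `V_BIAS`, and all numerical values are placeholders rather than parameters from this work.

```python
import math

# Placeholder parameters (assumptions for illustration only).
TAU0 = 1.0           # s
V0 = 0.217           # V
PULSE_WIDTH = 1e-6   # s, fixed write pulse width
V_BIAS = 3.0         # V, write voltage when the weighted sum is zero
GAIN = 0.1           # V per unit of weighted sum (hypothetical amplifier gain)


def _rate_term(net):
    """a(net) = t_p / tau(V), with V = V_BIAS + GAIN * net."""
    voltage = V_BIAS + GAIN * net
    return (PULSE_WIDTH / TAU0) * math.exp(voltage / V0)


def device_activation(net):
    """Switching probability used as the neuron activation: 1 - exp(-a)."""
    return 1.0 - math.exp(-_rate_term(net))


def device_activation_grad(net):
    """d(activation)/d(net) = (GAIN / V0) * a * exp(-a)."""
    a = _rate_term(net)
    return (GAIN / V0) * a * math.exp(-a)


def weight_update(dE_do, net, x_i, lr):
    """Gradient descent step for one weight: -lr * dE/do * do/dnet * x_i."""
    return -lr * dE_do * device_activation_grad(net) * x_i


if __name__ == "__main__":
    for net in (-5.0, 0.0, 5.0):
        print(f"net = {net:+.1f}: f = {device_activation(net):.3f}, "
              f"df/dnet = {device_activation_grad(net):.3f}")
```

The constant `V_BIAS` plays the role of the fixed bias described above: it sets the output probability when no input spike arrives and is not touched by `weight_update`.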
IV. 'ALL MEMRISTOR' STOCHASTIC SNN ARCHITECTURE
In this work, we consider a deep spiking convolutional neural network which is trained offline, using the switching probability curve explained in the previous section. All the synaptic weights are realized by the conductance of multi-level memristors. Four-bit discretization levels were used in this case [36], [37], [14]. A multi-level writing scheme was proposed in [38] for TiO$_2$ based memristors using the model in [39]; its authors claim that the system can write any number of levels given that the on/off ratio is high. The spike trains are short voltage pulses. The inner product between the incoming spikes and the synaptic weights at time $t$ can be efficiently calculated using a crossbar structure. Let $V(t) \in \{0, 1\}^N$ be the incoming voltage spikes from $N$ neurons towards the $N \times M$ crossbar ($N$ pre-layer neurons, $M$ post-layer neurons in a fully connected structure). If a conductance value in the crossbar is $G_{i,j}$, then the inner product between the voltage pulses and the conductances of the memristors connected to the $j$-th metal column is the current that flows through the $j$-th metal column itself,
$I_j(t) = \sum_{i=1}^{N} V_i(t)\, G_{i,j}$
Ideally, the above value must be converted to a proportional voltage that can bring the memristor into the 'stochastic regime' explained previously. This can be done by sending the current through a resistor and appropriately amplifying the voltage across it. However, incorporating such measuring resistors ($R_{meas}$) causes non-ideal inner products [14]. Therefore, the measuring resistance must be made considerably smaller than the resistances that emulate the synaptic functionality. A crossbar coupled with measuring resistors is shown in Fig. 5. Simple low-power amplifiers can be incorporated to amplify the voltage across the measuring resistor (Fig. 5) as required. The input impedance of the amplifier is very large, and its output impedance is small compared with the off resistance of a memristor. The output voltage of the amplifier is biased so that, during a no-spike event, it gives the same probability that the network was trained for (refer to the explanation in section III(D)). Negative weights are realized by conditionally selecting between positive and negative voltages, as shown in Fig. 5. For example, if the weight at the $(i, j)$ cross-point in the crossbar is negative, then the memristor between the $i$-th positive metal row and the $j$-th metal column is turned off (programmed to its high-resistance state), and vice versa.
Each time step of operation of the SNN architecture consists of three key tasks: write, read, and reset. The write step involves calculating the weighted addition of the spike events in a given time step using the crossbar, and applying the corresponding voltage to the memristor. A read phase is then carried out to observe whether the memristor has switched. This can be done with a resistor divider circuit, as shown in Fig. 6. If the memristor switched during the write phase, the inverter output will be low; otherwise it will be high. These spiking events can be stored in buffers until the next time step. After the read phase, if a memristor has switched during the write step, it is reset so that it can be used in the next time step. Fig. 6 shows these temporal activities.
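The effect of the measuring resistor on the inner product can be illustrated with a small circuit calculation for one crossbar column. The sketch assumes that each input has a positive row driven at `+v_spike` and a negative row driven at `-v_spike` during a spike, that rows without a spike are held at 0 V, and that the amplifier draws no current; the function names and conductance values are hypothetical.

```python
def ideal_column_voltage(spikes, g_pos, g_neg, v_spike, r_meas):
    """Ideal case: column held at 0 V, so V = R_meas * sum_i V_i * G_i."""
    current = sum(
        s * v_spike * (gp - gn) for s, gp, gn in zip(spikes, g_pos, g_neg)
    )
    return current * r_meas


def column_voltage(spikes, g_pos, g_neg, v_spike, r_meas):
    """Voltage across the measuring resistor of one crossbar column.

    Kirchhoff's current law at the column node (voltage Vc):
      sum_i G+_i (s_i*v - Vc) + sum_i G-_i (-s_i*v - Vc) = Vc / r_meas
    """
    drive = sum(
        s * v_spike * (gp - gn) for s, gp, gn in zip(spikes, g_pos, g_neg)
    )
    total_g = sum(g_pos) + sum(g_neg) + 1.0 / r_meas
    return drive / total_g


if __name__ == "__main__":
    spikes = [1, 0, 1]
    g_pos = [2e-6, 1e-9, 1e-9]   # S; a weight lives on one row of each pair
    g_neg = [1e-9, 1e-9, 3e-6]   # S; the other device of the pair is off
    for r_meas in (1e2, 1e4, 1e5):
        actual = column_voltage(spikes, g_pos, g_neg, 0.2, r_meas)
        ideal = ideal_column_voltage(spikes, g_pos, g_neg, 0.2, r_meas)
        print(f"R_meas = {r_meas:.0e} ohm: actual/ideal = {actual / ideal:.4f}")
```

The ratio of actual to ideal voltage is $1/(1 + R_{meas}\sum G)$, so the inner product stays close to ideal only while $R_{meas}$ is much smaller than the combined synaptic resistance of the column.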
V. RESULTS
In order to evaluate the functionality of the All-Memristor stochastic deep SNN, we created an algorithm-device-circuit architecture framework and tested it on a standard digit recognition data set, MNIST. The deep learning architecture we used is a convolutional neural network (28×28-6c5-2s-12c5-2s-10o [40]). The CNN structure mentioned in section III is well known for its high recognition accuracy on complex data sets, and we chose it for this work to show the applicability of the proposed devices to state-of-the-art neural networks. It is noteworthy that the proposed device architecture is applicable to any type of ANN (e.g., fully connected), since the basic computational block (calculating the weighted summation) remains the same. The CNN used in this work has two convolutional layers, each followed by subsampling. Each convolutional kernel is of size 5×5, and there are 6 and 12 feature maps at the output of the first and second convolutional layers, respectively. The input image is of size 28×28 (an image of a digit in the MNIST data set), and the pixel intensity dependent spike activity is fed to the first convolutional layer. The input spikes can also be generated by directly applying voltages to a set of memristors with an amplitude proportional to the intensity of the pixels. The memristors would then generate homogeneous Poisson spike trains with rates proportional to the intensity of the image pixels. After each convolution layer, a subsampling layer with a 2×2 kernel is present, which simply averages the spiking activity of a few neurons. A fully connected layer connects the second subsampled convolutional layer output to the network output. There are 10 output neurons to account for the 10 digits (classes) in the data set.
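The layer sizes implied by the 28×28-6c5-2s-12c5-2s-10o notation can be worked out as follows. The sketch assumes convolutions without padding and with unit stride, and non-overlapping subsampling windows; these are assumptions, since the text does not state them, and the function names are hypothetical.

```python
def conv_output_size(size, kernel):
    """Spatial size after a convolution with no padding and stride 1."""
    return size - kernel + 1


def subsample_output_size(size, window):
    """Spatial size after non-overlapping window x window subsampling."""
    return size // window


def cnn_feature_map_sizes(input_size=28):
    """Walk through 6c5 - 2s - 12c5 - 2s and record (kind, h, w, maps)."""
    layers = [("conv", 5, 6), ("sub", 2, 6), ("conv", 5, 12), ("sub", 2, 12)]
    size = input_size
    shapes = []
    for kind, k, maps in layers:
        if kind == "conv":
            size = conv_output_size(size, k)
        else:
            size = subsample_output_size(size, k)
        shapes.append((kind, size, size, maps))
    return shapes


def fully_connected_inputs(shapes):
    """Number of values feeding the final fully connected layer."""
    _, h, w, maps = shapes[-1]
    return h * w * maps


if __name__ == "__main__":
    shapes = cnn_feature_map_sizes()
    for kind, h, w, maps in shapes:
        print(f"{kind}: {h}x{w}x{maps}")
    print("inputs to the 10-neuron output layer:", fully_connected_inputs(shapes))
```

Under these assumptions the fully connected layer maps 4×4×12 = 192 subsampled activities onto the 10 output neurons, which sets the size of the last crossbar.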
The network was trained as an ANN on 60000 images of handwritten digits, with the probabilistic switching curve of a memristor as the activation function of the neurons, following the process described in section III. The stochastic memristor neuron model is built according to the set of equations given in section II.
The trained network was then tested on 10000 images of handwritten digits as a spiking neural network. Instead of evaluating the outputs of neurons as analog values, the probability of switching is determined according to the voltages applied to the memristors. These voltages depend on the weighted summation of the input spikes that go into the neuron. We observed the spiking activities at the 10 output neurons over 100 time steps (each including a write, a read, and a reset phase), and the winner is taken to be the neuron that gave the highest number of spikes during the total time interval. We obtained a classification accuracy of 97.84% with a write pulse of 10 ns, after detecting the spiking activity over 100 time steps. Fig. 7 shows the spiking activities at the 10 output neurons over 100 time steps for 5 randomly selected images from the testing data set. The accuracy we obtained for this network shows a slight degradation compared with the baseline ANN with sigmoidal activation functions, which provides an accuracy of 98.9%. This degradation is due to the circuit and device related considerations we took into account while converting the ANN to an SNN. One of the reasons is that we quantized the synaptic weights to suit the currently available multi-level memristors with 4-bit levels. Another reason is the non-ideality due to the inclusion of the measuring resistor described in section IV. The ANN to SNN conversion error (section III) contributes to the accuracy degradation as well.
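The winner selection described above amounts to counting spikes per output neuron. A minimal sketch follows; the function name is hypothetical, and breaking ties in favor of the lowest neuron index is an assumption, since the text does not specify a tie rule.

```python
def classify_from_spikes(spike_raster):
    """Pick the output neuron with the most spikes.

    spike_raster[t][k] is 1 if output neuron k spiked in time step t.
    Ties go to the lowest index (an assumption for this sketch).
    """
    counts = [sum(column) for column in zip(*spike_raster)]
    winner = max(range(len(counts)), key=counts.__getitem__)
    return winner, counts


if __name__ == "__main__":
    # Three output neurons observed over four time steps.
    raster = [
        [0, 1, 0],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 1],
    ]
    winner, counts = classify_from_spikes(raster)
    print("spike counts:", counts, "-> predicted class:", winner)
```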
In the next few subsections, we will analyze the impact of different types of variations on our All-Memristor deep SNN.
A. The impact of write time on accuracy
As explained in section II, in order to operate in the stochastic regime of a memristor, smaller write pulse widths require larger voltages and vice versa. It is however noteworthy that the switching probability curves for different pulse widths have almost the same sharpness (Fig. 2 (a)). The sharpness of the probability curve directly impacts the accuracy. Sharper curves will result in more classification errors. For example, if the network was trained with a sharper curve, a slight change in a synaptic weight value (due to weight quantization according to the multi-level memristors) will result in a large deviation at the output of the neuron to which that weight is connected. Fig. 8 shows how the classification accuracy varies with the number of time steps for different write pulse widths. A higher number of time steps results in higher accuracy, since it drives the stochastic SNN behavior closer to the analog regime. As the figure illustrates, and consistent with our argument about sharpness, we do not see any significant relationship between accuracy degradation and write pulse width. However, it must be noted that the bias voltage in the amplifier must be increased as the write pulse width is reduced. Larger voltages might damage the device and also result in larger power consumption.
Fig. 8. The accuracy variation with increasing number of steps for different write pulse widths for the All-Memristor neural network.
When the network is trained for a given write pulse width (i.e., for a particular probability curve), variations in this write pulse width during actual operation may degrade classification accuracy. To observe this, we perturbed the write pulse width by a certain percentage and checked the accuracy at the output of the network for all 10000 images in the testing data set. For a network trained for a 100 ns write pulse width, we observed only a 0.64% accuracy degradation for a 20% perturbation (i.e., a 20 ns perturbation), and a 0.79% degradation for a 50% perturbation. The same percentage perturbations were applied to a network trained assuming a 20 ns write pulse width. The degradation in accuracy was 0.93% and 1.03% for 20% and 50% perturbations in write pulse width, respectively. This indicates that the network is very robust to variations in the write pulse width.
B. The impact of synaptic weight variations
In our network, the synaptic functionality is provided by memristive crossbar arrays. Variations can be present in the memristor resistances in these crossbars for multiple reasons, including deviations introduced during programming, the effects of temperature, and temporal drift in resistance due to the small applied voltages. Since such variations are a common issue, we tested the robustness of our memristor-based SNN system to variations in synaptic weights. We perturbed all the synaptic weights obtained from our modified offline training scheme, following a Gaussian distribution with different standard deviation ($\sigma$) values. Fig. 9 illustrates how the classification accuracy changes with increasing standard deviation (expressed as a percentage of the weight). The accuracy degrades by ∼4.5% when the standard deviation is 20%. For smaller $\sigma$ values around 10%, the accuracy degradation is about 0.5%, which shows the robustness of the SNN to variations in synaptic weights. We would like to point out that neural networks are typically error resilient, which might also play a role in this robustness.
Resistance variations can occur in the neuron memristor as well. However, this does not cause any significant read error at the output, since the off to on resistance ratio is of the order of $10^4$ to $10^7$ [18] and the resistor divider circuit is capable of detecting this large drop with almost zero error. Further, as long as the amplifier output impedance is low, the write operation is not affected by variations in the neuron memristor.
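The weight perturbation used in this experiment can be sketched as below. The sketch assumes zero-mean Gaussian noise whose standard deviation is a fixed fraction of each weight's magnitude, which is one reading of "a percentage of the weight"; the function name is hypothetical.

```python
import random


def perturb_weights(weights, sigma_fraction, rng=None):
    """Add zero-mean Gaussian noise with std = sigma_fraction * |w| to each w."""
    rng = rng or random.Random()
    return [w + rng.gauss(0.0, sigma_fraction * abs(w)) for w in weights]


if __name__ == "__main__":
    rng = random.Random(42)
    weights = [0.8, -1.3, 0.05, 1.9]
    for sigma in (0.05, 0.10, 0.20):
        noisy = perturb_weights(weights, sigma, rng)
        print(f"sigma = {sigma:.0%}:", [round(w, 3) for w in noisy])
```

Because the noise scales with the weight, small weights stay small and a weight of exactly zero is left untouched, which matches a multiplicative view of programming error in the crossbar conductances.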
C. Impact of variations in the bias voltage of neurons
As elaborated in section III (D), a bias value must be selected to account for the output probability of the neuron during a non-spiking event, i.e., if no spike appears at the input of the neuron, the bias voltage is the write pulse magnitude applied to the memristor neuron. In this section, we examine the effect of variations in the bias voltage on the accuracy of the probabilistic SNN. We perturbed all the neuron bias voltages following a Gaussian distribution with standard deviations from 50 mV to 300 mV. 50 independent Monte Carlo simulations were conducted on all 10000 test images. As Fig. 10 illustrates, the classification accuracy degrades by ∼14% when $\sigma$ is increased from 50 mV to 300 mV. The impact on accuracy increases exponentially with the amount of variation in the bias voltage. For example, a 200 mV variation results in just 3% degradation in accuracy, which is more than four times smaller than the 14% degradation for a 300 mV variation. We thus conclude that the network is robust to variations in bias voltage of less than 200 mV.
D. Delay and energy consumption of the SNN
As noted in Fig. 2, to obtain the same switching probability for a memristor, shorter write pulse widths require higher write pulse magnitudes. Since the relationship between the energy and the write pulse width is not intuitive, we calculated the energy consumption of a single neuron for different write pulse widths. The results are summarized in Fig. 11. Here we assumed a spiking activity of 0.5 at the input. The results suggest that larger pulse widths result in larger energy consumption. This is due to the exponential relationship between the voltage and the pulse width: a tenfold reduction in pulse width requires only a small increase in voltage. For example, if the write pulse width must be reduced from 1µs to 100ns to achieve faster operation, the required voltage increment is just 500mV, and the energy consumption is lower than that of the memristor operating at 1µs (even though the power consumption increases).
When considering the energy consumption of the entire system per image classification, the number of time steps (write, read, and reset cycles) plays an important role. The accuracy of the network increases with the number of time steps over which the winning neuron is decided (Fig. 8). A reasonable accuracy (above 96%) can be reached within 10 time steps, as shown in Fig. 8. However, for this experiment of finding the energy of the entire system, we consider a case where the number of time steps is 50 (the classification accuracy has clearly saturated at this point).
To calculate the energy consumption of the whole network, we used SPICE simulations in IBM 45nm technology (write, read, and reset operations). We considered the spiking activities of all 10000 images in the testing data set. The crossbar voltages must be selected appropriately (depending on the type of memristors used) so that the drift in the resistance values over time is minimal. All the important parameters involved in this work are included in Table 1. The energy per image classification was 64nJ.
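The trade-off between power and energy can be checked with a back-of-the-envelope calculation. The sketch below assumes a purely resistive load with P = V²/R and E = P·t; the 3.0 V starting voltage and the 1 MΩ resistance are hypothetical placeholders, and only the 500mV increment and the two pulse widths come from the text. The actual neuron energy also includes the amplifier, which this simple model ignores.

```python
def pulse_power_and_energy(voltage, resistance, width):
    """Return (power in W, energy in J) of a rectangular pulse across a resistor."""
    power = voltage ** 2 / resistance
    return power, power * width

R_OFF = 1e6      # hypothetical off-state resistance in ohms
V_SLOW = 3.0     # hypothetical write voltage for a 1 us pulse, in volts
V_FAST = V_SLOW + 0.5  # 500 mV increment for the 100 ns pulse (from the text)

p_slow, e_slow = pulse_power_and_energy(V_SLOW, R_OFF, 1e-6)
p_fast, e_fast = pulse_power_and_energy(V_FAST, R_OFF, 100e-9)
```

Under these assumptions the shorter pulse draws more power but, being ten times shorter, dissipates less energy, which matches the trend reported in Fig. 11.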
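The role of the number of time steps in deciding the winning neuron can be sketched with a small simulation. This is an illustrative model only: each output neuron is assumed to fire independently with a fixed probability per time step, the winner is the neuron with the most spikes, and the probabilities and function names are hypothetical.

```python
import random

def decide_winner(fire_probs, time_steps, rng):
    """Count spikes per output neuron over `time_steps`; return the index with most spikes.

    Ties are broken in favor of the lowest index.
    """
    counts = [0] * len(fire_probs)
    for _ in range(time_steps):
        for i, p in enumerate(fire_probs):
            if rng.random() < p:
                counts[i] += 1
    return max(range(len(counts)), key=lambda i: counts[i])

def fraction_correct(fire_probs, true_class, time_steps, trials, rng):
    """Fraction of trials in which the winner matches `true_class`."""
    hits = sum(decide_winner(fire_probs, time_steps, rng) == true_class
               for _ in range(trials))
    return hits / trials

rng = random.Random(42)
probs = [0.3, 0.5, 0.3]  # hypothetical firing probabilities; class 1 is correct
acc_1 = fraction_correct(probs, 1, 1, 2000, rng)    # decision after 1 time step
acc_50 = fraction_correct(probs, 1, 50, 2000, rng)  # decision after 50 time steps
```

In this toy model, accumulating spikes over more time steps makes the decision far more reliable, at the cost of proportionally more write, read, and reset cycles.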
We observed that the energy of the crossbar is the dominant component. This is expected, because the number of synapses is orders of magnitude larger than the number of neurons. For example, the last fully connected layer of the network has 1920 synapses but only 10 neurons, a difference of ∼ ×200. The second dominant energy component is the reset operation. This is because the reset must be conducted in the deterministic region of operation of a memristor; that is, a high enough voltage pulse must be used to ensure that the device has turned off. Since the memristor is in its low-resistance state at this point, the energy consumption is larger for this step. To address this, feedback reset mechanisms can be incorporated [21], which may allow operation in the lower-voltage stochastic regime with feedback control circuitry that is activated conditionally. Furthermore, a novel stochastic volatile memristor has been proposed in [35] as a true random number generator. It has been shown experimentally that once this memristor is turned on, it returns to its off state after a short duration (∼ 100µs), eliminating the need for the forced reset that we propose in this work. The write voltage is also lower in this device (0.5V for a 300µs pulse) than in Ag/a-Si/p-Si (3.3V for a 1ms pulse [18]) and TiO2 devices. This may even eliminate the need for high-gain amplifiers that consume power. We thus argue that there are other types of memristor devices that can allow an energy-efficient implementation using the architecture we propose here. Our goal is to address the applicability of electric-field-driven memristors in general to deep spiking stochastic neural networks. Therefore, we conducted the energy calculations without loss of generality.
E. Comparison with existing SNN topologies
Fig. 11. (a) The neuron power consumption and (b) the energy consumption for different write pulse widths. The experiment was for an amplifier output voltage that corresponds to a switching probability of 0.5 for the selected write pulse width. Even though the power consumption reduces with increasing pulse width, the energy consumption grows.
For the purpose of comparison, we built the same CNN as a spiking network in CMOS (IBM 45nm). The neuron we designed was a digital integrate-and-fire neuron that consists of one adder and a comparator [41]. The synaptic weights were assumed to be of 4-bit precision. The weight values are initially stored in SRAM, and CACTI [42] was used to model these SRAM modules. The energy of the CMOS design, including the energy for fetching data from memory, is ∼ 736nJ. This energy number is for the same number of time steps as our proposed All-Memristor network (50 steps). It is approximately 11 times larger than the energy consumption of the proposal. However, we would like to point out that, unlike memristor neurons, CMOS neurons do not degrade over time. Even though the endurance of memristors is significant (up to 10^12 set-reset cycles), they would eventually fail before CMOS. Larger operating voltages may speed up this degradation process as well [43]. The on-off resistance ratio of a memristor changes after a certain number of write cycles, and the device may then have a switching probability curve different from the one used for training the network. This will lead to lower classification accuracy. Retraining might help regain some of the lost accuracy, but its feasibility is debatable. Our proposed design, however, shows the possibility of achieving maximum accuracy within just 10 time steps (write, read, reset cycles). This means the effective number of write-reset cycles per image classification per neuron memristor is only one (this assumes only the output neurons, where spiking usually occurs in just one output neuron and not the others, i.e., 10 write-reset cycles for 10 neurons), allowing us to classify a considerable number of images before the device fails.
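The headline figures in this comparison follow from simple arithmetic on the reported numbers, sketched below. The variable names are illustrative. The endurance estimate takes the one-cycle-per-image figure stated above at face value, so it should be read as a rough upper bound for an output neuron rather than a lifetime prediction.

```python
CMOS_ENERGY_NJ = 736.0       # reported CMOS energy per image (50 time steps)
MEMRISTOR_ENERGY_NJ = 64.0   # reported All-Memristor energy per image (50 time steps)
TIME_STEPS = 50

# Energy advantage of the All-Memristor network over the CMOS baseline.
energy_ratio = CMOS_ENERGY_NJ / MEMRISTOR_ENERGY_NJ

# Average energy per time step of the All-Memristor network, in nJ.
energy_per_step_nj = MEMRISTOR_ENERGY_NJ / TIME_STEPS

# Rough endurance estimate: images classified before an output-neuron
# memristor reaches its set-reset limit, assuming one cycle per image.
ENDURANCE_CYCLES = 1e12
CYCLES_PER_IMAGE = 1
images_before_wearout = ENDURANCE_CYCLES / CYCLES_PER_IMAGE

print(f"energy ratio: {energy_ratio:.1f}x")
print(f"energy per time step: {energy_per_step_nj:.2f} nJ")
print(f"images before wear-out: {images_before_wearout:.0e}")
```

The ratio of 736nJ to 64nJ is 11.5, consistent with the stated improvement of approximately 11 times.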
VI. CONCLUSION
Memristive switching devices have been shown to be promising candidates for an enormous array of applications including logic, memory, and neuromorphic computing. However, their inherent stochasticity has given rise to reliability concerns. Numerous mechanisms involving larger write pulse widths, larger operating voltages, or feedback architectures have been proposed to drive these highly stochastic devices into their deterministic operating regime. As a result, we pay in terms of larger power consumption. This work explores an avenue where such stochasticity is embraced rather than eliminated, with the goal of reducing power consumption. The proposal is to embed the functionality of a stochastic neuron in a memristor while representing the synaptic weights by a memristive crossbar, building the "all-memristor stochastic deep spiking neural network".
We tested the functionality of the network using the MNIST handwritten digit data set and observed a very low accuracy degradation (∼ 1%) when compared with the deep ANN baseline. The design space of the network was estimated by applying variations, and we observed that our proposal is robust to variations in the synaptic weights (σ < 20%), neuron bias voltages (< 200mV), and the write time durations (∼ 50% of the pulse widths). Intuitively, the accuracy of the network increases with the number of time steps over which the winner neuron is decided. Our proposal shows that the accuracy reaches its maximum value within a very low number of time steps, such as 10. The steepness of the activation function of a neural network affects the accuracy of the output and makes it less robust to variations. The constant steepness of the switching probability curve of a memristor (the probabilistic activation function) over different write times gives more flexibility for the memristor to be utilized in platforms with different speed limits without any accuracy degradation.
Smaller write pulse widths require larger voltages to bring the memristor to its stochastic region of operation. However, the required increment in voltage magnitude to operate 10 times faster is small (< 500mV) leading to lower energy consumption at the neuron in fast operating platforms. Furthermore, the total energy consumption of the proposal is approximately 11 times smaller than that of the digital CMOS counterpart.
