Current large-scale implementations of deep learning and data mining require thousands of processors, massive amounts of off-chip memory, and consume gigajoules of energy. New memory technologies, such as nanoscale two-terminal resistive switching memory devices, offer a compact, scalable, and low-power alternative that permits on-chip colocated processing and memory in fine-grain distributed parallel architecture. Here, we report the first use of resistive memory devices for implementing and training a restricted Boltzmann machine (RBM), a generative probabilistic graphical model as a key component for unsupervised learning in deep networks. We experimentally demonstrate a 45-synapse RBM realized with 90 resistive phase change memory (PCM) elements trained with a bioinspired variant of the contrastive divergence algorithm, implementing Hebbian and anti-Hebbian weight updates. The resistive PCM devices show a twofold to tenfold reduction in error rate in a missing pixel pattern completion task trained over 30 epochs, compared with untrained case. Measured programming energy consumption is 6.1 nJ per epoch with the PCM devices, a factor of ∼150 times lower than the conventional processor-memory systems. We analyze and discuss the dependence of learning performance on cycle-to-cycle variations and number of gradual levels in the PCM analog memory devices.
Training a Probabilistic Graphical Model With
Resistive Switching Electronic Synapses I. INTRODUCTION D EEP learning can extract complex and useful structures within high-dimensional data, without requiring significant amounts of manual feature engineering [1] . It has made significant advances in recent years and is shown to outperform many other machine learning techniques for a variety of tasks, such as image recognition, speech recognition, natural language understanding, predicting the effects of mutations in DNA, and reconstructing brain circuits [2] . However, training of large-scale deep networks (∼10 9 synapses, compared with ∼10 15 synapses in human brain) in today's hardware consumes more than 10 GJ (estimated) of energy [3] , [4] . An important origin of this energy consumption is the physical separation of processing and memory, which is exacerbated by the large amounts of data needed for training deep networks [1] - [5] . It has been reported that ∼40% of energy consumed in general-purpose computers are due to the off-chip memory hierarchy [6] , and this fraction will increase when applications are more data-centric [7] . GPUs do not solve this problem, since up to 50% of dynamic power and 30% of overall power are consumed by off-chip memory as shown in several benchmarks [8] . On-chip SRAM does not solve the problem either, since it is very area inefficient (>100 F 2 , F being the minimum half-pitch allowed by the considered lithography) and cannot scale up with system size.
Extracting useful information from data, which requires efficient data mining and (deep) learning algorithms, is becoming increasingly common in consumer products such as smartphones, and is expected to be even more important for the Internet-of-Things [9] , where energy efficiency is especially crucial. To scale up these systems in an energy-efficient manner, it is necessary to develop new learning algorithms and hardware architectures that can capitalize on fine-grained on-chip integration of memory with computation.
Because the number of synapses in a neural network far exceeds the number of neurons, we must pay special attention to the power, device density, and wiring of the electronic synapses, for scaled-up systems that solve practical problems. Today, synaptic weights in both conventional processors and neuromorphic processors [10] - [12] are currently implemented in SRAM and/or DRAM. Due to processing limitations, DRAM needs to be on a separate chip or connected by chip stacking using through-silicon-via [13] - [15] that has limited via density. This results in increased power consumption and limited bandwidth for memory accesses. SRAM, on the other hand, occupies too much area (>100 F 2 per bit, 58 800 nm 2 in 14-nm node CMOS technology [16] ) and thus limits the amount of local memory that can be accessed efficiently [11] . "New" nonvolatile resistive memory elements [17] , such as phase change memory (PCM) [18] , resistive switching memory (RRAM) [19] , conductive bridge memory (CBRAM) [20] , and ferroelectric memory [21] , have characteristics that are desirable as electronic synapses. These are two terminal devices with very good size scalability (PCM: 1.2 nm, RRAM: 3 nm, and CBRAM: 20 nm demonstrated, corresponding to an area of 5.8-400 nm 2 ), low energy programmability (PCM: < 2 pJ, RRAM: 0.4 pJ, and CBRAM: 0.7 pJ per bit demonstrated), and analog programmability for implementing synaptic weight within a single device [22] , [23] . Monolithic 3-D integration of these nonvolatile memories with CMOS, demonstrated in [24] and [25] , allows designers to hide the logic circuitry underneath multiple layers of synapses, reducing silicon cost and increasing synapse density [4] . Gradual resistance change of these devices has been utilized as a synapse for variety of algorithms [26] - [33] . These devices can implement variations of biological [18] learning rules within a single device, further suggesting their use for neuromorphic hardware.
Supervised learning refers to finding the right model with labeled data, whereas unsupervised learning refers to fitting the right model to the underlying probability distribution of data while discovering useful features. Unsupervised learning has played a crucial role to revive interest in deep learning [2] and is expected to become far more important in the future due to: 1) exponentially increasing amount of data that is not labeled; 2) inherently unsupervised nature of learning in humans; and 3) improved generalization with unsupervised pretraining observed in some experiments [34] . Unsupervised learning using restricted Boltzmann machine (RBM) is an important element of deep neural networks for successful generalization especially in environments, where huge amount of labeled data are not available [1] , [34] , [35] . RBMs trained with contrastive learning rules are the backbone of latest research in physical instantiations of deep learning [36] . A probabilistic generative model was recently demonstrated to produce much richer representations of data with less examples than purely supervised deep learning models [37] . RBM is a two-layer probabilistic generative model that efficiently represents the underlying probability distribution of data in distributive fashion, where one layer consists of visible neurons that is associated with observations, and another layer consists of hidden nodes (neurons) [1] . Contrastive divergence (CD) algorithm [38] is a common technique for training RBMs, and is also biologically plausible [39] .
This paper presents a first demonstration and analysis of CD learning in an RBM with nanoscale electronic synapses. RBM is able to learn 3 × 3 images and retrieve a missing pixel with more than 80% probability of success when at most five patterns are stored (higher probability of success for less patterns stored). Further improvements in learning performance can be obtained by more precise control of PCM conductance change through device engineering or better programming schemes.
II. EXPERIMENT AND RESULTS
The fabricated array of PCM devices is shown in Fig. 1(a) . The wordline pitch and bitline pitch of the 10 × 10 array are 1.4 and 1.2 μm, respectively. Phase change elements consist of mushroom GST (Ge 2 Sb 2 Te 5 ) cells in series with 180-nm CMOS n-channel transistors [40] [ Fig. 1(b) ]. Fig. 1 (c) shows that dc switching from high resistance state to low resistance state occurs at around 2 μA. Fig. 1(d) shows the resistance distribution of cells within a 10×10 array for binary switching. Fig. 1(e ) and (f) shows binary switching and gradual switching behavior, respectively. As we show later, the network is quite robust to device-to-device and cycle-to-cycle variations observed in gradual switching.
Single-layer RBM with nine visible nodes and five hidden nodes is implemented on PCM array [ Fig. 2(a) ]. Neurons have binary activations in an RBM (0 or 1). RBM parameters encode probability of binary input vectors v as Nine input neurons represent each pixel in a 3 × 3 image for the task we performed in this paper. Since SET and RESET rates are very asymmetrical for PCM devices, learning cannot be efficiently realized with single cell synapses [29] . Hence, we chose to use two PCM devices per synaptic weight [29] [see Fig. 2 
Parameters consist of single-node biases for visible (nine parameters) and hidden (five parameters) nodes, as well as pairwise synaptic weights (45 parameters), as shown in Fig. 2(a) , adding up to 59 parameters used to represent the probability distribution of input data that consists of binary vectors with length 9.
In order to fit the parameters on the 10 × 10 array, we only treat pairwise weights as variables. We fix visible and hidden node biases at 0 and do not train them. The 0 bias corresponds to the assumption that each node is equally likely to be ON or OFF with no information from other nodes, which is indeed 
where G + and G − are the conductances of each of the two differential PCM cells within the 2-PCM synapse. Weights are mapped on the network as w i j = (G i j − M)/S, where G is the effective synaptic conductance as described earlier, and the constants M and S are the mean and standard deviation of initial resistance values. M and S are determined after initialization phase, and stays the same during training and for the whole network (see Table I ). The initialization procedure starts with applying a strong RESET pulse for each PCM device. After initialization, training proceeds as described in Fig. 3 , using the bars-and-stripes data set in Fig. 3(b) [41] .
The CD algorithm updates the parameters w i j as follows to maximize data log likelihood: [38] . Here, v i h j data is the average (over the data) of pairwise interaction of activations of each visible and hidden neuron, and v i h j model is the average of the same quantity, but taken over the probability distribution represented by the model during that iteration before the weight update. A cycle that consists of updating the weights after presenting all images (or a subset of images selected as data set) in Fig. 3(b) is called an epoch, or iteration. In one iteration, the term v i h j data is calculated, as shown Fig. 3 (a) and Table I . v i h j is calculated for each data vector by sampling hidden neurons and is averaged for all data set. This is done by measuring the total current through the source node for each hidden neuron when the corresponding read voltage (0.1 V) is applied at the bitline (depending on which visible nodes are ON) and corresponding wordline (depending on which hidden neuron input is being measured). Output of each hidden neuron is then 1 with a probability given in Fig. 3(a) , and 0 otherwise. Exact calculation of the term v i h j model is intractable, so approximation is made by Gibbs sampling, which is shown to be very efficient for RBM and empirically give good results with CD training [1] , [38] . For Gibbs sampling, input image is presented, and top-down iterations are performed k-times (we use k = 3), where the input is removed after it is presented at the beginning [1] , [38] . During these iterations, visible node outputs (from top to bottom) and hidden node values (from bottom to top) are calculated probabilistically, using the conductance values of PCM devices [see Fig. 3 (a)]. Calculation of v i h j model and v i h j data is performed by software on a computer, while the PCM hardware serves as electronic synapses. Once v i h j data and v i h j model are calculated, depending on which term is bigger for each synaptic weight, either G + or G − device is applied gradual SET pulse [see Fig. 2 (c) and Table I ], overall resulting in 45 gradual SET operations in one update phase.
We monitor the training procedure by tracking the Kullback-Leibler (KL) divergence [42] between the probability distribution of the data set and the probability distribution represented by the network at every iteration. KL divergence is a similarity metric between two probability distributions, and smaller KL divergence means that the probability distribution represented by the network can model the probability distribution of the original data more accurately. Computing KL divergence exactly is in general intractable, but can be efficiently and relatively accurately approximated using annealed importance sampling (AIS) [43] , and we use AIS here to estimate the KL divergence. Fig. 4(a) shows that the evolution of PCM conductances starts to saturate after the 10th epoch, and only minor changes to conductances are observed from 10th until the 20th epoch. This causes the KL divergence to improve overall until the 10th epoch. Toward the 10th epoch, the improvement slows down, and toward the 20th epoch, the progress starts reverting, more severely when the data set (the patterns stored in network with training) contains larger number of distinct patterns. The fluctuations in KL divergence starting toward 10th epoch and the accompanying slowdown of improvement in KL divergence can be explained by the observation that for a given device, the gradual resistance change starts to become small compared with the cycle-to-cycle variation in Fig. 1(f) . The onset of unlearning (increase in KL divergence) toward 20th epoch happens when for sufficiently large number of synapses, either the G + or the G − device saturates. When G + saturates, for instance, in the subsequent iterations, it cannot respond to the change in G − , degrading the learning feedback loop and causing "unlearning" ("forgetting") as training progresses further.
The 2-PCM synapse hardware implementation is compared with a computer simulation of the CD algorithm, where the synaptic weight is stored in the form of double-precision floating point numbers (labeled as "64-b synapse"). It is interesting to note that 2-PCM synapse results in better learning until 10th epoch, but the 64-b synapse results in lower KL divergence after the 10th epoch, as shown in Fig. 4(b) . This is due to the onset of saturation effect within PCM devices described earlier, whereas 64-b synapse has very high precision and does not saturate. Fig. 4(c) shows some inference tests where the probability of the value of missing pixel estimated by the network before and after training is given for some cases. In all these cases, the RBM performs significantly better after training compared with before training, predicting the right combinations after training. Note that even when two pixels are missing [see Fig. 4(c) ], the network can predict the right combination among three other combinations after training, while its prediction was wrong before training. We measure ∼3.2 nJ/epoch total energy consumed during training within the PCM devices in the array. This is much less than the energy consumption estimated for a conventional computer performing the same task, as described later.
We further study the relationship between the number of iterations and the number of patterns stored in the network on the correct recovery of a missing pixel (error rate). Analysis of error rate is more applicable to practical cases than KL divergence, although KL divergence gives a good indication of training performance. We performed five trials for each of different cases where different numbers of patterns are stored in the network. During inference, all cases of one missing pixel are evaluated (nine cases for one trial) and the probabilities are averaged for each case. For each number of stored patterns and for each trial, a subset of training set shown in Fig. 3(b) is randomly selected with the number of elements of the subset being equal to the number of stored patterns. As Fig. 5 For both cases, we used the same learning rate for weight update equations. It is worth noting that one can find an optimum learning rate for 64-b synapse, such that it starts performing better than 2-PCM synapse before 30th epoch. Recovery error rate for 64-b synapse, although worse than 2-PCM synapse for the first 30 epochs in our experiment, continues improving until 70th epoch, and becomes better than the best case of 2-PCM synapse. The 2-PCM synapse is preferable for applications where some performance can be traded off in favor of huge energy savings (see Section III).
III. DISCUSSION
The fluctuations in Fig. 4(b) for the 2-PCM synapse case compared with 64-b synapse case can be attributed to: 1) limited control over the conductance change of PCM devices from cycle-to-cycle and 2) implementing signed constant-amplitude updates rather than the full gradient in Fig. 3(a) . It is expected that achieving more gradual levels through device engineering [44] - [46] or a different programming scheme [44] can lead to better results in training. A better control of gradual conductance change also helps mitigate the impact of device-to-device variations during training, since the weight updates would take care of this type of variation through changing the weight accordingly [4] . In fact, deviceto-device variations intrinsically allow symmetry breaking at the initialization stage which is needed for initializing weights in neural networks before training [1] . For conventional hardware, this is done by initializing the weights by assigning a value from Gaussian distribution with mean 0. On the other hand, very large device-to-device variations at the initialization stage can be harmful for convergence due to excessive variation of initial weight values [1] . In this experiment, we scaled the conductances by the standard variation to obtain the weight values (see Table I ) to avoid very large resistance deviations at the initialization. This can be implemented on circuit level either by using an appropriate voltage level for read pulses or configurable neuron parameters.
Energy consumed by the wires is only a few picojoules, so it is negligible compared with 3.2 nJ/epoch programming energy consumed within PCM. However, wire energy is important for larger arrays, hence should be considered when scaling up in system size [22] . For a 1k × 1k array, wires consume from 30 pJ for 20-nm [22] . For comparison, our devices consume ∼72 pJ in a partial SET for this experiment. To compare the energy efficiency of our approach with a conventional hardware, we estimate synaptic energy consumption for the same training procedure that is run on conventional hardware. For a direct comparison, we only compare the energy consumption within the synaptic weights, and we do not include the energy consumption due to computation within the neurons. We assume that five patterns are stored during training, and we measure the average energy consumption over five trials with PCM hardware. Using PCM hardware, average energy consumption due to programming is 3.2 nJ/epoch. Energy consumption due to read is much larger, since the integration time of our equipment is 80 ms. For a fair comparison, we assume 50-μs integration time, which is the specification of a commercially available current-input ADC with 20-b resolution [47] . Note that we do not include power consumption of the ADC, since the optimum design of neuron circuits and periphery is outside the scope of this paper. This results in 2.9 nJ/epoch energy consumption due to read on average (see the Appendix for details). Read energy also depends on the resistance distribution of PCM cells across the array, higher resistance being more favorable. As a result, read energy/epoch increases over time as the device resistances decrease due to gradual SET operations. Overall, training and inference results in 6.1 nJ/epoch energy consumed within PCM cells. This task corresponds to the following for a conventional hardware. 1) Read weights from nonvolatile storage (here, assumed on-chip as described in the following) into processor (assume no DRAM). 2) Perform seven matrix-vector multiplications to perform KL divergence shown in Fig. 3(a) . 3) Perform matrix addition for weight updates. 4) Write back the result into nonvolatile storage. Note that in a realistic scenario, we might not update weights very often; hence, keeping the updated weights in nonvolatile storage is desired. For a fair comparison, we assume PCM that is monolithically integrated on top of processor for nonvolatile memory access, which is recently proposed as the next-generation hardware for data-abundant applications [6] . With the characteristics of PCM used in our experiment, and reported energy consumption of vector operations in Intel Xeon Phi processor (in 22-nm technology node), we estimate 910 nJ energy for the equivalent synaptic operations performed on PCM in our experiment (430 nJ within logic and 480 nJ for memory access; not accounting for energy consumption of the memory access circuits and sense amplifiers for a fair comparison). Hence, our hardware consumes ∼150× lower energy per epoch on average, compared with the next-generation conventional hardware with on-chip memory. Assuming 16-b synapses for digital conventional hardware and assuming 16-b ADC used in neuromorphic PCM hardware (20-μs read time [48]) gives 230 nJ for conventional hardware and 4.4 nJ for PCM hardware for synaptic operations. We further estimate energy consumption for PCM-based neuromorphic hardware and conventional hardware using data obtained from large-scale PCM array measurements in the literature. An estimate using the characteristics of a 1-Gb PCM array in 45-nm node [49] results in 590 nJ for conventional hardware, and 19 nJ for PCM hardware, translating to >30× energy savings in favor of PCM neuromorphic hardware. Big energy savings here come from tightly coupled memory and processing, and avoiding data movements between memory and processing elements. Details of these estimates are given in the Appendix. These estimates should provide a perspective for comparison between analog resistive memory-based neuromorphic hardware and conventional hardware. Table II shows the comparison of this paper with the previous demonstrations of learning using resistive memory-based synapses. Compared with the previous works, our approach is unique in that it combines: 1) representational power of probabilistic graphical models and 2) biologically plausible learning rule. Although the prototype shown in this paper is small with respect to targeted applications, it can be scaled to larger sizes if the saturation effect can be mitigated. Apart from solutions that involve device engineering or modification of programming scheme as discussed earlier, refresh procedure can be used to mitigate the saturation effect in PCM [30] . This method requires that resistance values should be refreshed periodically to high-resistance regime, while keeping the synaptic strength the same within some margin. Studies have shown that this method mitigates saturation effect with sufficiently low refresh periods [30] , although more experimental data and studies on the overhead of refresh operation are needed. This refresh scheme might not be necessary if the device can exhibit bidirectional gradual resistance change as opposed to unidirectional resistance change observed in this experiment.
In this paper, the quantity R OFF /R ON is between 10 and 100, as observed in Fig. 1(f) , under the pulsing conditions used in the experiment. It was reported elsewhere that when two cells are used in differential mode to implement a synaptic weight as in this paper, R OFF /R ON requirement relaxes substantially, since the dynamic range of synaptic weights (G + − G − ) can cover very small weights (as small as the precision of gradual resistance change allows) without the need to have R OFF to be much bigger than R ON [4] . It is important to note that when synaptic weights consist of two memory cells in differential mode, R OFF /R ON needed is a function of: 1) the precision of the control of gradual resistance change and 2) the absolute value of the ratio of largest and smallest weights for the network that can model the data distribution well. For larger networks, the second item might have a different requirement, but a ratio of 500 for R OFF /R ON was reported to give good results in large-scale network simulations with realistic memory device models [4] . It is also possible that the finite R OFF /R ON might act as a regularizer during training by limiting the dynamic range and can be useful for better generalization after learning [1] , but this requires further analysis of devicealgorithm interactions.
In the small-scale array used in our experiment, yield was not an issue and we observed full yield, mainly because: 1) the small size of the array and 2) devices that would be considered as failed to be used for binary memory due to very low R OFF /R ON ratio can still function as an analog memory device for the purposes of learning, to the extent that it can exhibit sufficiently good gradual resistance change behavior. We expect yield to be a more important issue for practical larger-scale networks, and more studies regarding the effect of yield on learning performance are needed.
While in this paper we evaluate the performance of RBM individually, RBM is very commonly used as a pretraining method for deep networks, where weights are fine-tuned using backpropagation algorithm following RBM training [35] . For this type of application, the performance of RBM should be evaluated together with the backpropagation phase following RBM training.
IV. CONCLUSION
We demonstrate a proof-of-concept implementation of a probabilistic graphical model (RBM) using 45 synapses implemented with 90 PCM elements trained with CD. We monitor learning through KL divergence as well as inference tests when 1-2 pixels are missing and error rate of recovery when a pixel is missing. Synaptic operations consume 6.1 nJ/epoch, compared with an estimated 910 nJ/epoch for the state-ofthe-art conventional processor. Fluctuations in the learning progress are observed when conductance change control is not good enough to overcome cycle-to-cycle variations. This paper reveals the opportunities in utilizing emerging nonvolatilememory devices for probabilistic computing, and provides useful guidelines for refining the programming schemes and device characteristics for efficient learning.
APPENDIX
For binary switching measurement in Fig. 1(d) and (e), alternating SET pulses (1.5 V amplitude and 100 ns/800 ns/2 μs rise/width/fall time) and RESET pulses (2.5 V amplitude and 5/50/5 ns rise/width/fall time) are applied at the wordline, and a wider pulse that overlaps SET and RESET pulses in time is applied on the bitline (10/1000/10 μs rise/width/fall time and 3.3 V amplitude). For gradual SET [ Fig. 1(f) and learning phase during the experiment], a gradual SET pulse (10/50/10 ns rise/width/fall time and 1.1 V amplitude) is applied at the wordline, and a wider bitline pulse is applied as above. RESET pulse with the same specifications above is applied to initialize the resistance values of all PCM cells in the 10 × 10 array.
We estimate energy consumption for a similar task implemented on a Xeon Phi processor [50] with integrated digital RRAM (hypothetical) instead of off-chip RAM. In CD, a pass from visible to hidden nodes is a multiplication of two matrices A × B, where A has rows equal to number of data vectors stored (five in the example), and with columns equal to number of visible neurons (nine), and B has number of rows equal to number of visible neurons (nine) and number of columns equal to number of hidden neurons (five). This corresponds to 73.125 vector operations on average, which results in 73.125 nJ energy consumption, where each vector operation consumes 1 nJ [50] . Similarly, a pass from hidden layer to visible layer corresponds to 45 vector operations on average, which consumes 45 nJ. For simplicity, we assume that a vector addition consumes the same amount of energy with vector multiplication in Intel Xeon Phi, since [50] only reports the energy for vector addition and not multiplication. We further assume that synaptic weights in conventional hardware are 64-b double precision floating points. Overall, one iteration of CD update has four visible-to-hidden pass and three hiddento-visible passes, resulting in 427.5 nJ energy consumption. At the end, weight update is performed by adding W to W (W is the weight matrix), which is another six vector addition, consuming 6 nJ. For estimates using 1-Gb PCM array [49] , we extract the SET and RESET voltage and current from the figures in [49] as 1.8 V/100 μA for SET and 2.2 V/200 μA for RESET for calculating energy due to SET and RESET. Since the pulsewidth for programming pulses is not reported in [49] , we assume a RESET time of 50 ns, SET time of 400 ns (taken from [51] ), and read time of 20 ns for calculating energy values. For calculating mean conductance to use in read energy calculation, we used the equation (1/R low + 1/R high )/2, where R low and R high are 10 k and 2 M [49] .
