Modern machine learning is based on powerful algorithms running on digital computing platforms and there is great interest in accelerating the learning process and making it more energy efficient. In this paper we present a fully autonomous probabilistic circuit for fast and efficient learning that makes no use of digital computing. Specifically we use SPICE simulations to demonstrate a clockless autonomous circuit where the required synaptic weights are read out in the form of analog voltages. Such autonomous circuits could be particularly of interest as standalone learning devices in the context of mobile and edge computing.
I. INTRODUCTION
Machine learning, inference and many other emerging applications 1 make use of stochastic neural networks comprising (1) a binary stochastic neuron (BSN) 2,3 and (2) a synapse that constructs the inputs $I_i$ to the $i$th BSN from the outputs $m_j$ of all other BSNs.
The output $m_i$ of the $i$th BSN fluctuates between +1 and -1 with a probability controlled by its input
$$m_i(t + \tau_N) = \mathrm{sgn}\left[\tanh\left(I_i(t)\right) - r\right] \tag{1}$$
where $r$ represents a random number in the range [−1, +1], and $\tau_N$ is the time it takes for a neuron to provide a stochastic output $m_i$ in accordance with a new input $I_i$. Usually the synaptic function $I_i(\{m\})$ is linear and is defined by a set of weights $W_{ij}$ such that
$$I_i(t + \tau_S) = \sum_j W_{ij}\, m_j(t) \tag{2}$$
where $\tau_S$ is the time it takes to recompute the inputs $\{I\}$ every time the outputs $\{m\}$ change. Typically Eqs. (1),(2) are implemented in software, often with special accelerators for the synaptic function using GPUs/TPUs 4,5 . The time constants $\tau_N$ and $\tau_S$ are not important when Eqs. (1),(2) are implemented on a digital computer using a clock to ensure that neurons are updated sequentially and the synapse is updated between any two updates. But they play an important role in the clockless operation of autonomous hardware that makes use of the natural physics of specific systems to implement Eqs. (1),(2) approximately. For example, Eq. (1) can be implemented using stochastic magnetic tunnel junctions (s-MTJs) 6,7 , while resistive or capacitive crossbars can implement Eq. (2) 8 .
Eqs. (1),(2) can be used for inference, whereby the weights $W_{ij}$ are designed such that the system has a very high probability of visiting configurations defined by $\{m\} = \{v\}_n$, where $\{v\}_n$ represents a specified set of patterns. However, the most challenging and time-consuming part of implementing a neural network is not the inference function, but the learning required to determine the correct weights $W_{ij}$ for a given application. This is commonly done offline using powerful cloud-based processors, and there is great interest in accelerating the learning process and making it more energy efficient so that it can become a routine part of mobile and edge computing.
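The interplay of the neuron and synapse equations can be illustrated with a small software emulation. The following is a minimal sketch, not the circuit described in this paper: it assumes a clocked, sequential update order (which a digital computer enforces and the autonomous hardware does not), a uniformly distributed random number r, and a hypothetical two-neuron weight matrix chosen only for illustration.

```python
import math
import random


def bsn_update(I_i, rng):
    """Eq. (1): m_i = sgn[tanh(I_i) - r], with r drawn uniformly from [-1, +1)."""
    r = rng.uniform(-1.0, 1.0)
    return 1 if math.tanh(I_i) - r >= 0 else -1


def synapse(W, m, i):
    """Eq. (2): I_i = sum_j W_ij * m_j (self-coupling is excluded)."""
    return sum(W[i][j] * m[j] for j in range(len(m)) if j != i)


def sample(W, n_sweeps, seed=0):
    """Sequentially update every neuron n_sweeps times and return the
    time-averaged correlations <m_i m_j>."""
    rng = random.Random(seed)
    n = len(W)
    m = [rng.choice((-1, 1)) for _ in range(n)]
    corr = [[0.0] * n for _ in range(n)]
    for _ in range(n_sweeps):
        for i in range(n):
            m[i] = bsn_update(synapse(W, m, i), rng)
        for i in range(n):
            for j in range(n):
                corr[i][j] += m[i] * m[j] / n_sweeps
    return corr


# Hypothetical example: two neurons with a positive, symmetric coupling.
W = [[0.0, 1.0], [1.0, 0.0]]
corr = sample(W, n_sweeps=20000)
print(corr[0][1], math.tanh(1.0))  # the sampled value should lie close to tanh(W_01)
```

For two symmetrically coupled neurons the sampled correlation approaches tanh(W_01), which gives a quick sanity check that the update rule draws samples with the intended statistics.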
In this paper we present a new approach to the problem of fast and efficient learning that makes no use of digital computing at all. Instead it makes use of the natural physics of a fully autonomous probabilistic circuit composed of standard electronic components like resistors, capacitors and transistors along with stochastic magnetic tunnel junctions (MTJs).
We focus on a fully visible Boltzmann machine (FVBM), a form of stochastic recurrent neural network, for which the most common learning algorithm is based on the gradient ascent approach to optimizing the maximum likelihood function 10, 11 . We use a slightly simplified version of this approach, whereby the weights are changed incrementally according to
$$W_{ij}(t + \Delta t) = W_{ij}(t) + \epsilon \left[ v_i v_j - m_i m_j - \lambda W_{ij}(t) \right] \tag{3}$$
where $\epsilon$ is the learning parameter, $\lambda$ is the regularization parameter 12 and the bipolar training vectors $\{v\}_n$ are cycled sequentially or as an average. The term $v_i v_j$ is the (average) correlation between the $i$th and the $j$th entry of the training vector $\{v\}_n$. The term $m_i m_j$ corresponds to the sampled correlation taken from the model's distribution. The advantage of this network topology is that the learning rule is local and tolerant to stochasticity.
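A discrete-time software version of this learning rule can be sketched as follows. This is an illustrative emulation under stated assumptions, not the autonomous circuit: the values of `eps` and `lam`, the three-neuron training set and the one-sweep-per-update schedule are all hypothetical choices, and the training correlations are fed in as an average over all training vectors.

```python
import math
import random


def learn_fvbm(patterns, eps=0.01, lam=0.01, n_steps=5000, seed=0):
    """Incremental rule W_ij += eps * (<v_i v_j> - m_i m_j - lam * W_ij)."""
    n = len(patterns[0])
    rng = random.Random(seed)
    # Average data correlation <v_i v_j> over all training vectors.
    vv = [[sum(p[i] * p[j] for p in patterns) / len(patterns)
           for j in range(n)] for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    m = [rng.choice((-1, 1)) for _ in range(n)]
    for _ in range(n_steps):
        # One sequential sweep of the neurons; the chain is never reinitialized.
        for i in range(n):
            I_i = sum(W[i][j] * m[j] for j in range(n) if j != i)
            m[i] = 1 if math.tanh(I_i) - rng.uniform(-1.0, 1.0) >= 0 else -1
        # Weight update from the current sample, kept symmetric.
        for i in range(n):
            for j in range(i + 1, n):
                W[i][j] += eps * (vv[i][j] - m[i] * m[j] - lam * W[i][j])
                W[j][i] = W[i][j]
    return W


# Hypothetical training set: a pattern and its mirror image.
W = learn_fvbm([(1, 1, -1), (-1, -1, 1)])
print(W[0][1], W[0][2], W[1][2])  # expected signs: positive, negative, negative
```

The regularization term keeps the weights bounded: growth stops once the sampled correlation is within `lam * W_ij` of the data correlation.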
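For our autonomous operation we replace this equation with its continuous-time version ($\tau_L$: learning time constant)
$$\frac{dW_{ij}}{dt} = \frac{v_i v_j - m_i m_j - \lambda W_{ij}}{\tau_L}$$
which is realized by a capacitor $C$ charged by a voltage source $(V_{v,ij} - V_{m,ij})$ with a series resistance $R$ (Fig. 1):
$$C \frac{dV_{ij}}{dt} = \frac{V_{v,ij} - V_{m,ij} - V_{ij}}{R} \tag{4}$$
From Fig. 1 and comparing Eqs. (3),(4) it is easy to see how the weights and the learning and regularization parameters are mapped into circuit elements: $W_{ij} = A_v V_{ij}/V_0$, $\lambda = V_0/(A_v V_{DD}/2)$ and $\tau_L = \lambda R C$, where $A_v$ is the amplification factor of the synapse, $V_0$ the reference voltage of the BSN and $V_{DD}$ the supply voltage (see Section II). For proper operation the learning time scale $\tau_L$ has to be much larger than the neuron time $\tau_N$ to be able to collect enough statistics throughout the learning process.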
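The mapping between the learning rule and the circuit elements can be worked out in a few steps. The derivation below is a sketch under two assumptions: the capacitor voltage obeys the standard RC charging equation for a source $V_{v,ij} - V_{m,ij}$ in series with $R$, and the XNOR output is level-shifted so that $V_{m,ij} = m_i m_j V_{DD}/2$, mirroring $V_{v,ij} = v_i v_j V_{DD}/2$. It also uses the relation $W_{ij} = A_v V_{ij}/V_0$ given in Section II.

```latex
\begin{align}
RC\,\frac{dV_{ij}}{dt}
  &= V_{v,ij} - V_{m,ij} - V_{ij}
   = \frac{V_{DD}}{2}\left(v_i v_j - m_i m_j\right) - \frac{V_0}{A_v}\,W_{ij} \\
\frac{dW_{ij}}{dt}
  &= \frac{A_v}{V_0}\,\frac{dV_{ij}}{dt}
   = \frac{A_v V_{DD}}{2 V_0 RC}
     \left[v_i v_j - m_i m_j - \frac{2 V_0}{A_v V_{DD}}\,W_{ij}\right] \\
\Rightarrow\quad
\lambda &= \frac{2 V_0}{A_v V_{DD}}, \qquad \tau_L = \lambda R C
\end{align}
```

Under this mapping, the values quoted in Section III ($R = 5$ kΩ, $C = 1$ nF, $\tau_L = 62.5$ ns) would correspond to $\lambda = \tau_L/(RC) = 0.0125$. The discrete rule of Eq. (3) is recovered with $\epsilon = \Delta t/\tau_L$.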
A key element of this approach is the representation of the weights W with voltages rather than with programmable resistances for which memristors and other technologies are still in development 13 . By contrast the charging of capacitors is a textbook phenomenon, allowing us to design a learning circuit that can be built today with established technology. The idea of using capacitor voltages to represent weights in neural networks has been presented by several authors for different network topologies in analog learning circuits [14] [15] [16] [17] . The use of capacitors has the advantage of having a high level of linearity and symmetry for the weight updates during the training process 18 .
In Section II, we will describe such a learning circuit that emulates Eqs. (1)-(3). The training images or patterns $\{v\}_n$ are fed in as electrical signals into the input terminals, and the synaptic weights $W_{ij}$ can then be read out in the form of voltages from the output terminals. Alternatively the values can be stored in a nonvolatile memory from which they can subsequently be read and used for inference. In Section III, we will present SPICE simulations demonstrating the operation of this autonomous learning circuit.
II. CLOCKLESS LEARNING CIRCUIT EMULATING EQS.(1)-(3)
The autonomous learning circuit has three parts, each representing one of Eqs. (1)-(3). On the left-hand side of Fig. 1, the training data is fed into the circuit by supplying a voltage $V_{v,ij}$, which is given by the product $v_i v_j$ of the $i$th and $j$th entries of the bipolar training vector, scaled by $V_{DD}/2$, where $V_{DD}$ is the supply voltage. The training vectors can be fed in sequentially or as an average of all training vectors. The weight voltage $V_{ij}$ across capacitor $C$ follows Eq. (4), where $V_{v,ij}$ is compared to the voltage $V_{m,ij}$, which represents the correlation of the BSN outputs $m_i$ and $m_j$. $V_{m,ij}$ is computed in the circuit by an XNOR gate connected to the outputs of BSN $i$ and BSN $j$. The synapse in the center of the circuit connects the weight voltages to the neurons according to Eq. (2). The voltage $V_{ij}$ has to be multiplied by +1 or -1 depending on the current value of $m_j$. This is accomplished by a switch that connects either the positive or the negative node of $V_{ij}$ to the operational amplifiers OP1 and OP2. Here, OP1 accumulates all negative contributions and OP2 all positive contributions of the synaptic function. The differential amplifier OP3 takes the difference between the output voltages of OP2 and OP1 and amplifies it by the amplification factor $A_v$. This voltage conversion is used to control the voltage level of $V_{ij}$ in relation to the input voltage of each BSN. The voltage level at the input of the BSN is fixed by the reference voltage $V_0$ of the BSN. However, the voltage level of $V_{ij}$ can be adjusted and used to set the regularization parameter $\lambda$ in the learning rule (Eq. (3)). The functionality of the BSN is described by Eq. (1), where the dimensionless input is given by $I_i(t) = V_{i,in}(t)/V_0$. This relates the voltage $V_{ij}$ to the dimensionless weight by $W_{ij} = A_v V_{ij}/V_0$. The hardware implementation of the BSN uses a stochastic MTJ in series with a transistor 6 .
Due to thermal fluctuations of the low-barrier magnet (LBM) of the MTJ, the output voltage of the MTJ fluctuates randomly but with the right statistics given by Eq. (1). The time dynamics of the LBM can be obtained by solving the stochastic Landau-Lifshitz-Gilbert (LLG) equation. Due to the fast thermal fluctuations of the LBM in the MTJ, Eq. (1) can be evaluated on a subnanosecond timescale, leading to fast generation of samples 19, 20 . Fig. 1 shows the hardware implementation of only one weight and one BSN. The size of the whole circuit depends on the size $N$ of the training vector. For every entry of the training vector one BSN is needed. The number of weights, which is the number of capacitive circuits, is given by $N(N − 1)/2$, where every connection between BSNs is assumed to be reciprocal.
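Two small points of this section can be checked in a few lines: an XNOR gate acting on binary logic levels reproduces the bipolar product of two BSN outputs, and the number of capacitive weight circuits grows as N(N − 1)/2. The sketch below assumes the usual encoding in which logic 0 stands for -1 and logic 1 for +1; the function names are hypothetical.

```python
def xnor(a, b):
    """Logic XNOR on binary values (0 or 1)."""
    return 1 if a == b else 0


def to_bipolar(bit):
    """Map logic 0/1 to bipolar -1/+1."""
    return 2 * bit - 1


def n_weights(n_neurons):
    """Number of reciprocal connections between n_neurons BSNs."""
    return n_neurons * (n_neurons - 1) // 2


# XNOR of the binary outputs equals the product of the bipolar outputs.
xnor_matches_product = all(
    to_bipolar(xnor(a, b)) == to_bipolar(a) * to_bipolar(b)
    for a in (0, 1) for b in (0, 1)
)
print(xnor_matches_product)

# Circuit size for a 5-node network and for a 15-node network
# (one BSN per entry of the training vector).
print(n_weights(5), n_weights(15))
```

For the five-variable full adder and the 15-pixel images used in Section III this gives 10 and 105 weight capacitors, respectively.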
The learning process is captured by Eqs. (3) and (4). The whole learning process is similar to the software implementation of persistent contrastive divergence (PCD) 21 , since the circuit takes samples from the model's distribution ($V_{m,ij}$) and compares them to the target distribution ($V_{v,ij}$) without reinitializing the Markov chain after a weight update. During the learning process the voltage $V_{ij}$ reaches a constant average value where $dV_{ij}/dt \approx 0$. This voltage $V_{ij} = V_{ij,learned}$ corresponds to the learned weight.
For inference the capacitor $C$ is replaced by a voltage source of voltage $V_{ij,learned}$. Consequently, the autonomous circuit will compute the desired functionality given by the training vectors. In general, training and inference have to be performed on identical hardware in order to learn around variations. It is important to note that in inference mode this circuit can be used for optimization by performing electrical annealing. This is done by increasing all weight voltages $V_{ij}$ by the same factor over time. In this way the ground state of a Hamiltonian like the Ising Hamiltonian can be found 22, 23 .
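The annealing idea, scaling all weights by a common factor that grows over time, can be sketched in software. This is an illustration under stated assumptions rather than a model of the circuit: the geometric ramp, its start and end values, the sequential update order and the four-spin ferromagnetic example are hypothetical choices.

```python
import math
import random


def energy(W, m):
    """Ising energy E = -sum_{i<j} W_ij m_i m_j."""
    n = len(m)
    return -sum(W[i][j] * m[i] * m[j]
                for i in range(n) for j in range(i + 1, n))


def anneal(W, scale_start=0.1, scale_end=10.0, n_sweeps=200, seed=0):
    """Sample with BSN updates while ramping a common weight scale factor."""
    rng = random.Random(seed)
    n = len(W)
    m = [rng.choice((-1, 1)) for _ in range(n)]
    for k in range(n_sweeps):
        # Geometric ramp of the scale factor from scale_start to scale_end.
        scale = scale_start * (scale_end / scale_start) ** (k / (n_sweeps - 1))
        for i in range(n):
            I_i = scale * sum(W[i][j] * m[j] for j in range(n) if j != i)
            m[i] = 1 if math.tanh(I_i) - rng.uniform(-1.0, 1.0) >= 0 else -1
    return m


# Hypothetical problem: four spins with identical ferromagnetic couplings.
n = 4
W = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
m_final = anneal(W)
print(m_final, energy(W, m_final))
```

At small scale factors the neurons fluctuate almost freely and explore the state space; as the factor grows, the fluctuations are suppressed and the network settles into a low-energy configuration, here one of the two fully aligned states.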
III. SPICE SIMULATIONS
In this section the autonomous learning circuit in Fig. 1 is simulated in SPICE. We show how the proposed circuit can be used for both inference and learning. As examples we demonstrate learning on a full adder (FA) and on 5x3 digit images. For all SPICE simulations the following parameters are used for the stochastic MTJ in the BSN implementation: saturation magnetization $M_S = 1100$ emu/cc, LBM diameter $D = 22$ nm, LBM thickness $l = 2$ nm, TMR = 110%, damping coefficient $\alpha = 0.01$, temperature $T = 300$ K and demagnetization field $H_D = 4\pi M_S V$ with $V = (D/2)^2 \pi l$. For the transistors, 14 nm HP-FinFET PTM models were used 24 with a fin number of 1 for the inverters and 2 for the XNOR gates. Ideal operational amplifiers and switches are used in the synapse. For this parameter set, the reference voltage of each BSN is given by $V_0 = 50$ mV 6 . The characteristic time of the BSNs $\tau_N$ is on the order of 100 ps 20 and much larger than the time it takes for the synaptic connections, namely the resistors and operational amplifiers, to propagate BSN outputs to neighboring inputs ($\tau_N \gg \tau_S$).
Learning addition
As a first training example we use the probability distribution of a full adder (FA). The truth table/probability distribution of a full adder with bipolar variables is shown in Table I. To learn this distribution the correlation terms $v_i v_j$ in the learning rule have to be fed into the voltage node $V_{v,ij}$. The correlation depends on which training vector (truth table line) is fed in. For the second line of the truth table, for example, $v_1 v_2 = (−1) \cdot (−1) = 1$ and $v_1 v_3 = (−1) \cdot 1 = −1$, with A being the first node, B the second node and so on. In Fig. 2 b) the correlation $v_1 v_5$ is shown. For the sequential case the value of $v_1 v_5$ is obtained by cycling through all lines of the truth table, where each training vector is shown for 1 ns. A and C out in Table I differ only in the fourth and fifth lines, for which $v_1 v_5 = −1$. For all other lines $v_1 v_5 = 1$. The average over all lines is shown as a red solid line. Fig. 2 a) shows the weight voltage $V_{ij}$ with $i = 1$ and $j = 5$ during the first 1000 ns of FA training. The following learning parameters have been used for the FA: $\tau_L = 62.5$ ns where $C = 1$ nF and $R = 5$ kΩ, $A_v = 10$ and $R_F = 1$ MΩ. This choice of learning parameters ensures that $\tau_L \gg \tau_N$. Due to the averaging effect of the RC circuit, sequential and average feeding of the training vectors result in similar learning behavior as long as the RC constant is much larger than the timescale of sequential feeding. Fig. 2 c) shows an enlarged version of Fig. 2 a). For sequential feeding, the voltage $V_{1,5}$ changes substantially every time $v_1 v_5$ switches to -1.
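The correlation values quoted above can be reproduced by generating the bipolar full-adder truth table. The sketch below assumes that the lines are ordered by counting up the inputs (A, B, C in) in binary and that the nodes are ordered A, B, C in, S, C out; both orderings are assumptions, chosen because they agree with the examples given in the text.

```python
from itertools import product


def full_adder_table():
    """Bipolar truth table of a full adder, columns (A, B, C_in, S, C_out)."""
    rows = []
    for a, b, c_in in product((0, 1), repeat=3):
        total = a + b + c_in
        s, c_out = total % 2, total // 2
        # Map logic 0/1 to bipolar -1/+1.
        rows.append(tuple(2 * bit - 1 for bit in (a, b, c_in, s, c_out)))
    return rows


def correlation_sequence(rows, i, j):
    """Product v_i v_j for every line (0-based node indices)."""
    return [row[i] * row[j] for row in rows]


rows = full_adder_table()
v1v5 = correlation_sequence(rows, 0, 4)   # A and C_out
average_v1v5 = sum(v1v5) / len(v1v5)
print(v1v5)          # sequential feeding: one value per truth-table line
print(average_v1v5)  # average feeding: a single constant value
```

With this ordering the product $v_1 v_5$ equals -1 only on the fourth and fifth lines, so the averaged drive is (6 − 2)/8 = 0.5 in units of $V_{DD}/2$.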
In Fig. 3 a) the probability distribution of the FA, $P_{Train}$, is shown after 5000 ns of learning and compared to the ideal distribution $P_{Ideal}$. The training vector is fed in as an average. Fig. 3 a) also shows the normalized histogram $P_{SPICE}$ of the sampled BSN configurations taken from the last 2500 ns of learning. For both $P_{Train}$ and $P_{SPICE}$ the eight trained configurations of Table I are the dominant peaks. To monitor the training process, the Kullback-Leibler divergence between the ideal and the trained probability distribution, $KL(P_{Ideal}||P_{Train}(t))$, is plotted as a function of training time $t$ in Fig. 3 b). During training the KL divergence decreases over time until it reaches a constant value of about 0.33. Fig. 3 shows that the probability distribution of an FA can be learned very fast with the proposed autonomous learning circuit.
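The KL divergence used to monitor training can be evaluated exactly for a small network by enumerating all states. The sketch below makes several assumptions that are not stated in the text: the model distribution is taken to be the Boltzmann distribution $P(\{m\}) \propto \exp(\sum_{i<j} W_{ij} m_i m_j)$ implied by sequential updates with symmetric weights, the natural logarithm is used, and the network starts from zero weights.

```python
import math
from itertools import product


def boltzmann_distribution(W):
    """Exact P({m}) proportional to exp(sum_{i<j} W_ij m_i m_j)."""
    n = len(W)
    states = list(product((-1, 1), repeat=n))
    weights = [
        math.exp(sum(W[i][j] * s[i] * s[j]
                     for i in range(n) for j in range(i + 1, n)))
        for s in states
    ]
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}


def kl_divergence(p_ideal, p_model):
    """KL(P_ideal || P_model) = sum_s P_ideal(s) ln[P_ideal(s) / P_model(s)]."""
    return sum(p * math.log(p / p_model[s])
               for s, p in p_ideal.items() if p > 0)


def full_adder_ideal():
    """Ideal distribution: each of the 8 truth-table lines has probability 1/8."""
    p = {}
    for a, b, c_in in product((0, 1), repeat=3):
        total = a + b + c_in
        bits = (a, b, c_in, total % 2, total // 2)
        p[tuple(2 * x - 1 for x in bits)] = 1 / 8
    return p


# Untrained network: all weights zero, so all 32 states are equally likely.
W0 = [[0.0] * 5 for _ in range(5)]
kl_start = kl_divergence(full_adder_ideal(), boltzmann_distribution(W0))
print(kl_start)  # ln(32/8) = ln 4
```

Under these assumptions an untrained five-neuron network starts at KL = ln 4, which gives a reference point for judging how far training has reduced the divergence.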
Learning image completion
As a second example the circuit is used to learn ten 5x3-pixel digit images shown in Fig. 4 a). The network is trained for 3000 ns and the bipolar training data is fed in as the average of the $v_i v_j$ terms over the ten digits. The same learning parameters as in the previous section are used here. In Fig. 4 b) the KL divergence between the trained and the ideal probability distribution is shown as a function of time, where the ideal distribution has ten peaks of 10 % each, one for each digit. Most of the learning happens in the first 500 ns of training; however, the KL divergence still decreases slightly during the later parts of learning. After 3000 ns the KL divergence reaches a value of around 2. For inference we replace each capacitor by a voltage source whose voltage is given by the previously learned voltage $V_{ij}$. The circuit is run for ten instances, where every instance has a unique clamping pattern of 6 pixels representing one of the ten digits. The clamped inputs are shown in Fig. 4 c). The input of a clamped BSN is set to $\pm V_{DD}/2$. Each instance is run for 100 ns and the outputs of the BSNs are monitored. The BSNs fluctuate between the configurations given by the learned probability distribution. In Fig. 4 d) the heat map of the output of the BSNs is shown. For every digit the most likely configuration is given by the trained digit image. To illustrate this point, the amount of BSN fluctuations is reduced by increasing the learned weight voltages by a factor of $I_0 = 2$. This step is equivalent to halving the temperature of the Boltzmann machine 25 . The circuit is again run in inference mode for 100 ns with the same clamping patterns. In Fig. 4 e) the heat map is shown. The circuit locks into the learned digit configuration. This shows that in inference mode the circuit can be used for image completion.
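The effect of clamping and of scaling the weights by a common factor can be illustrated on a toy network by exact enumeration. This is a sketch under stated assumptions, not the 15-pixel network of this section: clamping is idealized as conditioning on fixed neuron values, the distribution is taken as $P(\{m\}) \propto \exp(I_0 \sum_{i<j} W_{ij} m_i m_j)$, and the four-pixel pattern with hand-set weights is hypothetical rather than learned.

```python
import math
from itertools import product


def completion_probability(W, clamps, target, scale=1.0):
    """Probability that all free neurons match `target`, given clamped neurons.

    clamps maps neuron index -> fixed value (+1 or -1).
    """
    n = len(W)
    free = [i for i in range(n) if i not in clamps]
    z = 0.0
    hit = 0.0
    for values in product((-1, 1), repeat=len(free)):
        m = [0] * n
        for i, v in clamps.items():
            m[i] = v
        for i, v in zip(free, values):
            m[i] = v
        weight = math.exp(scale * sum(W[i][j] * m[i] * m[j]
                                      for i in range(n)
                                      for j in range(i + 1, n)))
        z += weight
        if all(m[i] == target[i] for i in free):
            hit += weight
    return hit / z


# Hypothetical 4-pixel pattern stored with hand-set weights W_ij = 0.5 * p_i * p_j.
pattern = (1, -1, 1, -1)
W = [[0.0 if i == j else 0.5 * pattern[i] * pattern[j] for j in range(4)]
     for i in range(4)]
clamps = {0: pattern[0], 1: pattern[1]}   # clamp the first two pixels

p_scale_1 = completion_probability(W, clamps, pattern, scale=1.0)
p_scale_2 = completion_probability(W, clamps, pattern, scale=2.0)
print(p_scale_1, p_scale_2)
```

Doubling the scale factor sharpens the conditional distribution, so the free pixels complete the stored pattern with higher probability, which mirrors the reduced fluctuations reported for $I_0 = 2$.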
IV. CONCLUSIONS
We have used full SPICE simulations to demonstrate the feasibility of a clockless autonomous circuit that runs without any digital component and whose learning parameters are set by circuit parameters. Due to the fast BSN operation, samples are drawn at subnanosecond speeds, leading to ultrafast learning; as such, the learning speed should be multiple orders of magnitude faster than on other computing platforms [26] [27] [28] . We believe the approach can be extended to other machine learning algorithms to design autonomous circuits. Such standalone learning devices could be of particular interest in the context of mobile and edge computing.
