Figure 1. A neuromorphic SoC architecture: (a) A fully-connected spiking neural network (SNN); (b) Crossbar synaptic array and columns/rows of mixed-signal CMOS neurons; (c) A synapse between the input (pre-synaptic) and output (post-synaptic) neurons that adjusts its weight using STDP; (d) Floorplan of 20 SNN arrays.
Since 'analog' memristor device technology is yet to mature while practical demonstrations in neural circuits are being pursued [14]-[16], we earlier proposed a low-risk and robust alternative for circuit prototyping using a CMOS memristor emulator [17], [18]. In this work, we extend this CMOS memristor concept to memristive synapse circuits that realize bio-plausible spike-timing-dependent plasticity (STDP) learning. The rest of the manuscript is organized as follows: Section II presents energy estimation of memristive NeuSoCs; Sections III and IV describe the CMOS memristor and synapse circuits. Finally, Section V presents simulation results and an application to an image classification task, followed by the conclusion.
II. ENERGY-EFFICIENCY OF NEUROMORPHIC SoCs
The primary motivation for exploring memristive (or emerging NVM-based) spiking neural networks is to achieve orders-of-magnitude energy-efficiency improvement over contemporary digital architectures. This is expected to be achieved by employing event-driven asynchronous spiking neural networks (SNNs), with low-power circuits and ultra-low-power synaptic (memory) devices. In an SNN, the spike shape parameters and the low-resistance state (LRS) resistance, $R_{LRS}$, of the memristive devices ($R_{HRS}$ is typically order(s) of magnitude higher than $R_{LRS}$) contribute to the energy consumed in a spike event. The total energy consumption is also determined by the sparsity, i.e. the percentage of synapses in the LRS state, the spiking activity, and the power consumption in the CMOS neurons. Assuming a rectangular spike pulse shape of amplitude $V_p$ and width $T_p$, the input current signal is $I_{syn} = V_p / R_M$, and the energy consumption for a spike driving a synapse with resistance $R_M$, $R_{LRS} < R_M < R_{HRS}$, is given by

$$E_{spk} = V_p I_{syn} T_p = \frac{V_p^2 T_p}{R_M}.$$
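As a rough illustration of how the per-spike synaptic energy scales with device resistance, the sketch below evaluates the resistive dissipation $V_p^2 T_p / R_M$ of a rectangular pulse. The pulse amplitude and width used here are hypothetical placeholders (they are not values from this work); the point is only the inverse dependence on resistance.

```python
def spike_energy(v_p, t_p, r_m):
    """Energy (J) dissipated by a rectangular pulse of amplitude v_p (V)
    and width t_p (s) across a synapse of resistance r_m (ohm)."""
    i_syn = v_p / r_m          # synaptic current during the pulse
    return v_p * i_syn * t_p   # E = V * I * T = V^2 * T / R


# Hypothetical pulse parameters, chosen only for illustration.
V_P = 0.3      # volts (assumed)
T_P = 1e-6     # seconds (assumed)

for r_lrs in (0.1e6, 1e6, 10e6):   # the 0.1-10 Mohm range considered in Section II
    e = spike_energy(V_P, T_P, r_lrs)
    print(f"R_LRS = {r_lrs / 1e6:5.1f} Mohm -> {e * 1e15:8.1f} fJ per spike")
```

Because the energy is inversely proportional to $R_M$ in this model, moving from 100 kΩ to 10 MΩ lowers the synaptic energy per spike by a factor of 100, independent of the assumed pulse parameters.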
The approximate SNN energy consumption for one event can be formulated as

Abstract-Emerging non-volatile memory (NVM), or memristive, devices promise energy-efficient realization of deep learning when efficiently integrated with mixed-signal integrated circuits on a CMOS substrate. Even though several algorithmic challenges need to be addressed to turn the vision of memristive Neuromorphic Systems-on-a-Chip (NeuSoCs) into reality, issues at the device and circuit interface need immediate attention from the community. In this work, we perform energy estimation of a NeuSoC system and predict the desirable circuit and device parameters for energy-efficiency optimization. Also, CMOS synapse circuits based on the concept of a CMOS memristor emulator are presented as a system prototyping methodology, while practical memristor devices are being developed and integrated with general-purpose CMOS. The proposed mixed-signal memristive synapse can be designed and fabricated using standard CMOS technologies and opens doors to interesting applications in cognitive computing circuits.
Index Terms-CMOS Neurons, Memristors, Neuromorphic computing, Spiking Neural Networks (SNNs), STDP, Synapses.
I. INTRODUCTION
A grand challenge for the semiconductor industry is to "Create a new type of computer that can proactively interpret and learn from data, solve unfamiliar problems using what it has learned, and operate with the energy-efficiency of the human brain [1]." Deep neural networks, or deep learning, have been remarkably successful with a growing repertoire of problems in image and video interpretation, speech recognition, control, and natural language processing [2]. However, these implementations are compute-intensive and employ high-end servers with graphics processing units (GPUs) to train deep neural networks. Furthermore, the new International Roadmap for Devices and Systems (IRDS), which replaces the ITRS roadmap, looks forward to More-Moore and Beyond-Moore technologies to develop radically different data-centric computing architectures [3], [4]. New architectures are required that transcend the device variability and interconnect scaling bottlenecks in nano-scale CMOS, exploit massive parallelism, and employ in-memory computing as inspired by biological brains. Recent progress in memristive or resistance-switching devices (RRAM, STT-RAM, phase-change memory, etc.) has spurred renewed interest in neuromorphic computing [5]-[11]. Such memristive devices, integrated with standard CMOS technology, are expected to realize low-power neuromorphic systems-on-a-chip (NeuSoCs) with embedded deep learning and orders of magnitude lower power consumption than GPUs, as illustrated in Fig. 1 [12],

978-1-5386-4881-0/18/$31.00 ©2018 IEEE

Figure 2. A CMOS memristor circuit with its pinched hysteresis curve [17], [18].
Here, the voltage on the capacitor, $V_G$, controls the gate of $M_1$ and is thus the 'state,' $x$, of the synapse. The switch (SW) prevents the output of $G_m$ from leaking the state ($x \equiv V_G$) on capacitor $C_m$ when no inputs are applied. Assuming that $M_1$ is in triode, the dynamics of the memristor circuit are approximated as
where $V_{GS}$ and $V_{THN}$ are the gate-to-source and threshold voltages; $\beta = KP_n \frac{W}{L}$, where $KP_n$ is the transconductance parameter and $W/L$ is the sizing for $M_1$. In order to force $M_1$ into triode for large drain-source voltage swings, a zero- or low-threshold voltage (ZVT/LVT) transistor is employed [17], [18]. The simulated current-voltage characteristics for the memristor circuit, seen in Fig. 2(c), confirm the pinched hysteresis signature of an ideal memristor.

Contemporary memristive devices exhibit several limitations: they show stochastic switching and variability in resistance states, depending upon the initial 'forming' step [31]-[33]. Further, it is challenging to realize stable multi-valued weights with filamentary devices [15], [34]; oxide-switching devices have exhibited ~9 states and their performance in situ in a circuit is being investigated [35]. A greater impediment to realizing NeuSoCs is the low LRS resistance observed in memristive devices (100 Ω - 10 kΩ) [35], which leads to energy inefficiency as detailed earlier in Section II. Thus, it is desirable to realize CMOS-based memristive synapses for enabling system-level exploration while the memristive devices mature in research.
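The pinched hysteresis behaviour of the emulator can be illustrated with a simple behavioural simulation. The sketch below is not the circuit's exact model: it assumes a deep-triode approximation in which $M_1$ acts as a resistor whose conductance is $\beta (V_G - V_{THN})$, ignores the dependence of $V_{GS}$ on the terminal voltages, keeps the switch permanently closed, and uses hypothetical values for every parameter.

```python
import math


def simulate_emulator(freq_hz=1e5, amp_v=0.1, cycles=1, steps_per_cycle=1200,
                      beta=50e-6, v_thn=0.0, g_m=1e-6, c_m=1e-12, v_g0=0.6):
    """Return a list of (v_ab, i_ab, v_g) samples for a sinusoidal drive.

    Assumed behavioural model (a simplification, all values hypothetical):
      i_ab    = beta * (v_g - v_thn) * v_ab   # M1 as a gate-controlled resistor
      dv_g/dt = g_m * v_ab / c_m              # transconductor charging C_m
    """
    dt = 1.0 / (freq_hz * steps_per_cycle)
    v_g = v_g0
    samples = []
    for n in range(cycles * steps_per_cycle):
        v_ab = amp_v * math.sin(2.0 * math.pi * n / steps_per_cycle)
        conductance = beta * max(v_g - v_thn, 0.0)
        samples.append((v_ab, conductance * v_ab, v_g))
        v_g += g_m * v_ab / c_m * dt   # forward-Euler update of the state
    return samples


trace = simulate_emulator()
v_up, i_up, _ = trace[100]    # rising branch, phase = 30 degrees
v_dn, i_dn, _ = trace[500]    # falling branch, phase = 150 degrees
print(f"v = {v_up:.3f} V: i = {i_up * 1e6:.3f} uA (rising), "
      f"{i_dn * 1e6:.3f} uA (falling)")
```

The current is zero whenever the applied voltage is zero, while the same voltage on the rising and falling branches yields different currents because the state $V_G$ has changed in between; together these give the pinched loop.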
resistance between terminals A and B. The transconductor $G_m$ senses the voltage across the two terminals and produces a small-signal current, which is integrated as charge on capacitor $C_m$. When the strobe $\phi_1$ is low, the capacitor $C_m$ is disconnected from the transconductor and holds the stored charge, thus realizing a dynamic analog memory.
IV. MEMRISTIVE SYNAPSE CIRCUIT
Memristive spiking circuits typically use analog spikes with a rectangular positive pulse and a negative exponential tail [22]
Table I (fragment). Acceleration over GPU [24]: 14x
where $\eta_{sparsity}$ is the sparsity factor (i.e. the fraction of neurons firing on average), $\eta_{LRS}$ is the fraction of synapses in the LRS state, $N_s$ is the number of synaptic connections, and $N_n$ is the number of neurons. $P_{neuron}$ is the neuron power consumption; energy consumed in the peripheral circuits is ignored to simplify the analysis. To provide a rough system-level comparison, the AlexNet convolutional neural network for deep learning used for the ImageNet Challenge comprised 61M synaptic weights and 640k neurons [19]. We assume that an equivalent SNN is constructed through transfer learning [20], or a spike-based equivalent of the backpropagation algorithm [21]; the circuit architecture is essentially the same. With an estimation based on the RRAM-compatible spiking neuron chip realized in [22], 4-bit compound memristive synapses [14], [15], [23], and $R_{LRS}$ ranging from 0.1-10 MΩ, the energy consumption for processing (training or classification) of one image is shown in Table I. Compared with the contemporary advanced GPU Nvidia P4 [24] (170 images/s/W), a memristive architecture with $R_{LRS}$ = 100 kΩ provides a meagre 14x improvement in energy efficiency. However, the energy consumption can be significantly reduced if the LRS resistance of the memristive devices can be increased to the high-MΩ regime, leading to a potential 1000x-range performance improvement; high LRS resistance also helps reduce the power consumption in the opamp-based neuron circuits [22], [25]. Since there has been less focus on realizing high-LRS devices as the multi-valued memristive devices are still under development, circuit solutions are desired to address this wide energy-efficiency gap.

Table I column headings (recovered fragment): Low, Medium, High.

The memristor was defined as a two-terminal circuit-theoretic concept in [26], and later extended to a wider class of memristive devices [27].
The fundamental promise of memristive devices lies in their 'analog' memory, which enables them to store as well as manipulate information in the analog domain. This is harnessed in neuromorphic computing, where memristors realize incremental synapses that learn based on STDP, a bio-inspired local learning rule that implements the spike-based expectation maximization (SEM) algorithm [12], [15], [21], [28]-[30]. The author recently proposed a compact CMOS memristor (emulator) circuit [17], [18]. The fundamental concept is illustrated in Fig. 2(a) and (b), where an n-channel MOSFET (NMOS) $M_1$ implements a floating variable

for optimizing CMOS circuit area and energy consumption [36]. Current-output type bio-mimetic synapses are pervasive in the literature ([37] and references therein), where subthreshold analog techniques were used to mimic synaptic ion-channel dynamics. Most recently, [38] reported a pair-wise STDP synapse with short-term retention, and [39] combined subthreshold circuits with a latch. In contrast, we have proposed the memristive STDP-learning synapse concept shown in Figure 3, which was previously disclosed by us in [17]. In this work, we expand on the previous disclosure and present circuits and system-level details. The circuit employs the trace decay method for emulating STDP, as used in event-driven simulators for computational neuroscience [38]-[40]. The STDP weight update block converts the relative timing between pre and post spikes ($\Delta t = t_{post} - t_{pre}$) into a change in $V_G$, and thus in the synaptic weight. Figure 3(b) shows schematics for the synapse, and one of several possible transistor-level implementations is shown in Fig. 4. Here, the input pre and post pulses are converted into voltage traces $V_{p,exp}$ and $V_{m,exp}$ respectively, using the two Exponential Decay Circuits (EDCs).

Figure 5(a) illustrates the synapse operation, and the resulting pair-wise additive STDP learning function is shown in Figure 5(b). Here, the (pre)-synaptic spike arrives earlier than the (post)-synaptic spike. The EDC output, $V_{p,exp}$, is then sampled by the post spike. This sampled voltage $V_{p,exp}(\Delta t)$ leads to an increase in the voltage $V_G$ (i.e. the state of the synapse) and an increase in synaptic weight/conductance ($w \equiv G(V_G)$); the synapse undergoes short-term potentiation. Similarly, in the second case, the post spike arrives earlier than the pre spike, which in turn reduces the synapse state $V_G$; the synapse undergoes short-term depression. The references $A_+$ and $A_-$ determine the maximum synaptic potentiation/depression.

Even though the dynamic STDP synapses provide analog states, they can only realize short-term potentiation/depression, as the capacitor memory leaks away in a few milliseconds. However, in a NeuSoC, the final weights after training must be persistent and amenable to read-out/in. This is realized by employing long-term bistability in synapses where, after short-term STDP learning, the weights are quantized to either a high or low binary conductance state. As shown in Figs. 3 and 4, a weak latch is connected to $V_G$. This slow-resolving subthreshold latch is designed for very large regeneration time-constants ($\tau_w$ = 1-5 ms) such that it doesn't interfere with the short-term STDP learning. However, once the STDP pulses are no longer present, the weak bistable latch slowly steers the state of the synapse to either a high-voltage (LRS) or a low-voltage (HRS) long-term state, which can easily be read out.
The synapse circuit in Fig. 4 is implemented in a 130 nm CMOS technology with supply $V_{DD}$ = 1.2 V. The transistor sizing and parameter values used in this circuit are listed in Table II. The memristive synapse, for the given sizing of $M_1$, realizes LRS and HRS resistances of 0.4 MΩ and 16 MΩ respectively, providing significant improvement over

Figure 9. SNN for handwritten digits classification. Synaptic weight evolution rearranged as 8x8 bitmaps for analog and the proposed bistable synapses.
VI. CONCLUSION
A compact analog memristive STDP synapse circuit, with long-term binary retention and high LRS resistance, is introduced and designed in standard CMOS, and analytical as well as simulation results are presented. The circuit is used to realize an image classification application and the challenges are discussed. In summary, the synapse provides an efficient circuit solution for NeuSoC architecture exploration while memristive devices on CMOS platforms reach maturity.

2 ms; the latch slowly resolves the synaptic state to logic high (LRS) or low (HRS).
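The slow resolution of the weak latch can be pictured with a first-order behavioural model. The sketch below is only an illustration under stated assumptions: it treats the latch as a linear regenerative element, $dV/dt = (V - V_{thr}) / \tau_w$, clamped between ground and the supply, with no STDP pulses present. The threshold (about 0.6 V), time constant (about 2 ms) and supply (1.2 V) follow the values quoted in the text; the linear model, the function name and the step size are assumptions.

```python
def settle_latch(v_start, v_thr=0.6, tau_w=2e-3, v_dd=1.2,
                 t_end=20e-3, dt=10e-6):
    """Integrate dV/dt = (V - v_thr) / tau_w, clamped to [0, v_dd].

    Linear regeneration is an assumed simplification of the subthreshold latch.
    """
    v = v_start
    for _ in range(int(round(t_end / dt))):
        v += (v - v_thr) / tau_w * dt      # regenerative drift away from v_thr
        v = min(max(v, 0.0), v_dd)         # rails limit the state voltage
    return v


for v0 in (0.55, 0.65):
    v_final = settle_latch(v0)
    state = "LRS (high)" if v_final > 0.6 else "HRS (low)"
    print(f"start {v0:.2f} V -> settles at {v_final:.2f} V, {state}")
```

A state left just above the threshold drifts to the supply rail (LRS) and one just below drifts to ground (HRS), on a time scale of a few $\tau_w$, which is why the latch does not disturb STDP updates that occur on a microsecond time scale.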
An SNN, similar to [30], was set up using the bistable memristive synapse and winner-take-all neuron macro-models using the Brian2 library in Python [40]. The UCI handwritten digits dataset (3,823 training and 1,797 test 8 x 8 bitmap images [41]) was used to train the fully-connected SNN with 64 input and 10 output neurons, 640 synapses, and a teacher signal enforcing the output labels. Fig. 9 shows the learned weights for each output neuron. For analog synapses the test accuracy was 83% for all 10 digits (96% for 4 digits); the bistable synapses achieve an accuracy of ~74% for 10 digits due to binary quantization during training. Care must be taken to ensure that $\tau_w$ is much larger than the time for which input samples are presented (50 µs) to avoid catastrophic forgetting. In this experiment, the bistable SNN was trained for 500 images, as a large number of weights, $W_{ij}$, start approaching $W_{max} = 1$, resulting in loss of classification accuracy. Further, $W_{min} \approx 0.01$ must be used, as otherwise there is a chance of all the weights getting quantized to 0, and the neurons will never fire.

Long-term bistability is demonstrated through simulations in Fig. 8, where spikes are applied such that the weight crosses the latch's threshold point, $V_{w,thr} \approx 0.6$ V, in Fig. 8 (left) and is below threshold in Fig. 8 (right). The weak latch is biased in subthreshold and has a regenerative time-constant of $\tau_w$

contemporary memristive devices. As detailed in [22], [30], the traditional subthreshold neuron designs are not suitable for driving memristive loads. The opamp-based integrate-and-fire neurons with winner-take-all STDP learning interface from the author's prior work in [30] can directly be adapted to interface with the presented synapses; higher LRS resistance will further help simplify the opamp design.

Figure 6. Transient simulation for a single synapse for $\Delta t = t_{post} - t_{pre} = -1$ µs: (left) the state, $V_G$, is incrementally decreased, resulting in (right) a corresponding change in the conductance (synaptic current, $I_{syn}$) of the synapse.
Next, a transient simulation, shown in Fig. 7, is constructed to determine the STDP learning function for the synapse circuit. Here, pre and post spikes are applied with progressively changing $\Delta t$ from -10 µs to 10 µs, with a spacing of 50 µs to allow the transients to completely decay. This results in an approximately double-exponential learning function, characteristic of the pair-wise STDP function seen in Fig. 5(b).
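The double-exponential shape follows directly from sampling an exponentially decaying trace at the time of the other spike. The sketch below is a minimal behavioural version of that idea, not the transistor-level circuit: the amplitudes are normalized and the 3 µs decay constants are hypothetical, chosen only so that the window fits the -10 µs to 10 µs sweep.

```python
import math


def stdp_update(delta_t, a_plus=1.0, a_minus=1.0, tau_plus=3e-6, tau_minus=3e-6):
    """Pair-wise additive STDP via trace sampling (normalized, assumed values).

    delta_t = t_post - t_pre in seconds. For delta_t > 0 the post spike samples
    the decaying pre trace (potentiation); for delta_t < 0 the pre spike
    samples the decaying post trace (depression).
    """
    if delta_t > 0:
        return a_plus * math.exp(-delta_t / tau_plus)
    if delta_t < 0:
        return -a_minus * math.exp(delta_t / tau_minus)
    return 0.0


# Sweep delta_t from -10 us to 10 us, mirroring the transient simulation.
for k in range(-10, 11, 2):
    dt = k * 1e-6
    print(f"delta_t = {k:+3d} us -> normalized update {stdp_update(dt):+.3f}")
```

Each pair is treated independently here, which matches the 50 µs spacing used in the simulation to let the traces decay fully between pairs.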
V. SIMULATION RESULTS
In this design, the total standby current drawn from $V_{DD}$ is ≈ 490 pA, while 10.4 nA is drawn during the pre/post spike event. This results in a static power consumption of 588 pW and a dynamic energy consumption of 91.24 fJ/spike (for $V_{pre} - V_{post}$ = 600 mV) in the LRS state. This circuit can be easily modified to meet different specifications and further optimized for energy efficiency, area and speed. Figure 6 shows a transient simulation for a single synapse; pre and post pulses are applied with $\Delta t = t_{post} - t_{pre} = -1$ µs, spaced by 50 µs, and the state voltage $V_G$ and the synaptic current between the pre and post terminals are displayed. We can observe that the state undergoes a monotonic decrease due to pair-wise STDP updates, with a corresponding decrease in synapse weight/conductance, $w$, and thus in the synaptic current, $I_{syn} = w(V_{pre} - V_{post})$.
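A quick arithmetic check of the quoted figures is sketched below. It uses only values stated in the text (1.2 V supply, 490 pA standby current, 0.4 MΩ and 16 MΩ for the LRS and HRS, 600 mV across the synapse) and assumes the synaptic conductance is simply the reciprocal of the quoted resistance in each state.

```python
V_DD = 1.2            # supply voltage (V)
I_STANDBY = 490e-12   # standby current drawn from V_DD (A)
R_LRS = 0.4e6         # LRS resistance (ohm)
R_HRS = 16e6          # HRS resistance (ohm)
V_SPIKE = 0.6         # V_pre - V_post during a spike (V)

static_power = V_DD * I_STANDBY     # P = V * I
i_syn_lrs = V_SPIKE / R_LRS         # I_syn = w * (V_pre - V_post), w = 1/R (assumed)
i_syn_hrs = V_SPIKE / R_HRS

print(f"static power = {static_power * 1e12:.0f} pW")
print(f"I_syn in LRS = {i_syn_lrs * 1e6:.2f} uA")
print(f"I_syn in HRS = {i_syn_hrs * 1e9:.1f} nA")
```

The product of the supply voltage and standby current reproduces the quoted 588 pW static power; the LRS and HRS synaptic currents differ by the same factor of 40 as the two resistances.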
