An analog hardware Neural Network (NN) that uses a crossbar array of synapses to store the weights of the NN provides an extremely fast and energy-efficient hardware platform for implementing NN algorithms. Here, we design a crossbar network with a single conventional silicon-based MOSFET as a synapse. We model the synapse characteristic using SPICE, benchmarked against experimentally obtained data. We also design analog peripheral circuits for the neuron and for synaptic weight update calculation. Next, using circuit simulations, we demonstrate "on-chip" learning (training in hardware) in the designed network. We obtain high classification accuracy on a standard machine learning dataset, Fisher's Iris dataset. A linear and symmetric conductance response and an easy, well-developed method of fabrication are the two main advantages of our proposed transistor synapse compared to most synapses currently used for NN implementation.
I. INTRODUCTION
An analog hardware Neural Network (NN) that uses a crossbar array to store the weights of the NN has been demonstrated to implement the feed-forward Fully Connected Neural Network (FCNN) algorithm with high speed and low energy consumption [1]. The ability of such a crossbar array to execute Vector Matrix Multiplication (VMM), inherent in an FCNN algorithm, in a parallel fashion makes it suitable for both forward inference [1] and on-chip learning (training in hardware). In fact, on-chip learning in such a crossbar array has been considered to be faster and more energy efficient than conventional training of an NN on a GPU [2]. However, the Non-Volatile Memory (NVM) devices that are used as synapses in such crossbar arrays to store and update weights have several issues with respect to on-chip learning. Floating gate transistor synapses need high voltage pulses for weight update and have low endurance [3], [4], [5]. Memristive oxide based Resistive Random Access Memory (RRAM) devices and Phase Change Memory (PCM) devices exhibit an asymmetric/unipolar and non-linear dependence between conductance (and weight) update and programming pulses, which affects the accuracy during on-chip learning of crossbar arrays that use such devices [1], [6], [7], [8], [9], [10]. Also, a large amount of energy is consumed in "Reset" programming pulses in such implementations [6]. Moreover, fabrication of hardware NN systems based on NVM devices like RRAM, PCM or spintronic devices [1], [11], [12] requires dedicated in-house fabrication facilities since they involve novel materials. Such a system cannot be designed in-house and then fabricated elsewhere, e.g. in commercial merchant foundries, unlike conventional silicon-based CMOS circuits and systems.
In this paper, we propose a single conventional silicon (Si)-silicon dioxide (SiO$_2$) based MOSFET at the 65 nm technology node as a synapse (Figs. 1, 2) and design a crossbar array based analog hardware NN using it (Fig. 3). Using conventional transistor based peripheral circuits (Fig. 4) designed by us, we demonstrate on-chip learning in the NN on a standard machine learning dataset (Figs. 5, 6, 7). The crossbar array and peripheral circuits together form an integrated circuit. The main advantage of a hardware NN based on this kind of synapse is that it can be easily fabricated in conventional CMOS merchant foundries. Also, the dependence between conductance (and weight) update and programming pulse number for the proposed synapse is linear, and symmetric for the weight going up and down (Fig. 5), unlike PCM and RRAM synapses [1], [6], [7]. Note that our proposed transistor synapse does not have a floating gate and hence does not need large voltage pulses (and hence high energy) for weight update, which again makes it very suitable for on-chip learning [3], [4], [5]. Note also that earlier reports of conventional silicon transistor based synapses use multiple transistors to store each bit of the weight value stored in the synapse [13], whereas here analog weight values are stored in a single transistor as different conductance states.
II. SYNAPTIC DEVICE CHARACTERISTIC AND NEURAL NETWORK CIRCUIT DESIGN
Fig. 1(a) shows the schematic of the Si-SiO$_2$ based n-MOSFET that we propose as a synapse in this paper. We simulate it at the 65 nm technology node through SPICE simulations on Cadence Virtuoso using the United Microelectronics Corporation (UMC) library. The drain current ($I_D$) vs drain-to-source voltage ($V_{DS}$) characteristic is linear for a certain range of $V_{DS}$ (0 to 0.1 V in this case) for gate-to-source voltage $V_{GS}$ between 0.6 V and 1.6 V (Fig. 1(b)). This behavior is expected for a conventional MOSFET [15] and matches qualitatively with the $I_D$-$V_{DS}$ characteristic we experimentally measure on a single n-MOSFET present inside the commercially available CD4007 inverter chip and accessible through package terminals (Fig. 2(a)). Throughout this paper, we operate the synapse with $V_{DS}$ between 0 and 0.1 V and $V_{GS}$ between 0.6 and 1.6 V. For any combination of $V_{GS}$ and $V_{DS}$, the ratio of $I_D$ to $V_{DS}$ determines the drain-to-source conductance $G_{DS}$, which is in general a function of both $V_{GS}$ and $V_{DS}$, i.e. $G_{DS}(V_{GS}, V_{DS})$ (Fig. 1(b)). We observe that for a fixed $V_{GS}$, the change in $G_{DS}$ ($\Delta G_{DS}$) is of the order of $10^{-4}\ \Omega^{-1}$ when $V_{DS}$ varies over the full selected range (0 to 0.1 V) (Fig. 1(c)). However, for a fixed $V_{DS}$, when $V_{GS}$ varies over its full range (0.6 to 1.6 V), the change in $G_{DS}$ is $5 \times 10^{-3}\ \Omega^{-1}$, which is one order of magnitude higher than the change in $G_{DS}$ due to a full sweep of $V_{DS}$. Thus $G_{DS}$ can be approximated as a function of $V_{GS}$ alone, and not of $V_{DS}$ (and, by extension, $I_{DS}$), in the selected range of operation.
Hence we can write
$$I_D \approx G_{DS}(V_{GS})\, V_{DS} \qquad (1)$$
Experimentally measured data on the n-MOSFET in the CD4007 chip qualitatively match this observation (Fig. 2(b),(c)). For a quantitative match, the specifications of the simulated and experimentally measured transistors must be the same, which is not the case in this work.
When our simulated transistor is used as a synapse in an M input nodes × N output nodes crossbar array based analog hardware FCNN as shown in Fig. 3(a), input vector-weight matrix multiplication, or the VMM operation, takes place as part of the feedforward computation in both the training phase and the testing/inference phase [1], [14]. During this process, the input vector $\{x_1, x_2, x_3, \ldots, x_m, \ldots, x_M\}$ corresponding to a training sample is applied as drain voltages, in the form of 1 ns duration pulses, on the transistor synapses as shown in Fig. 3(a). The sources of all the transistor synapses are maintained at 0 V through "virtual ground"-ing using op-amps at the initial stage of the neuron circuit (Fig. 3(a),(b)). Thus for a transistor synapse connecting input node m with output node n, its $V_{DS}$ is proportional to input $x_m$. If its conductance $G_{DS}$ represents its weight $w_{n,m}$, then from equation (1) its drain current $I_D$ turns out to be proportional to $w_{n,m} x_m$. Even when $V_{DS}$, and hence $I_D$, changes during the VMM operation in both the training and testing phases of the hardware, this relation holds as long as $V_{DS}$ stays in the chosen range of 0 to 0.1 V. In any case, since we show training through SPICE simulations of the entire hardware in this paper, any effect of higher-order terms in the drain voltage $V_{DS}$ on the current $I_D$ is taken into account in our analysis and final accuracy results.
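The weak dependence of the conductance on the drain voltage in the chosen operating range can be illustrated with the textbook long-channel square-law model of a MOSFET in the triode region. This is only a minimal sketch and not the UMC 65 nm SPICE model used in the simulations; the parameters `K_N` and `V_T` are hypothetical values chosen for illustration.

```python
# Textbook square-law triode model (an approximation, not the SPICE model of the paper).
K_N = 5e-3   # A/V^2, hypothetical transconductance parameter
V_T = 0.4    # V, hypothetical threshold voltage


def drain_current(v_gs, v_ds):
    """Triode-region current: I_D = K_N * ((V_GS - V_T) * V_DS - V_DS**2 / 2)."""
    return K_N * ((v_gs - V_T) * v_ds - 0.5 * v_ds ** 2)


def conductance(v_gs, v_ds):
    """G_DS = I_D / V_DS (requires v_ds > 0)."""
    return drain_current(v_gs, v_ds) / v_ds


# Change in G_DS over the V_DS range (about 0 to 0.1 V) at a fixed V_GS.
dg_vds = abs(conductance(1.0, 0.1) - conductance(1.0, 0.001))
# Change in G_DS over the V_GS range (0.6 to 1.6 V) at a fixed V_DS.
dg_vgs = abs(conductance(1.6, 0.05) - conductance(0.6, 0.05))

print(f"Delta G_DS over V_DS sweep: {dg_vds:.3e} S")
print(f"Delta G_DS over V_GS sweep: {dg_vgs:.3e} S")
```

In this model $G_{DS} = K_N (V_{GS} - V_T - V_{DS}/2)$, so the $V_{DS}/2$ term is the higher-order correction that stays small as long as $V_{DS}$ is much smaller than $V_{GS} - V_T$.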
At any output node n, the drain currents of all the transistor synapses add up following Kirchhoff's Current Law [1], [16] to give $z_n$ as follows:
$$z_n = \left(\sum_{m=1}^{M} w_{n,m}\, x_m\right) + w_{n,0} \qquad (2)$$
Equation (2) represents the VMM operation, which happens in a parallel and energy-efficient fashion in a crossbar array based hardware NN, including the one designed here. The neuron circuit f of Fig. 3(b) (a differential amplifier circuit) operates on $z_n$ to carry out the "tanh" activation function [16], [18], [19] as follows:
$$y_n = f(z_n) = \frac{2}{1 + e^{-\lambda z_n}} - 1 \qquad (3)$$
The output voltage vs input current characteristic of the neuron circuit we design, obtained through SPICE simulations on Cadence Virtuoso, shows successful implementation of the "tanh" function (Fig. 3(c)). The parameter $\lambda$ in the function can be adjusted by changing a resistance in the neuron circuit, as shown in Fig. 3(b). The change in synaptic weight $\Delta w_{n,m}$ for the synapse connecting output node n with input node m is calculated using the Stochastic Gradient Descent (SGD) algorithm [16], [17]:
$$\Delta w_{n,m} = \frac{\eta \lambda}{2}\,(Y_n - y_n)\,(1 - y_n^2)\,x_m \qquad (4)$$
Here $Y_n$ denotes the desired (target) output at node n. We design an analog peripheral feedback circuit, shown in Fig. 4, on the Cadence Virtuoso circuit simulator to compute the weight update following equation (4). Amplifier, subtractor (Fig. 4(b)) and multiplier (Fig. 4(c)) blocks are designed for this purpose using conventional silicon transistors and op-amps made of such transistors [16]. $\eta$ is the learning rate in equation (4) and can be adjusted in the hardware (to control the convergence rate of SGD) by changing the amplification factor of the amplifier in Fig. 4(a), which is built from op-amps in the standard inverting configuration [20]. Building a subtractor block from op-amps is a standard process in analog electronics [20]. The multiplication operation is carried out with a single transistor, making use of the fact that $I_{DS}$ is proportional to $V_{DS}$ times $V_{GS}$ [14].
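As a software cross-check of equations (2) to (4), the forward pass and the weight update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the analog circuit itself: the function names, the toy weights, the input, the target and the values of `ETA` and `LAM` are all hypothetical, and the bias weights $w_{n,0}$ are held fixed for simplicity.

```python
import math


def neuron(z, lam):
    """Activation of equation (3): 2 / (1 + exp(-lam * z)) - 1, equal to tanh(lam * z / 2)."""
    return 2.0 / (1.0 + math.exp(-lam * z)) - 1.0


def forward(weights, bias, x, lam):
    """Equation (2) followed by equation (3) for every output node n."""
    z = [sum(w * x_m for w, x_m in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [neuron(z_n, lam) for z_n in z]


def weight_updates(y, target, x, eta, lam):
    """Equation (4): dw[n][m] = (eta * lam / 2) * (Y_n - y_n) * (1 - y_n**2) * x_m."""
    return [[0.5 * eta * lam * (t_n - y_n) * (1.0 - y_n ** 2) * x_m for x_m in x]
            for y_n, t_n in zip(y, target)]


def loss(y, target):
    """Squared error that the update of equation (4) descends."""
    return 0.5 * sum((t_n - y_n) ** 2 for y_n, t_n in zip(y, target))


# Hypothetical toy network: 2 inputs, 2 outputs.
weights = [[0.1, -0.2], [0.05, 0.3]]
bias = [0.0, 0.0]
x = [0.5, 1.0]
target = [1.0, -1.0]
ETA, LAM = 0.5, 1.0

y_before = forward(weights, bias, x, LAM)
dw = weight_updates(y_before, target, x, ETA, LAM)
new_weights = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weights, dw)]
y_after = forward(new_weights, bias, x, LAM)

loss_before = loss(y_before, target)
loss_after = loss(y_after, target)
print(f"loss before update: {loss_before:.4f}, after update: {loss_after:.4f}")
```

The factor $\lambda/2$ in equation (4) follows from differentiating equation (3): $f'(z_n) = \frac{\lambda}{2}(1 - y_n^2)$.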
The weight update ($\Delta w_{n,m}$) is calculated by the feedback circuit and generated in the form of a 1 ns long voltage pulse, since the input voltage pulse corresponding to a training sample is 1 ns long. A voltage controlled current source, based on the principle of the popular Howland current pump, is designed at the output stage. It converts the voltage pulse into a 1 ns long programming current pulse whose magnitude is proportional to $\Delta w_{n,m}$ (Fig. 4(a),(d)). The current pulse is applied at the gate of the proposed transistor synapse connecting output node n with input node m. Fig. 5 shows that when pulses of current magnitude 132 nA and 1 ns duration are applied with polarity such that the gate oxide charges up, the gate voltage $V_{GS}$, and hence $G_{DS}$ and the synaptic weight, increases linearly with pulse number (blue plot). Similarly, starting from the highest $G_{DS}$, programming pulses of opposite polarity discharge the oxide, and $G_{DS}$ (and hence the weight) decreases (orange plot). This symmetric (the conductance response trajectories for increase and decrease overlap) and linear nature of weight update with programming pulse number is absent in RRAM and PCM based synaptic devices, making implementation of on-chip learning in crossbar arrays made of such synapses quite challenging [1], [7]. The weight update process described above, repeated for every training sample in each epoch, results in on-chip learning.
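The linear and symmetric response described above is what one would expect from charging an ideal capacitor with constant-current pulses, where each pulse changes the gate voltage by $\Delta V = I \, \Delta t / C$. The sketch below illustrates this idealization only; the pulse current and duration are taken from the text, while the gate capacitance `C_GATE` and the resulting number of pulses are hypothetical, and gate leakage is ignored.

```python
I_PULSE = 132e-9          # A, programming current magnitude (from the text)
T_PULSE = 1e-9            # s, pulse duration (from the text)
C_GATE = 13.2e-15         # F, hypothetical gate capacitance (not given in the text)
V_MIN, V_MAX = 0.6, 1.6   # V, operating range of V_GS (from the text)
N_PULSES = 100            # follows from the hypothetical C_GATE: 10 mV per pulse


def apply_pulses(v_start, n_pulses, polarity):
    """Return the V_GS trajectory for n_pulses constant-current pulses of given polarity."""
    dv = polarity * I_PULSE * T_PULSE / C_GATE  # ideal capacitor: dV = I * dt / C
    v = v_start
    trajectory = [v]
    for _ in range(n_pulses):
        v = min(V_MAX, max(V_MIN, v + dv))  # keep V_GS inside the operating range
        trajectory.append(v)
    return trajectory


up = apply_pulses(V_MIN, N_PULSES, +1)      # potentiation
down = apply_pulses(up[-1], N_PULSES, -1)   # depression, starting from the top

print(f"V_GS after potentiation: {up[-1]:.3f} V, after depression: {down[-1]:.3f} V")
```

Because $G_{DS}$ is approximately linear in $V_{GS}$ in the chosen range, equal voltage steps in both directions translate into overlapping conductance trajectories for weight increase and decrease.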
III. ON-CHIP LEARNING PERFORMANCE
Using the Cadence Virtuoso circuit simulator, we next run SPICE simulations of the transistor synapse based crossbar FCNN array, the neuron circuit (Fig. 3(b)) and the SGD based weight update circuit (Fig. 4), all wired together in a closed loop as shown in Fig. 3(a). We train the FCNN on a popular machine learning dataset, Fisher's Iris dataset [21]. There are 16 input nodes in our FCNN array corresponding to 16 inputs: 4 input features of the flowers (petal width, petal length, sepal width and sepal length), each passed through 4 filters [22].
There are 3 output nodes corresponding to the three species into which the flowers need to be classified. 100 samples are used for training and 50 for testing, which together exhaust the entire dataset [21].
Within each epoch, each sample is trained for 1 ns, and hence each current pulse at the gate of a transistor synapse is 1 ns long (Fig. 5). With 100 training samples, each epoch thus lasts 100 ns. The training accuracy vs epoch plot is shown in Fig. 6(a), both for SPICE simulation of the proposed NN circuit in Fig. 3 and for execution of the same FCNN algorithm with SGD based weight update on a conventional computer in Python. Similar accuracy numbers are obtained. In Fig. 6(b), the cumulative energy consumed for weight updates in all the transistor synapses in our circuit is plotted as a function of time. After the 100th epoch, i.e. 10 µs, training (and test) accuracy reaches 90 percent, and the net energy consumed in training is as low as ≈ 50 fJ (Table I). Note that the energy consumed in the analog peripheral circuits is ignored in this calculation; only the energy consumed in the transistor synapses for weight update is considered. The peripheral circuits can be further optimized to lower their energy consumption during training.
IV. PERFORMANCE DEPENDENCE ON DEVICE VARIABILITY AND NOISE, AND COMPARISON WITH OTHER SYNAPSES
Our training accuracy is not affected much by 0-10 percent noise in the synaptic weights of the NN due to device variability, by similar noise in the inputs due to thermal noise in the input voltages, or by similar error in the computation of the weight update in the feedback circuit (Fig. 7). Such robustness in training is expected in a hardware NN [6], [23].
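One simple way to emulate such perturbations in a software model of the network is to scale each weight, input or computed update by a random factor. The sketch below is only an assumption about how the noise could be injected: the text does not specify the noise distribution, so a uniform multiplicative model is used here, and the function name `perturb` and the sample values are hypothetical.

```python
import random


def perturb(values, noise_fraction, rng):
    """Scale each value by a random factor in [1 - noise_fraction, 1 + noise_fraction]."""
    return [v * (1.0 + rng.uniform(-noise_fraction, noise_fraction)) for v in values]


rng = random.Random(0)           # fixed seed so that runs are repeatable
weights = [0.2, -0.5, 0.8]       # hypothetical synaptic weights
noisy_weights = perturb(weights, 0.10, rng)   # up to 10 percent deviation
clean_weights = perturb(weights, 0.0, rng)    # 0 percent noise leaves values unchanged

print(noisy_weights)
```

The same helper can be applied to the input vector or to the computed weight updates before each training step to study the three noise sources separately.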
The speed and energy performance of our transistor synapse based NN is compared in Table I with similar NNs designed using an HfO$_x$ based RRAM synapse (Verilog model from [9] used in our simulation) and a domain wall based spintronic synapse [11], [16], with respect to on-chip learning on the same Fisher's Iris dataset. Though two RRAM devices are used per synapse in our simulation to address the asymmetry issue in the conductance response [6], [8], [10], several long duration (6 µs) "Reset" pulses are still needed for successful weight update [6]. Hence much more energy is consumed in the RRAM based NN than in our transistor synapse based NN, which does not have that issue, as explained earlier [6]. Also, since a "Reset" pulse may or may not be needed for a particular synapse while training the NN on each sample in each epoch, the net training time varies in the case of the RRAM based NN, but it is certainly longer than for our proposed transistor based NN because of the long "Reset" pulses. The energy for on-chip learning is only one order of magnitude higher in the transistor synapse based NN than in the spintronic NN, and the speed is comparable. However, fabrication of the transistor synapse based NN is much less complicated than that of the spintronic NN, as explained earlier [12].
One issue with our proposed transistor synapse is that, after training, the weights (gate voltages) decay in about 1 ms due to leakage in the gate oxide, as observed from our SPICE simulations. However, the retention time is $10^6$ times longer than the time for training on each sample in an epoch (1 ns). For that retention factor, on-chip learning can still be achieved, as argued in [23], [24] and seen in our simulations (Fig. 6). For testing/inference after a certain time has elapsed since training, the proposed solution is to store the final weights after training in a floating gate synapse based NN for testing purposes. This needs to be done only once, so the endurance and energy issues of floating gate synapses do not become a major bottleneck [4], [5].
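The timing figures quoted in Sections III and IV can be cross-checked with a few lines of integer arithmetic. All input numbers below are taken from the text; only the variable names are introduced here for illustration.

```python
SAMPLE_NS = 1              # training time per sample in ns (from the text)
N_TRAIN = 100              # training samples per epoch (from the text)
N_EPOCHS = 100             # number of epochs reported (from the text)
RETENTION_NS = 1_000_000   # weight retention of about 1 ms, expressed in ns

epoch_ns = SAMPLE_NS * N_TRAIN          # duration of one epoch
training_ns = epoch_ns * N_EPOCHS       # duration of the full training run
retention_per_sample = RETENTION_NS // SAMPLE_NS
retention_per_training = RETENTION_NS // training_ns

print(f"epoch duration: {epoch_ns} ns")
print(f"total training time: {training_ns / 1000} us")
print(f"retention / per-sample time: {retention_per_sample}")
print(f"retention / total training time: {retention_per_training}")
```

Under these numbers the retention time also exceeds the duration of the complete 100-epoch training run by two orders of magnitude.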
V. CONCLUSION
We have proposed a conventional silicon MOSFET as a synapse and designed an analog hardware NN with it to demonstrate "on-chip" learning. A linear and symmetric conductance response and an easy, well-developed method of fabrication are the two main advantages of our proposed transistor synapse. Scaling up the integrated crossbar array and peripheral circuits designed here to handle more complex datasets will be the subject of our future study.
