Abstract-We have designed, fabricated, and successfully tested a prototype mixed-signal, 28×28-binary-input, 10-output, 3-layer neuromorphic network ("MLP perceptron"). It is based on embedded nonvolatile floating-gate cell arrays redesigned from a commercial 180-nm NOR flash memory.
I. INTRODUCTION
The idea of using nonvolatile floating-gate memory cells in analog and mixed-signal artificial neural network circuits has been around for almost three decades [1]. Until recently, most implementations were based on "synaptic transistors" [2], which may be fabricated using standard CMOS technology. Some sophisticated, energy-efficient systems have been demonstrated [3, 4] using this approach. However, synaptic transistors have relatively large areas (~10^3 F^2, where F is the minimum feature size), leading to larger time delays and energy consumption [5].
Fortunately, during the last decade nonvolatile floating-gate memory cells have not only been highly optimized and scaled down all the way to F ~ 20 nm, but may now also be embedded in CMOS integrated circuits [6]. These cells are quite suitable to serve as adjustable synapses in neuromorphic networks, provided that the memory arrays are redesigned to allow for individual, precise adjustment of the memory state of each device. Recently, such a modification was performed [7, 8] using the 180-nm ESF1 embedded commercial NOR flash memory technology of SST Inc. [6] (Fig. 1), and, more recently, the 55-nm ESF3 technology of the same company, with good prospects for scaling down to at least F = 28 nm. Though such a modification nearly triples the cell area, the area is still at least an order of magnitude smaller, in terms of F^2, than that of synaptic transistors [2-5].
The main result reported in this paper is the first successful use of this approach for the experimental implementation of a (so far, relatively simple) neuromorphic network, which can perform a high-fidelity classification of images of the standard MNIST benchmark, with record-breaking speed, density, and energy efficiency.
II. CELL CHARACTERIZATION FOR ANALOG OPERATION
Our network design (see below) uses energy-saving gate coupling [1, 5] of the peripheral and array cells, which works well in the subthreshold mode, with a nearly exponential dependence of the drain current I_DS of the memory cell on the gate voltage V_GS. Fig. 2 shows the dependences I_DS(V_GS) for several memory states of a 180-nm ESF1 NOR memory cell. With the requirement to keep the relative current fluctuations (Fig. 3b) below 1%, the dynamic range of the subthreshold operation is about five orders of magnitude, from ~10 pA to ~300 nA, with a gate voltage swing up to 1.5 V, depending on the memory state. As the inset in Fig. 2 shows, the subthreshold slope stays relatively constant for low-conductance memory states, becoming steeper at small gate voltages, apparently due to the cell's split-gate design.
The ESF1 flash technology guarantees a 10-year digital-mode retention at temperatures up to 125 °C [6]. Fig. 3 shows that these cells feature analog-level retention of at least a few days, with very low fluctuations of the output current. Other features of the ESF1 cell arrays, including the details of their modification, switching kinetics and its statistics, and an experimental demonstration of fast weight tuning with a ~0.3% accuracy, were reported in Refs. [7, 8].
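The role of the exponential I_DS(V_GS) dependence in gate coupling can be illustrated with a minimal numerical sketch. It assumes an idealized subthreshold model, I = I0 * exp((V_GS - V_th) / (n * V_T)), with the same slope factor n for the peripheral and array cells; the parameter values (I0, n, and the two threshold voltages) are hypothetical and are not measured ESF1 data. Under these assumptions, the ratio of the output to the input current, i.e. the effective weight, does not depend on the input current.

```python
import math

THERMAL_VOLTAGE = 0.0259  # kT/q near room temperature, in volts
SLOPE_FACTOR = 1.5        # hypothetical subthreshold slope factor n (assumption)
I0 = 1e-7                 # hypothetical current prefactor in amperes (assumption)


def drain_current(v_gs, v_th):
    """Idealized subthreshold current of a cell with threshold v_th."""
    return I0 * math.exp((v_gs - v_th) / (SLOPE_FACTOR * THERMAL_VOLTAGE))


def gate_voltage(i_in, v_th):
    """Gate voltage at which a cell with threshold v_th carries current i_in."""
    return v_th + SLOPE_FACTOR * THERMAL_VOLTAGE * math.log(i_in / I0)


def gate_coupled_output(i_in, v_th_peripheral, v_th_array):
    """The peripheral cell converts i_in to a voltage shared with the array cell's gate."""
    v_shared = gate_voltage(i_in, v_th_peripheral)
    return drain_current(v_shared, v_th_array)


# The effective weight is the same over the whole assumed input current range.
input_currents = (1e-11, 1e-9, 3e-7)
weights = [gate_coupled_output(i, 0.9, 1.0) / i for i in input_currents]
print(weights)
```

In this model the weight equals exp((V_th,peripheral - V_th,array) / (n * V_T)). If the slope factor differed between the two cells, the ratio would become input-dependent, which is the kind of error a state-dependent subthreshold slope introduces.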
III. NETWORK DESIGN
The implemented neuromorphic network (Figs. 4, 5) is a 3-layer (one-hidden-layer) MLP perceptron with 784 inputs, representing 28×28 black-and-white pixels of the input patterns (such as shown in Fig. 4a), 64 hidden-layer neurons, and 10 output neurons (Fig. 4b). Each neuron gets an additional input from a bias node. The hidden-layer neurons implement a rectified-tanh activation function (Figs. 4b, 5b).
The synaptic coupling between the neuron layers is provided by two crossbar arrays of the floating-gate memory cells. With the differential-pair implementation of each synapse [7], the total number of utilized floating-gate cells is 2×[(28×28+1)×64 + (64+1)×10] = 101,780. The desired synaptic weights are imported into the network by analog tuning of the memory state of each floating-gate cell, using special peripheral circuitry (Fig. 5a). For simplicity of this first, prototype design, the weights were tuned one by one, by applying proper bias voltage sequences to selected and half-selected lines [7]. The input pattern bits are shifted serially into a 785-bit register before each classification; to start the classification, the bits are read out in parallel into the network.
The vector-by-matrix multiplication in the first crossbar array is implemented by applying input voltages (either 4.2 V or 0 V) directly to the gates of the array cell transistors (Fig. 5b). The resulting source currents, each equal to the sum of the binary voltage inputs multiplied by the pre-imported analog weights, feed operational amplifier pairs (Fig. 5e). Signals from two differential cell rows are subtracted by feeding the output of one opamp of the pair to the input of its counterpart (Fig. 5d), with the output passed to the activation function circuit (Fig. 5f). The fully analog vector-by-matrix calculation in the second array is implemented using the gate-coupled approach [1, 5] (Fig. 3b). To minimize the error due to the subthreshold slope's dependence on the memory state (Fig. 2), we use a higher gate voltage range (1.1 V to 2.7 V), limited only by technology restrictions. The 10 voltage outputs of the network are measured externally.
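The network topology described above can be summarized by a short software model. This is only a sketch under stated assumptions: the rectified-tanh function is taken to be tanh(x) for positive x and zero otherwise, the output neurons are taken to be linear with the class chosen as the largest output, and the weights are random placeholders rather than trained values. All function and variable names are illustrative.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_OUTPUTS = 28 * 28, 64, 10


def rectified_tanh(x):
    # Assumed form of the rectified-tanh activation: zero for negative sums.
    return max(0.0, math.tanh(x))


def layer(inputs, weights, activation=None):
    """One row of weights per neuron; the last weight multiplies the bias node (fixed at 1)."""
    extended = list(inputs) + [1.0]
    sums = [sum(w * x for w, x in zip(row, extended)) for row in weights]
    return [activation(s) for s in sums] if activation else sums


def classify(pixels, w_hidden, w_output):
    """Return the index of the largest output and the 10 output values."""
    hidden = layer(pixels, w_hidden, rectified_tanh)
    outputs = layer(hidden, w_output)  # linear outputs (assumption)
    return max(range(N_OUTPUTS), key=lambda k: outputs[k]), outputs


# Placeholder weights and a random binary pattern, for illustration only.
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS + 1)] for _ in range(N_HIDDEN)]
w_output = [[rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN + 1)] for _ in range(N_OUTPUTS)]
pixels = [rng.randint(0, 1) for _ in range(N_INPUTS)]
label, outputs = classify(pixels, w_hidden, w_output)
print(label)
```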
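The cell count quoted above follows directly from the layer sizes, the single bias node per layer, and the two cells per differential synapse. The arithmetic can be checked with a few lines (variable names are illustrative):

```python
INPUTS = 28 * 28        # binary pixels
HIDDEN = 64             # hidden-layer neurons
OUTPUTS = 10            # output neurons
BIAS = 1                # one bias node feeding each layer
CELLS_PER_SYNAPSE = 2   # differential-pair implementation

first_array_synapses = (INPUTS + BIAS) * HIDDEN    # 785 * 64 = 50,240
second_array_synapses = (HIDDEN + BIAS) * OUTPUTS  # 65 * 10 = 650
total_cells = CELLS_PER_SYNAPSE * (first_array_synapses + second_array_synapses)
print(total_cells)  # 101780
```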
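The differential computation in the first array can be sketched numerically as follows. The mapping of a signed weight onto a pair of non-negative cell values (one cell of the pair left at zero) is an assumption made for illustration; the text specifies only that each synapse uses a differential pair and that the two row signals are subtracted. Currents are in arbitrary units, and the function names are hypothetical.

```python
def split_weight(w):
    """Map a signed weight onto (positive-row, negative-row) non-negative cell values."""
    return (w, 0.0) if w >= 0 else (0.0, -w)


def differential_row_output(bits, weights):
    """Sum the cell currents selected by the binary inputs, then subtract the two rows."""
    pairs = [split_weight(w) for w in weights]
    i_plus = sum(b * p for b, (p, _) in zip(bits, pairs))
    i_minus = sum(b * m for b, (_, m) in zip(bits, pairs))
    return i_plus - i_minus


# Four binary inputs and four signed weights: 0.5 - 0.3 + 0.1 = 0.3 (up to rounding).
result = differential_row_output([1, 0, 1, 1], [0.5, -0.2, -0.3, 0.1])
print(result)
```

Because the inputs are binary, each cell either contributes its full programmed current or nothing, so the row current is a plain sum over the active inputs.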
IV. EXPERIMENTAL RESULTS
The synaptic weights were calculated in an external computer using the standard error backpropagation algorithm, and then "imported" into the mixed-signal circuit, i.e. used to tune the memory state of each array cell to the proper value. In the reported first experiments, only ~30% of the cells were tuned, and the weight import accuracy for a single cell tuning was limited to 5%, to decrease the import time. As Fig. 7 indicates, some of the already tuned cells were disturbed beyond the target accuracy during the subsequent weight import. At this preliminary testing stage, these cells were not re-tuned, in part because even for such a rather crude weight import, the experimentally tested classification fidelity (94.65%) on the MNIST benchmark test patterns is already remarkably close to the simulated value (96.2%) for the same network (Fig. 8), and not too far from the maximum fidelity (97.7%) of an MLP perceptron of this size.
Excitingly, such classification fidelity is achieved at an ultralow (sub-20-nJ) energy consumption per average classified pattern (Fig. 9a), and an average classification time below 1 µs (Fig. 9b). (The upper bound of the energy is calculated as the product of the measured average power consumed by the network, 5.6 mA × 2.7 V + 2.9 mA × 1.05 V ≈ 20 mW, and the upper bound, 1 µs, of the average signal propagation delay. A more accurate measurement of both the time delay and the energy would require a redesign of the signal input circuit, which is currently rather slow; see Fig. 9b.)
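The energy bound quoted above can be reproduced from the two supply currents and voltages and the 1-µs delay bound. In the short check below, the unrounded power comes out near 18.2 mW, which the text rounds up to about 20 mW, so the resulting energy stays below 20 nJ. Variable names are illustrative.

```python
# (current in amperes, voltage in volts) for the two measured supplies
supplies = [(5.6e-3, 2.7), (2.9e-3, 1.05)]

power_w = sum(current * voltage for current, voltage in supplies)
delay_upper_bound_s = 1e-6  # upper bound of the average propagation delay
energy_upper_bound_j = power_w * delay_upper_bound_s

print(f"power  = {power_w * 1e3:.2f} mW")
print(f"energy = {energy_upper_bound_j * 1e9:.2f} nJ per pattern (upper bound)")
```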
V. DISCUSSION AND SUMMARY
The achieved speed and energy efficiency are much better than those demonstrated, for the same task, by any digital network we are aware of. There are also many ready reserves in the circuit design. For example, calculations show that current mirrors in neuron circuits may strongly decrease their currently dominant contribution to latency and energy (Fig. 9). Our estimates (Fig. 10) show that using such mirrors, and the more advanced 55-nm ESF3 memory technology of the same company [6], may provide at least a ~100× advantage in the operation speed, and an enormous, >10^4× advantage in the energy efficiency, over the state-of-the-art purely digital (GPU and custom) circuits [9], at classification of large, complex patterns using deep-learning convolutional networks; see, e.g., Ref. [10].
To summarize, even the preliminary results reported here give an important demonstration of the exciting possibilities opened for neuromorphic networks by mixed-signal circuits based on industrial-grade floating-gate memory technologies.
[Figure panel label: 32,139 non-zero weights]
[Figure axis: logarithmic scale, ticks from 1.E-13 to 1.E-05]
[Figure caption, beginning missing] ... [10], with ~0.65×10^6 neurons, at its various implementations. The estimates for floating-gate networks take into account the 55×55 = 3,025-step time-division multiplexing (natural for this particular network), and the experimental values of the subthreshold current slope of the cells; see, e.g., the inset in Fig. 2. The (very crude) estimate of the human visual cortex power consumption is based on the ~25 W power consumption of ~10^11 neurons of the whole brain, and a 30-ms delay of the visual cortex [11], and assumes the uniform distribution of the power over the neurons, and the same number of neurons participating in a single-pattern classification process.
[Figure caption] The typical output signal dynamics after a turn-on of the voltage shifter power supply, measured simultaneously at the network input, the output of a hidden-layer neuron, and all its outputs. (The actual input voltage is ×10 larger.) The oscillatory behavior of the outputs is a result of a suboptimal phase stability design of the operational amplifiers. Before it has been improved, and the input circuit sped up, we can only claim a sub-1-µs average time delay of the network, though it is probably closer to 0.5 µs.
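The crude visual-cortex estimate described in the figure caption above can be worked through numerically. The sketch below is one possible reading of that caption, not the authors' stated calculation: it assumes that the brain power is shared uniformly among all neurons, that ~0.65×10^6 neurons (the size of the network of Ref. [10]) take part in one classification, and that each classification lasts 30 ms. Variable names are illustrative.

```python
brain_power_w = 25.0           # whole-brain power consumption from the caption
brain_neurons = 1e11           # number of neurons in the whole brain
participating_neurons = 0.65e6 # assumed equal to the neuron count of the network of Ref. [10]
visual_delay_s = 30e-3         # visual cortex delay from the caption

power_per_neuron_w = brain_power_w / brain_neurons
energy_per_pattern_j = power_per_neuron_w * participating_neurons * visual_delay_s

tdm_steps = 55 * 55            # time-division multiplexing steps quoted in the caption

print(f"energy per pattern ~ {energy_per_pattern_j:.2e} J")
print(f"multiplexing steps = {tdm_steps}")
```

Under these assumptions the energy per classified pattern is a few microjoules.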
