> This work has been submitted to the IEEE TVLSI for possible publication.

Abstract: A memristive neural network computing engine based on the CMOS-compatible charge-trap transistor (CTT) is proposed in this paper. CTT devices are used as analog multipliers. Compared to digital multipliers, CTT-based analog multipliers show dramatic area and power reduction (>100x). The proposed memristive computing engine is composed of a scalable CTT multiplier array and energy-efficient analog-digital interfaces. By implementing a sequential analog fabric (SAF), the engine's mixed-signal interfaces are simplified and the hardware overhead remains constant regardless of the size of the array. A proof-of-concept 784 by 784 CTT computing engine is implemented in TSMC 28 nm CMOS technology and occupies 0.68 mm². It achieves 69.9 TOPS at a 500 MHz clock frequency and consumes 14.8 mW. As an example, we use this computing engine to address a classic pattern recognition problem, classifying handwritten digits in the MNIST database, and obtain performance comparable to state-of-the-art fully connected neural networks using 8-bit fixed-point resolution.
I. INTRODUCTION
Deep learning using convolutional and fully connected neural networks has achieved unprecedented accuracy on many modern artificial intelligence (AI) applications, such as image, voice, and DNA pattern detection and recognition [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]. However, one of the major problems that has hindered its commercial feasibility is that neural networks require substantial computation resources even for very simple tasks. State-of-the-art digital computation processors such as CPUs, GPUs, or DSPs, together with their memory interface bottleneck [33] [34] [35] [36], in embedded system-on-chip (SoC) systems [11] [12] [13] cannot meet the required computation throughput within the strict power and cost constraints of many practical applications.
In addition to the above limitation, most modern computation processors are implemented on the Von Neumann architecture. As transistor scaling reaches its manufacturing limit, the computation throughput achievable with current architectures will also inevitably saturate [14]. Recent research reports the development of analog computing engines [15] [16]. Compared to traditional digital computation, analog computing shows tremendous advantages in power, design cost, and computation speed. Among the many analog computing systems, memristor-based ones have been widely reported [17] [18]. However, these devices require new materials or extra manufacturing processes, which are not currently supported in major CMOS foundries. Thus, they cannot be embedded into commercial CMOS chips.
Recently, charge-trap transistors (CTTs) were reported to be used as digital memory devices in [19] [20] with reliable trapping and de-trapping behavior. Unlike other charge-trapping devices such as floating-gate transistors [21], transistors with an organic gate dielectric [22], and carbon nanotube transistors [23], CTTs are manufacturing-ready and fully CMOS-compatible in terms of process and operation. The charge-trapping phenomenon in a transistor with a high-k metal gate has traditionally been considered a reliability concern, causing bias temperature instability among other effects. However, it was recently discovered that with a drain bias during the charge-trapping process, many more carriers can be trapped in the gate dielectric very stably, and more than 90% of the trapped charge can be retained after 10 years even when the device is baked at 85 °C [24].
More interestingly, an analog synapse array was demonstrated to achieve unsupervised learning computation in [25]. However, the demonstrated analog synapse array included only 27 synapses, which is too few to perform any practical neural network computation. It also did not consider the analog and digital interfaces, which are the most energy-consuming components in analog neuromorphic computing systems.
In this paper, we propose a memristive computing engine based on the charge-trap transistor (CTT). The proposed memristive computing engine consists of 784 by 784 CTT analog multipliers and achieves 100x power and area reduction compared to the conventional digital approach. By implementing a novel sequential analog fabric (SAF), the mixed-signal interfaces are simplified and the system requires only an 8-bit analog-to-digital converter (ADC). The top-level system architecture is shown in Fig. 1 and is discussed in detail in Section IV. The main contributions of this paper include:
(1) An 8-bit 784×784 parallel fully connected neural network (FCNN) memristive computing engine, using CTT-based analog multipliers, is designed and achieves dramatic area and power reduction compared to a conventional digital computing engine.
(2) Mixed-signal analog-digital interfaces are developed to flexibly store, calibrate, or re-process inter-layer partial calculation results to guarantee analog computation accuracy.
(3) A sequential analog fabric (SAF) is invented to simplify the interfaces between the analog and digital domains by eliminating the digital-to-analog converter (DAC) otherwise required, and to enable the parallel computation of multiple neurons.
(4) A practical application, recognition of handwritten digits, using different configurations of multilayer neural network structures, is simulated and analyzed on the MNIST dataset based on CTT experimental data.
(5) A bit-resolution requirement study is performed, showing that an 8-bit fixed-point data format can achieve performance similar to that of a 32-bit floating-point data format (difference less than 2%).
The paper is organized as follows. We first introduce the basics of CTT device physics and discuss how to use CTT devices to perform analog multiplication in Section II. In Section III, system-level challenges and considerations are described. The detailed building block designs, operations, and example experiment results are reported in Sections IV, V, and VI, respectively. Finally, the conclusion is drawn in Section VII.
II. CHARGE-TRAP-TRANSISTOR DEVICE INTRODUCTION

A. CTT Basics
The charge-trapping phenomenon is a well-known effect in flash memory devices [27]. However, it is not preferred for high-performance logic or low-cost foundry technologies due to additional processes and voltage incompatibility. A fully logic-compatible CTT, without added process complexity, has been measured and modeled in 22 nm planar and 14 nm FinFET technology platforms [28]. With enhanced and stabilized charge-trapping behavior, CTTs are promising candidates for basic analog computing elements. N-type CTTs with an interfacial layer (IFL) of SiO2 followed by an HfSiOx layer as the gate dielectric are used in [28] as multi-time programmable memory elements. It should be noted that, although this is demonstrated only on planar SOI devices, the mechanisms apply to bulk substrates of FinFETs as well.
A schematic of the basic operation of a CTT device is depicted in Fig. 2. The device threshold voltage VT is modulated by the charge trapped in the gate dielectric of the transistor. VT increases when positive pulses are applied to the gate to trap electrons in the high-k layer, and decreases when negative pulses are applied to the gate to de-trap electrons from the high-k layer. CTT devices can be programmed by applying logic-compatible voltages. For example, in [28], 2 V pulses were used with a 1.3 V drain voltage during the charge-trapping operation, while -1.3 V pulses were used with a 0 V drain voltage during the charge de-trapping operation. Programming efficiency is highest at the beginning of the program operation and decreases with increasing programming time as more and more of the available electron traps are filled. A drain bias enhances and stabilizes the charge-trapping process. The trapped charge dissipates very slowly (> 8 years at 85 °C), allowing the device to be used for embedded nonvolatile memory [19]. Furthermore, because CTTs are commercially available standard NMOS transistors, the process variation is well controlled and the yield is high. Therefore, using a large number of CTTs in large-scale analog computing engines is more attractive than using other emerging memristive devices. In addition, a very low energy consumption per synaptic operation, at the picojoule level, is reported.
B. CTT-based Multiplication
For most neuromorphic networks, the training and inference operations rely heavily on vector and matrix multiplications in both feedforward and error back-propagation. Fig. 3 (a) shows an M-by-N fully connected neural network, or fully connected layer, in which Xi is the input data and Yj is the neuron output. The output results and input data are connected by the weighted sum

$$Y_j = \sum_{i=1}^{M} X_i \cdot W_{i,j} \qquad (1)$$

where $W_{i,j}$ is the synaptic weight between the input neuron i and the output neuron j.
The precise programmability of the CTT's threshold voltage makes it possible to store weight values locally and perform accurate analog multiplication. When a CTT is biased in the triode region, its drain current is given by Equation (2):

$$I_D = \mu_n C_{ox} \frac{W}{L}\left[(V_{GS} - V_T)\,V_{DS} - \frac{1}{2}V_{DS}^{2}\right] \qquad (2)$$

An M-by-N CTT multiplication array, shown in Fig. 3 (b), implements all the necessary computation of an M-by-N fully connected (FC) neural network. All the weight values are pre-programmed into the VT of each CTT element (NRi,j). The VT of each transistor in the CTT array can be programmed by the number of pulses in the positive (trapping) or negative (de-trapping) pulse trains. Due to the fast-reading and slow-writing nature of CTTs [19], it is desirable to store the weights in the CTT threshold voltages and supply the input (multiplicand) values in the neural network inference mode, which does not require the weight values to change once they are programmed according to the pre-trained model.
While VT stores the weight value, the input data value can be fed to VDS by a voltage reference source. VGS in Equation (2) is held at a fixed value during operation to satisfy the triode-region condition. The output currents of the CTT elements in each row are summed in the row resistor. If all the input data values are available at the same time, all the output data are ready within one clock cycle. The voltages across the row resistors are given by Equations (3), (4), and (5).
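As a sketch of the structure these equations describe, assume the textbook long-channel triode expression for each CTT and an ideal row resistor of value $R$ whose voltage drop does not disturb the drain bias. The symbols $k = \mu_n C_{ox} W/L$, $R$, and $I_{D,i,j}$ are notation assumed here for illustration. Summing the element currents of row $j$ and expanding the bracket gives:

```latex
\begin{align}
V_{out,j} &= R \sum_{i=1}^{M} I_{D,i,j} \\
I_{D,i,j} &= k\left[(V_{GS}-V_{T,i,j})\,V_{DS,i,j} - \tfrac{1}{2}V_{DS,i,j}^{2}\right] \\
V_{out,j} &= kR \sum_{i=1}^{M} (V_{GS}-V_{T,i,j})\,V_{DS,i,j}
            \;-\; \frac{kR}{2}\sum_{i=1}^{M} V_{DS,i,j}^{2}
\end{align}
```

Under these assumptions the effective weight of each element is the overdrive $V_{GS}-V_{T,i,j}$, and the second sum contains no threshold-voltage term.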
Here, Vout,j represents the output of neural cell Yj at Row j, VDS,i,j is derived from the input image pixel value, and VT,i,j is programmed by a pulse count based on the Wi,j value of the pre-trained model. As shown in Eq. (5), the right side of the equation separates into two terms. The first term is the desired multiplication result, while the second term is an unwanted input-data-dependent offset. Fortunately, the input data is known in the system, so the offset can easily be calibrated out in the digital domain after the analog-to-digital converter.
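The offset removal described above can be illustrated numerically. The following is a minimal sketch assuming the textbook triode model and an ideal row resistor; the function names and the values of `k`, `r_row`, and `v_gs` are hypothetical and are not taken from device data.

```python
import numpy as np

def row_voltages(v_ds, v_t, v_gs=1.0, k=1e-4, r_row=100.0):
    """Idealized outputs of an M-by-N triode-biased multiplier array.

    v_ds: shape (M,) drain voltages (inputs).
    v_t:  shape (M, N) threshold voltages (stored weights).
    k, r_row and v_gs are illustrative values, not measured parameters.
    """
    overdrive = v_gs - v_t                                   # (M, N)
    i_d = k * (overdrive * v_ds[:, None] - 0.5 * v_ds[:, None] ** 2)
    return r_row * i_d.sum(axis=0)                           # (N,)

def remove_offset(v_out, v_ds, k=1e-4, r_row=100.0):
    """Add back the input-dependent term (k*R/2) * sum(V_DS^2)."""
    return v_out + 0.5 * k * r_row * np.sum(v_ds ** 2)

rng = np.random.default_rng(0)
v_ds = rng.uniform(0.0, 0.1, size=4)        # small V_DS keeps devices in triode
v_t = rng.uniform(0.3, 0.6, size=(4, 3))
v_out = row_voltages(v_ds, v_t)
corrected = remove_offset(v_out, v_ds)

# Pure weighted sum with effective weight (V_GS - V_T), for comparison.
expected = 1e-4 * 100.0 * ((1.0 - v_t) * v_ds[:, None]).sum(axis=0)
```

Because the offset depends only on the known inputs, one correction value per input vector serves every output of the array in this idealized model.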
III. SYSTEM-LEVEL ARCHITECTURE
A. System-level Considerations
Section II introduces the fundamentals of the CTT device and shows how to use CTT devices in an array to compute vector or matrix multiplication effectively in parallel. In this section, system-level considerations are discussed.
To compare CTT-based computation with conventional digital domain computation, Table I in [26] summarizes energy consumption and area occupation of 8-bit to 32-bit Multiply-Accumulate (MAC) operations in TSMC 40nm technology node.
Compared with standard digital MAC operations, a single CTT device's energy consumption per multiplication operation is one order of magnitude lower than that of its 32-bit floating-point digital counterpart. The area advantage is even clearer: CTT-based computation offers more than 100 times area reduction. Although a CTT array is promising for low-power, high-performance parallel matrix computation, three important problems need to be solved before it can be used in practice:
(1) An efficient interface between the analog and digital domains that enables fast and easy data format conversion between them.
(2) A scalable and reconfigurable array that computes multiple neuron values in parallel.
(3) A robust training and inference algorithm that tolerates nonlinearity, process variation, and other computing uncertainties.
In this paper, we focus on solving (1) and (2).
B. Top-level System Architecture
To solve the aforementioned issues, we propose a CTT-based array architecture for efficient fully connected layer computation, shown in Fig. 1. The system includes a 784 × 784 CTT multiplier array and various mixed-signal interfaces, such as a tunable low-dropout regulator (LDO), an analog-to-digital converter (ADC), and a novel sequential analog fabric (SAF) to enable parallel analog computing.
The number of array elements is scalable, while the hardware overhead of the mixed-signal interfaces is almost constant. The intermediate data can be stored in any on-chip or off-chip memory. In this proof-of-concept prototype, the intermediate data is stored in PC memory through a UART interface.
The sequential analog fabric array block is critical for feeding multiple drain voltages in parallel using only one voltage reference. A single 8-bit ADC is used to read out the partial summation results in each row for each output neuron. The detailed design of these key building blocks is discussed in the next section. The results of the required-resolution study are shown in Section V.
IV. BUILDING BLOCK DESIGNS AND OPERATIONS
A. Design of Key Building Blocks
1) Sequential Analog Fabric
A sequential analog fabric (SAF) is implemented in the engine to enable parallel analog computation for multiple neurons. When a set of inputs is fed into the SAF, the fabric first converts the parallel input bits of each data word into a bit sequence. Each bit is then sent in turn to the gate of a switching transistor, which connects or disconnects the drain of the corresponding CTT. The results computed by the analog multipliers are summed at the row resistors and sampled at the ADC input. The outputs for the different bits are accumulated in the digital domain after ADC sampling. The computation of each bit for each output neuron takes one clock cycle, so for our 8-bit data format, all 784 outputs can be obtained in 8×784 = 6,272 cycles. Fig. 4 shows the SAF block and its operation during the computation. The switch size of the analog fabric is tuned to keep its RON below 20 Ohm, so that it neither makes the pre-amplifier design too challenging nor affects the overall computation accuracy.
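The bit-serial scheme can be sketched in a few lines. This is a behavioral illustration only: it assumes unsigned integer inputs, hypothetical integer weights, and an ideal (noise-free, unquantized) analog row sum, and it does not model how rows are scheduled onto the ADC. The function name `bit_serial_matvec` is introduced here for illustration.

```python
import numpy as np

def bit_serial_matvec(x, w, n_bits=8):
    """Accumulate per-bit partial sums, least significant bit first.

    x: shape (M,) unsigned integers below 2**n_bits.
    w: shape (M, N) integer weights (hypothetical values).
    """
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)
    acc = np.zeros(w.shape[1], dtype=np.int64)
    for b in range(n_bits):
        bit_plane = (x >> b) & 1          # on/off state of each input switch
        partial = bit_plane @ w           # row sum, idealized as exact
        acc += partial * (1 << b)         # digital shift-and-add after the ADC
    return acc

rng = np.random.default_rng(1)
x = rng.integers(0, 256, size=16)
w = rng.integers(-8, 8, size=(16, 4))
result = bit_serial_matvec(x, w)
```

Each pass through the loop drives the array with a one-bit input vector, so every multiplier sees either the fixed reference voltage or nothing, and the binary weighting of the bits is restored digitally.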
Since only one bit of each input is sent to the multiplication array at a time, the CTTs' drain voltage is either a fixed voltage or floating. This means that the voltage reference from the LDO is constant. Thus, the nonlinearity introduced by VDS becomes a constant offset in the computation. Compared with regular analog computing, no digital-to-analog converter (DAC) is required to generate multi-level input voltages for the CTT array. In addition, since the applied voltage is constant, the required dynamic range of the sampling ADC is also reduced.
Besides reducing the mixed-signal interfaces, the analog fabric also improves the engine performance by allowing the data for multiple neurons to be fed into the CTT multiplier array simultaneously. As the input drain voltage to each multiplier is fixed, only a single switch is required to turn the multiplier on or off based on the current input bit value.
2) Analog-to-Digital Converter
To digitize the computed result of the CTT multiplication array, an 8-bit low-power SAR ADC is implemented using the asynchronous architecture in Fig. 5, which achieves better power/speed performance than its synchronous counterparts and does not require multiple phase-matched ADC clocks to be distributed. The SAR ADC is connected to the amplifier's output to sense the computed analog voltage. To improve efficiency, the SAR ADC uses sub-radix [31] and a two-capacitor DAC [32] to provide over-range protection against capacitor mismatch and insufficient settling, at the expense of one more conversion cycle.
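For readers unfamiliar with successive approximation, the basic search can be sketched as below. This is an ideal radix-2 SAR loop with a perfect DAC and comparator, shown only to illustrate the principle; it is not the asynchronous sub-radix, two-capacitor design described above, and the function name and `v_ref` value are assumptions.

```python
def sar_convert(v_in, v_ref=1.0, n_bits=8):
    """Ideal radix-2 successive approximation; returns the output code."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        # Keep the bit if the trial DAC level does not exceed the input.
        if trial * v_ref / (1 << n_bits) <= v_in:
            code = trial
    return code
```

A sub-radix design uses bit weights with a ratio below two, which adds redundancy so that an early wrong decision can still be corrected by later bits, at the cost of extra conversion cycles.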
The comparator in Fig. 6 uses a double-tail latch topology with an integrator (M1P/M1N) followed by three parallel differential pairs (M2aP/M2aN, M2bP/M2bN, M3bP/M3bN) and a regenerative latch to accommodate the low 1 V supply voltage. The latch reset differential pairs help to minimize the regeneration time by minimizing the device capacitances. When clk is low, the nodes dip and dim are reset to the supply while the outputs op1 and on1 are discharged to ground. When clk goes high, dip and dim begin discharging to ground while the differential input signal VIP − VIN is integrated and amplified to dip − dim. When dip or dim is low enough to turn on M2aN or M2aP, regeneration is triggered. For offset calibration, a small differential pair injecting a correction current is added at the latch input instead of a capacitive load on the input transistors, because a heavy capacitive load increases the integration time and reduces the speed.
B. Operation Procedure
The operation of the proposed engine is simple and effective. The pre-trained weight values are first mapped to the conductance or threshold voltage of the CTTs and then written into the array by counted pulse generators. In each column, the drains of the CTTs are connected together to reduce the input-port hardware overhead. The drain voltage represents the bit values of the input. To enable parallel computation, each input value is decomposed into 8 bits and fed into the array in sequence, which is handled by the SAF block. The first operation in the digital domain after ADC sampling is sequential accumulation, which sums the digitized partial results for all the decomposed bit components from the SAF and recovers the complete result at full resolution.
Before the actual computation starts, a group of calibration data with known input values is loaded into the CTT array. The correct calculation results for these inputs are already stored in the digital domain. The calculation results from the array are sampled and fed into the calibration algorithm. This process is the calibration initialization.
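The calibration algorithm itself is not specified here. As one hedged illustration, the sketch below assumes that the analog path introduces only a linear gain and offset error and fits a correction by least squares from the known calibration results; the function names and the numerical values are hypothetical.

```python
import numpy as np

def fit_linear_calibration(measured, reference):
    """Least-squares gain and offset mapping measured values to references."""
    gain, offset = np.polyfit(measured, reference, deg=1)
    return gain, offset

def apply_calibration(measured, gain, offset):
    """Correct raw readings with the fitted gain and offset."""
    return gain * np.asarray(measured, dtype=float) + offset

reference = np.array([0.0, 50.0, 100.0, 150.0, 200.0])  # stored correct results
measured = 0.9 * reference + 7.0                         # hypothetical gain/offset error
gain, offset = fit_linear_calibration(measured, reference)
recovered = apply_calibration(measured, gain, offset)
```

A real system may need a per-row fit or a higher-order correction if the residual nonlinearity is significant; the linear model is only the simplest case.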
For a 784 × 784 CTT array, the number of clock cycles needed to write all the weights equals 784 times the largest pulse number, because 784 counted-pulse generators program the CTT devices column by column and the largest pulse number determines how quickly the weight programming of one column finishes. This process might be quite slow and may need an extra error-correction algorithm to maintain weight accuracy. Once programming is done, the values are nonvolatile, and forward propagation (inference) is fast because of the fast-reading feature of CTTs. Consequently, the proposed computing engine mainly targets the inference mode rather than the training process.
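Two timing figures of the engine follow from simple arithmetic, sketched below. The programming estimate uses a hypothetical largest pulse count of 256. The throughput estimate is one reading of the engine's peak rate, assuming the whole array is active on each of the 8 bit cycles and that one MAC counts as two operations; the function names are introduced here for illustration.

```python
ARRAY_ROWS = 784
ARRAY_COLS = 784
INPUT_BITS = 8
CLOCK_HZ = 500e6

def programming_cycles(max_pulses, columns=ARRAY_COLS):
    """Cycles to program the array column by column."""
    return columns * max_pulses

def macs_per_cycle(rows=ARRAY_ROWS, cols=ARRAY_COLS, bits=INPUT_BITS):
    """8-bit MACs per cycle if every multiplier works on each bit cycle."""
    return rows * cols // bits

def tera_ops_per_second(macs, clock_hz=CLOCK_HZ):
    """Count each MAC as two operations (multiply and add)."""
    return 2 * macs * clock_hz / 1e12

example_cycles = programming_cycles(max_pulses=256)  # 256 is a hypothetical pulse count
peak_macs = macs_per_cycle()
peak_tops = tera_ops_per_second(peak_macs)
```

With these assumptions, `peak_macs` evaluates to 76,832 and `peak_tops` to about 76.8.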
The system is able to achieve a throughput of 76,832 MACs per clock cycle. Equivalently, it is around 76.8 TOPS at a 500 MHz clock frequency. The detailed flow chart of the procedure is shown in Fig. 7.
Handwritten digit recognition is an important problem that has been used as a benchmark for pattern recognition and machine learning algorithms for many years. The freely available Modified National Institute of Standards and Technology (MNIST) database of handwritten digits has become a standard for quickly testing machine learning algorithms for this purpose [29]. Samples of the 28×28-pixel images in MNIST are displayed in Fig. 8.
In this paper, we designed three different configurations of fully connected neural networks for handwritten digit recognition and used our proposed memristive computing engine to compute them. The number of array elements is chosen based on the image size in the MNIST database. The CTT device model comes from the experimental results in [25]. With mixed-signal analog-digital interfaces, the inter-layer partial results can be stored in any type of available memory system. This is necessary because it allows digitally assisted calibration and optimization algorithms to be applied seamlessly to guarantee analog computing accuracy. In this proof-of-concept prototype, the results are stored on the PC's hard drive through a low-speed UART interface.
The impacts of the resolution of analog and digital interfaces are studied and the simulation results are shown in Fig. 9 . The number of bits is swept from 1 bit to 16 bits for three different network structures: Case (1): One-layer network without hidden layers; Case (2): Two-layer network with 300 neurons in the hidden layer; Case (3): Three-layer network with 300 and 100 hidden neurons respectively in the two hidden layers. Recognition accuracies of 69.8%, 94.2%, and 95.7% are achieved in Case (1), (2), and (3) respectively using 16-bit fixed-point resolution on 10,000 testing images in MNIST database.
When the resolution is less than 5 bits, there are too many overflows and underflows, which makes the accuracy very low for all network configurations. However, when the resolution is between 6 bits and 16 bits, the recognition accuracies improve significantly and are comparable to those obtained with the 32-bit floating-point data format.
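The effect of word length can be illustrated with a simple saturating fixed-point quantizer. This is a sketch only: the signed format with `total_bits - 1` fractional bits and the uniformly distributed test values are assumptions made for illustration and do not reproduce the network simulation.

```python
import numpy as np

def quantize_fixed_point(x, total_bits, frac_bits):
    """Round to signed fixed point, saturating on overflow."""
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (total_bits - 1))
    q_max = 2 ** (total_bits - 1) - 1
    codes = np.clip(np.round(np.asarray(x, dtype=float) * scale), q_min, q_max)
    return codes / scale

rng = np.random.default_rng(2)
values = rng.uniform(-1.0, 1.0, size=1000)

# Worst-case quantization error for several word lengths.
errors = {
    bits: float(np.max(np.abs(values - quantize_fixed_point(values, bits, bits - 1))))
    for bits in (2, 4, 8, 16)
}
```

In this model, values beyond the representable range saturate (overflow) and values smaller than half a step round to zero (underflow), so the worst-case error shrinks as bits are added.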
If the 8-bit resolution is chosen as proposed in Fig. 1, more than 94% accuracy can be obtained in Cases (2) and (3). In all three cases, the accuracy difference between 32-bit floating-point and 8-bit fixed-point is within 2%. Compared to 16-bit or 32-bit computations, the 8-bit resolution reduces the hardware overhead significantly at a small cost in accuracy.
The memristive computing engine is implemented in TSMC 28 nm CMOS HPM standard VT technology. To evaluate the area, power, and critical path of the pulse generator and controller, we developed the register-transfer level (RTL) design in Verilog and synthesized it using Synopsys Design Compiler. We placed and routed the engine using Cadence Innovus. The 8-bit ADC is a silicon-proven IP in the same technology. The dynamic and static power consumption is estimated with Synopsys PrimeTime. The other parts are designed and simulated in Cadence Virtuoso. The layout view is shown in Fig. 10. The total core area is 0.68 mm², and the area breakdown is shown in Fig. 11. Table II compares the CTT engine with a pure digital computing engine in terms of process, area, power, clock speed, peak MAC numbers, and other metrics. The CTT-based memristive computing engine occupies less than 12% of the area while providing more than 1000 times the computation resources.
VII. CONCLUSION
We have demonstrated that the CTT, as a fully CMOS-compatible non-volatile analog device, can be used in an analog computing engine to implement fully connected neural networks. The proposed architecture, with novel mixed-signal analog-digital interfaces, enables the computation of multi-layer fully connected neural networks, and inter-layer partial calculation results can be flexibly stored in any type of available memory or processed with any calibration and optimization algorithm to guarantee analog computing accuracy. A 784 × 784 CTT array is used for the handwritten digit recognition problem, and more than 95% accuracy is achieved with the 8-bit fixed-point analog-digital interface.
Finally, a physical design is provided using standard TSMC HPM 28nm PDK to estimate area and power consumption.
Since high-k gate dielectrics are expected to be present in all current and future CMOS technology nodes, the integration of the proposed architecture with other functional components should be seamless. The findings of this paper pave the way to an ultra-large scale, low power, low cost and high performance CMOS intelligent system. 
