In-memory computing with analog nonvolatile memories can accelerate the in situ training of deep neural networks. Recently, we proposed a synaptic cell consisting of a ferroelectric transistor (FeFET) and two CMOS transistors (2T1F) that exploits hybrid precision for training and inference. The cell overcomes the challenges of nonlinear and asymmetric weight update and achieves training accuracy nearly comparable to software at the algorithm level. In this paper, we further present circuit-level benchmark results for this hybrid precision synapse in terms of area, latency, and energy. The corresponding array architecture is presented and the array-level operations are illustrated. The benchmark is conducted with the multilayer-perceptron (MLP) + NeuroSim framework and compared with another capacitor-assisted hybrid precision cell (e.g., 3T1C + 2PCM). The design tradeoffs and scalability of the different implementations are discussed.
INDEX TERMS Benchmark, ferroelectric transistor (FeFET), in-memory computing, neural network, synaptic device.
I. INTRODUCTION
Intensive research and development efforts have recently focused on deep learning accelerators based on CMOS and beyond-CMOS technologies [1]-[6]. To meet the requirements of data-centric computation, a new computing paradigm, compute-in-memory (CIM), has been proposed [7]-[10]. Emerging memory devices for binary memory applications feature high integration density, fast read and write speed, nonvolatility, good data retention, and reasonable endurance. Analog synapses based on emerging nonvolatile memory (NVM) devices such as phase change memory (PCM) [11] and resistive random access memory (RRAM) [12] additionally offer many conductance states. These features make them promising candidates for CIM, at least for neural network inference, where read operations are intensive.
However, the high nonlinearity and asymmetry of the weight update curve (conductance versus the number of programming pulses) prevent PCM- or RRAM-based synapses from achieving high training accuracy, as suggested by device-algorithm cosimulations [13], [14]. To avoid this problem, a static RAM (SRAM)-based digital synapse with parallelized access has been introduced [9]. Although SRAM offers good training accuracy, it is volatile, so the weights must be loaded from off-chip NVM before inference, which costs time and energy. Capacitor-based analog synapses (e.g., the 3T1C cell [15]) offer good linearity and fast operation. However, they suffer from the same volatility problem and, in addition, have a small dynamic range.
By combining a volatile memory (e.g., a capacitor) with a nonvolatile storage device (e.g., PCM), the good linearity of the capacitor and the nonvolatility and high dynamic range of the NVM can both be leveraged. In general, the part of the weight with lower numerical significance is stored in the volatile capacitor for training because of its good linearity and fast write speed. This part is termed the low-significance weight (LSW), W_LSW. The part of the weight with higher numerical significance is stored in the NVM device, which guarantees that the majority of the weight information is kept when the power is turned OFF after training. Correspondingly, this part is termed the high-significance weight (HSW), W_HSW. A significance factor F is defined to indicate the numerical significance of the HSW. The total weight stored in a hybrid precision cell can be represented by F·W_HSW + W_LSW. For example, in an 8-bit weight, the first 4 bits from the left can be defined as the HSW and stored in the NVM device.
The remaining 4 bits can be defined as the LSW and stored in the capacitor. The significance factor is F = 16 in this case. After a certain number of training batches, the LSW is transferred to the HSW by reprogramming the NVM device. In this paper, F is an integer power of 2, i.e., 2, 4, 8, . . . This hybrid representation is well aligned with the observation that in typical deep learning algorithms, a relatively higher precision (larger than 6-bit) is necessary during training to accumulate the incremental weight change, while a lower precision (less than 2-bit) is sufficient during inference to achieve a reasonably good accuracy [16] for representative data sets including MNIST, CIFAR, and ImageNet.
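The 8-bit example above can be written out as a short sketch. The function names are hypothetical, and the weight is treated as an unsigned integer purely for illustration; the text does not specify a signed encoding at this point.

```python
def split_weight(total, f):
    """Split an unsigned integer weight into (HSW, LSW) for significance factor f."""
    if f < 2 or f & (f - 1):
        raise ValueError("f must be an integer power of 2")
    return total // f, total % f


def combine_weight(hsw, lsw, f):
    """Recombine the two parts: total = F * W_HSW + W_LSW."""
    return f * hsw + lsw


total = 0b10110110                   # example 8-bit weight
hsw, lsw = split_weight(total, 16)   # upper 4 bits -> NVM, lower 4 bits -> capacitor
restored = combine_weight(hsw, lsw, 16)
```

With F = 16, the upper nibble 1011 becomes the HSW and the lower nibble 0110 becomes the LSW, and recombining them recovers the original weight.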
In [17], a hybrid precision cell based on PCM and a capacitor is proposed, where the HSW is stored in two PCMs while the LSW is represented by one capacitor and three transistors. We call it the 3T1C + 2PCM synapse in this paper. The recent research progress in analog ferroelectric transistors (FeFETs) [18] enriches the family of hybrid precision synapses. In [19], a FeFET-based hybrid precision synapse cell is proposed. In this design, both the LSW and the HSW are stored as the channel conductance of the FeFET. However, the LSW is modulated by the gate voltage of the FeFET, and the HSW is modulated by the polarization states of the ferroelectric gate dielectric. Similar to the 3T1C + 2PCM synapse, the LSW here is volatile and the HSW is nonvolatile. Since there are two CMOS transistors for charging and discharging in this design, we call it the 2T1F synapse in this paper.
Although the device-level operation of the FeFET-based hybrid precision cell has been discussed and the algorithm-level training accuracy has been reported in our prior work at IEDM 2018 [19], the array-level operation and circuit-level benchmark have not yet been presented. In [19], we have shown that the online learning accuracy of the 2T1F synapse for the MNIST and CIFAR-10 data sets approaches the software accuracy. In this paper, we present the array design, synapse layout, and benchmark results by customizing our multilayer-perceptron (MLP) + NeuroSim simulator [20], [21] for the FeFET-based hybrid precision synapse. The 3T1C + 2PCM synapse is used as a baseline for comparison.
This paper is organized as follows. In Section II, the schematic of the 2T1F synapse is introduced and the device-level operation is briefly described. In Section III, the array-level design for the 2T1F synapse and the array-level operation are illustrated in detail. In Section IV, the methodology for the array-level benchmark is described, the benchmark results are presented, and the design tradeoffs and scalability are discussed. The conclusions are drawn in Section V.
II. DEVICE SCHEMATIC FOR HYBRID PRECISION SYNAPSE
A. PRINCIPLE OF 2T1F-BASED HYBRID PRECISION SYNAPSE
In the 2T1F design, the LSW is encoded into the gate voltage of the FeFET by charging and discharging its gate capacitor, which effectively tunes the channel conductance, assuming the FeFET is kept in the linear region. Step 1 shown in Fig. 1(a) corresponds to this mechanism. The HSW is programed by multi-domain polarization switching in the ferroelectric material (e.g., doped HfO2 such as Hf0.5Zr0.5O2) [18], which also tunes the channel conductance of the FeFET by changing the threshold voltage, corresponding to steps 2 and 3 shown in Fig. 1(a). The LSW is volatile: if the gate voltage is disturbed by leakage or power-off, the channel conductance set by the gate voltage diminishes and the LSW information is lost. The HSW, however, is nonvolatile since the polarization state remains even if the power is turned OFF. The number of polarization states of the FeFET is determined by the weight precision required for inference after training. For example, if a 2-bit weight is required for inference, four polarization states are needed for the FeFET.
B. SCHEMATIC OF FeFET-BASED HYBRID PRECISION SYNAPSE
The schematic of the 2T1F synapse is shown in Fig. 1(b). Two CMOS transistors are used as access transistors (AG1 and AG2) to program the LSW by modulating the gate voltage of the FeFET. The gate of the FeFET is connected to the drains of the two access transistors. To increase the LSW, AG1 is turned ON by applying low-voltage pulses on word line 1 (WL1) while AG2 is turned OFF. The gate node is then charged by current flowing through the PMOS from VDD. To decrease the LSW, AG2 is turned ON by applying high-voltage pulses on WL2 while AG1 is turned OFF. The gate node is then discharged by current flowing to ground through the NMOS.
To program the HSW, a positive bias is first applied to the substrate with the gate node grounded to reset the existing polarization in the ferroelectric material. A high positive programming voltage pulse (2-4 V) is then applied to the gate, with the substrate grounded, to reprogram the polarization to the desired level. The LSW stored at the gate node is discarded because the gate voltage is disturbed in this process. After the HSW write (the so-called weight transfer process), the gate voltage is charged to an intermediate level for the LSW.
To read the weight stored in the 2T1F synapse, a read voltage (e.g., 0.5 V) is applied at the bitline (BL) and the current is sensed by the periphery circuit (e.g., a current-mode sense amplifier) connected through the select line (SL). It should be noted that the weight read out is the total weight W_total, i.e., W_LSW + W_HSW. This is because in the 2T1F design both the LSW and the HSW are stored as the channel conductance of the FeFET, which is different from the 3T1C + 2PCM synapse to be illustrated next.
C. 3T1C + 2PCM SYNAPSE
In this paper, the 3T1C + 2PCM synapse is used as a baseline for comparison. Here, we briefly describe its schematic, as shown in Fig. 1(c). Similar to the 2T1F synapse design, the LSW is programed by charging or discharging the capacitor, which modulates the channel conductance of the CMOS transistor by tuning its gate voltage. The HSW is stored in a PCM pair G+ and G−, both of which are programed only with long-term potentiation (LTP) pulses; no long-term depression (LTD) pulses are used. The HSW is represented by the conductance difference between G+ and G−, i.e., G+ − G−. To increase the HSW, G+ is programed to a higher conductance level by applying LTP pulses. To decrease the HSW, G− is programed to a higher conductance level, and therefore, G+ − G− gets smaller.
Weight transfer from the LSW to the HSW is required to prevent the capacitor from operating in the nonlinear region. The LSW W_LSW is first read out and converted to HSW by W_HSW = W_LSW/F. Then, the PCM pairs are programed to the desired level. The LSW will be reprogramed to an intermediate level after weight transfer.
Different from the 2T1F synapse, here, the LSW and HSW are physically separate. Within the HSW, the weight is stored in G+ and G− separately. Therefore, a three-step read-out is conducted: the HSW is read out first (in two steps, one for G+ and one for G−), multiplied by F, and then added to the LSW. It will be described in detail in Section III-A.
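The differential HSW encoding described above can be sketched as follows. The class name, the integer conductance levels, and the clipping at the top level are all assumptions made for illustration; the text does not describe how saturation of either PCM is handled.

```python
class PcmPair:
    """HSW stored as G+ - G-, using only conductance-increasing (LTP) pulses."""

    def __init__(self, levels=16):
        self.levels = levels      # assumed number of conductance states per PCM
        self.g_plus = 0
        self.g_minus = 0

    @property
    def hsw(self):
        return self.g_plus - self.g_minus

    def update(self, delta):
        """Apply |delta| LTP pulses to G+ (delta > 0) or to G- (delta < 0)."""
        if delta > 0:
            self.g_plus = min(self.g_plus + delta, self.levels - 1)
        elif delta < 0:
            self.g_minus = min(self.g_minus - delta, self.levels - 1)


pair = PcmPair()
pair.update(3)     # raise G+ by three levels
pair.update(-5)    # raise G- by five levels, which lowers the HSW
```

Both conductances only ever increase in this scheme, so the signed HSW is obtained entirely from their difference.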
III. ARRAY ARCHITECTURE DESIGN FOR 2T1F SYNAPSE
A. ARRAY DESIGN FOR 2T1F SYNAPSE
In Section II, the schematic of the 2T1F synapse cell is described in detail. However, this schematic only enables device-level operation. When the 2T1F synapses are arranged into an array, the synapses in the same row share the same WL1 and WL2. For example, if WL1 is turned ON, the LSW of all the synapses in the same row will be programed to a higher level. The same holds for the 3T1C + 2PCM schematic. To solve this problem and enable independent control of different columns (for different weight update values), two additional CMOS transistors are introduced into each synapse to control the access to VDD and GND, respectively. These two transistors are referred to as power gates (PGs) in this paper, and their gate control lines are termed power lines PL1 and PL2, respectively.
The array design with 2T1F synapse is shown in Fig. 2(a) , where the synapses with PG are shown in a zoomed-in inset. In this design, both PGs and access gates are I/O transistors because high voltage (2-4 V) is delivered to program the FeFET gate for multi-domain polarization switching.
In the synaptic array, the WLs and BLs run in the horizontal direction and are controlled by the WL and BL switch matrices, respectively. The PLs and SLs run in the vertical direction and are controlled by the PL and SL switch matrices, respectively. The PLs and WLs form a crossbar switch for each individual synapse and thus enable individual programming. To keep the schematic succinct, WL1s and WL2s are merged into one WL in this figure. The same applies to PL1s and PL2s. The substrate is split between rows, and the body contact (BC) of the FeFET synapses in a row is controlled by the BC switch matrix.
A reference column is incorporated at the right side of the synapse array, with synapses finely tuned to an intermediate conductance level to represent the reference ''0.'' With the reference column, the weight of each individual synapse is obtained by subtracting the reference weight. Therefore, both positive and negative weights can be represented, as required by typical neural network algorithms.
In the periphery circuitry, a mux is used to select the column to be read out since the periphery circuit is shared among columns. The flash ADC (implemented by multi-level sense amplifiers with different references) converts the analog current to a digital number. The subtractors, adders, and shift registers are arranged in sequence to operate as digital neurons. The transmission gates inside the mux and the switch matrices are analog so that they can pass high current.
For the 3T1C + 2PCM synapse, the array design is shown in Fig. 2(b). In this paper, it is assumed that the LSW and HSW are first converted to digital values by the ADC and then added up to get the total weight, although in [17] they are first summed in an analog manner and then converted to digital values. The WLs of the 3T1C cells and the WLs of the PCM cells are controlled by separate switch matrices, i.e., the 3T1C WL switch matrix and the PCM WL switch matrix, respectively. Besides, a 2-1 mux is used to select between the reference weights (for LSW read-out) and the HSW value read out previously (for HSW read-out), since the 3T1C + 2PCM synapse requires a three-step read. The details of the array operations are described in Section III-B.
B. ARRAY OPERATION FOR 2T1F SYNAPSE
Following the illustration of array design, the array-level operations for 2T1F-based synapse array are described here. In general, for online training, three operation modes are needed: partial sum read, LSW write, and HSW write.
1) PARTIAL SUM READ
A parallel read-out scheme is applied here to maximize the read speed. The steps are illustrated as follows.
1) The BL is biased at the read voltage V_read if the row input is ''1,'' and the BL is grounded if the row input is ''0.''
2) The partial sum current I_psum of the selected column flows to the periphery circuit through the SL.
3) The analog partial sum current I_psum is converted to a digital value D_psum by the ADC.
4) In parallel with steps 1)-3), the digital value of the reference column, D_ref, is obtained by a similar process.
5) The partial sum is obtained by subtracting the reference value D_ref from D_psum in the subtractor.
6) Shift the resulting partial sum to the significance of the current input bit and store it.
7) For multi-bit input, move to the next input bit and repeat 1)-6). Then, add the result to the previous partial sum.
8) Repeat the above process until the partial sums of all columns are read out, since the mux is used for time multiplexing.
For the 3T1C + 2PCM cell, the partial sum read operation takes three steps. First, a two-step read is applied to the partial sum corresponding to the HSW (HSW Psum). In this process, the partial sum corresponding to the G+ synapses (G+ Psum) is read out first, and the partial sum corresponding to the G− synapses (G− Psum) is then subtracted from it. Before subtraction, the 2-1 mux forwards the G+ Psum to the subtractor. Then, the HSW Psum is shifted left by log2(F) bits and stored in the register. In this paper, the significance factor F is chosen to be an integer power of 2 so that the shift amount is an integer. This also eliminates the need for a multiplier to multiply the HSW Psum by F. The total partial sum is obtained by reading out the partial sum corresponding to the LSW (LSW Psum) and adding it to the HSW Psum. In the LSW Psum read, the 2-1 mux forwards the partial sum of the reference column to the subtractor.
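The read sequence above can be sketched behaviorally as follows. This is an idealized illustration rather than the NeuroSim implementation: conductances are plain integers with V_read folded in, the ADC is ideal, and all function and parameter names are hypothetical.

```python
def adc(current, lsb=1.0, bits=8):
    """Ideal ADC: quantize a current to an unsigned code, clipping at full scale."""
    code = int(round(current / lsb))
    return max(0, min(code, 2 ** bits - 1))


def partial_sum_read(inputs, column, reference, input_bits, lsb=1.0):
    """Bit-serial read of one column with reference-column subtraction.

    inputs    : unsigned integer input per row
    column    : cell conductance per row (arbitrary units)
    reference : reference-cell conductance per row (represents weight 0)
    """
    acc = 0
    for b in range(input_bits):                       # step 7: next input bit
        active = [(x >> b) & 1 for x in inputs]       # step 1: BL at V_read or ground
        i_psum = sum(g for g, on in zip(column, active) if on)     # step 2
        i_ref = sum(g for g, on in zip(reference, active) if on)   # step 4
        d_psum, d_ref = adc(i_psum, lsb), adc(i_ref, lsb)          # steps 3 and 4
        acc += (d_psum - d_ref) << b                  # steps 5-7: subtract, shift, add
    return acc


def three_step_psum(gp_psum, gm_psum, lsw_psum, ref_psum, f):
    """3T1C + 2PCM read: (G+ Psum - G- Psum) shifted by log2(F), plus LSW Psum."""
    shift = f.bit_length() - 1                        # log2(F) for a power-of-2 F
    return ((gp_psum - gm_psum) << shift) + (lsw_psum - ref_psum)


# Signed weights relative to a reference of 4 are [1, -2, 3].
psum = partial_sum_read([3, 1, 2], [5, 2, 7], [4, 4, 4], input_bits=2)
total = three_step_psum(9, 4, 20, 16, f=4)
```

In this toy example the bit-serial result equals the integer dot product of the inputs with the signed weights, which is what the reference subtraction is meant to achieve.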
2) WRITE LSW
In online training, the frequent weight update is applied only to the LSW part because of its low write latency. The row-by-row operation of the LSW write can be described as follows.
1) Turn ON the PGs of the selected FeFET cells.
2) If the weight update ΔW > 0, charging pulses are applied to AG1 through WL1 to charge the gate node. If ΔW < 0, discharging pulses are applied to AG2 through WL2 to discharge the gate node.
Different numbers of charging or discharging pulses can be applied to different columns because the PGs provide independent control. For the 2T1F synapses to be programed, the PGs are turned ON so that current can flow into or out of the gate node when pulses are applied at the WLs. For the 2T1F synapses not to be programed, the PGs are turned OFF so that no current flows into or out of the gate node. The LSW write operation in the 3T1C + 2PCM cell can be conducted in the same way.
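A behavioral sketch of one row of this LSW write is given below. It assumes, for illustration only, that each pulse changes the LSW by one level, that a column's PG is kept ON for exactly as many shared WL pulses as its update requires, and that the LSW range is 0 to 15; none of these specifics are stated in the text.

```python
def write_lsw_row(row_lsw, delta_w, lsw_min=0, lsw_max=15):
    """Update the LSW of every cell in one row.

    row_lsw : current LSW level per column
    delta_w : signed number of pulses requested per column
    """
    lsw = list(row_lsw)
    n_up = max([d for d in delta_w if d > 0], default=0)
    n_down = max([-d for d in delta_w if d < 0], default=0)

    for k in range(n_up):                  # charging pulses on the shared WL1
        for c, d in enumerate(delta_w):
            if d > k:                      # PG of column c is ON for its first d pulses
                lsw[c] = min(lsw[c] + 1, lsw_max)

    for k in range(n_down):                # discharging pulses on the shared WL2
        for c, d in enumerate(delta_w):
            if -d > k:
                lsw[c] = max(lsw[c] - 1, lsw_min)
    return lsw


new_row = write_lsw_row([8, 8, 8, 8], [2, 0, -3, 1])
```

Cells with a zero update keep their PGs OFF throughout and are therefore unaffected by the pulses on the shared word lines.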
3) WRITE HSW (WEIGHT TRANSFER)
After a certain number of training batches, the LSW is transferred to the HSW for the following two reasons: 1) to prevent the FeFET from operating in the nonlinear region due to excess voltage on the gate node and 2) to store the LSW before the gate voltage leaks away due to the OFF-state leakage current. The operation of weight transfer is described as follows.
1) Read the 2T1F synapse to get the total weight stored, denoted W. The read process is row-by-row here because the weight in each individual synapse is needed.
2) Calculate the new HSW by W_HSW = W/F, where F is the significance factor of the HSW.
3) Erase the existing polarization state by applying a positive body bias to the selected rows while keeping the gate nodes grounded by turning ON the two NMOS transistors PG2 and AG2. For the unselected rows, the body is kept grounded for inhibition. Fig. 3(a) shows the row-by-row erase scheme. The body of each row is separated so that a different body bias can be applied. It should be noted that this process changes the gate voltage and therefore disturbs the LSW.
4) Reprogram the HSW to its desired level through multi-domain polarization switching, as shown in Fig. 3(b). For the selected rows, AG1 and PG1 are turned ON to apply a high programming voltage at the gate node while the body across the array is grounded. For the unselected rows, AG1 and AG2 are turned OFF.
5) After programming the HSW, the gate voltage is precharged to an intermediate level, and therefore, the LSW is discarded.
Weight transfer in the 3T1C + 2PCM synapse-based array is slightly different. First, only the LSW W_LSW is read out. The amount of weight to be transferred to the HSW is calculated by W_HSW = W_LSW/F. If W_HSW > 0, the G+ PCM cell is programed to a higher level by turning ON the WL of its access gate and applying write pulses at its BL. Otherwise, if W_HSW < 0, the G− PCM cell is programed to a higher level, and thus, the HSW is reduced. Similarly, the LSW is discarded by programming the capacitor voltage to an intermediate level after weight transfer.
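The arithmetic of the two transfer schemes can be sketched as follows. Several details are assumptions made only for illustration: the LSW is modeled as a signed offset from the intermediate level, the division W/F is rounded to the nearest integer, and the HSW is clipped to a hypothetical number of polarization states. The text specifies only W_HSW = W/F and the reset of the LSW.

```python
def transfer_2t1f(hsw, lsw, f, hsw_levels=4):
    """Weight transfer for 2T1F; returns (new_hsw, new_lsw).

    lsw is a signed offset from the intermediate gate-voltage level (0 = intermediate).
    """
    total = f * hsw + lsw                            # step 1: read total weight W
    new_hsw = int(round(total / f))                  # step 2: W_HSW = W / F (rounding assumed)
    new_hsw = max(0, min(new_hsw, hsw_levels - 1))   # limited number of polarization states
    return new_hsw, 0                                # steps 3-5: erase, reprogram, reset LSW


def transfer_3t1c_2pcm(g_plus, g_minus, lsw, f):
    """Weight transfer for 3T1C + 2PCM; returns (g_plus, g_minus, new_lsw)."""
    delta = int(round(lsw / f))                      # amount moved to the HSW
    if delta > 0:
        g_plus += delta                              # LTP pulses on G+
    elif delta < 0:
        g_minus += -delta                            # LTP pulses on G-
    return g_plus, g_minus, 0                        # capacitor back to intermediate level


up = transfer_2t1f(hsw=1, lsw=11, f=16)
down = transfer_3t1c_2pcm(g_plus=3, g_minus=1, lsw=-9, f=4)
```

In both cases the part of the LSW that does not amount to a whole HSW step is dropped when the LSW is reset, which is consistent with the statement that the LSW is discarded after the transfer.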
IV. BENCHMARK RESULTS AND DISCUSSION

A. DESCRIPTION OF BENCHMARK METHOD
The array-level benchmark is conducted with NeuroSim, a simulator developed to evaluate the performance of neural network hardware accelerators with emerging device technologies. The detailed methodology for the area, energy, and latency estimation in NeuroSim can be found in [20] and [21]. To estimate the area of the synapse array, the layout of a FeFET-based synapse is designed, as shown in Fig. 4(a). As mentioned previously, I/O transistors are used for the PGs and access gates to pass the high programming voltage needed to write the HSW. The widths of the I/O NMOS and I/O PMOS are 400 nm and 1.12 µm, respectively, in TSMC 65-nm technology. The size of the FeFET is L = 2 µm and W = 4 µm. The FeFET is relatively large for three reasons: 1) at present, multi-domain polarization switching has only been demonstrated at the micrometer scale [18]; 2) a large gate capacitance (∼100 fF) is needed so that an LSW increment ΔV of about 20 mV can be obtained; and 3) a large gate capacitance alleviates the gate-to-drain coupling at the FeFET gate node caused by the large C_gd of the I/O transistors. The total area of a 2T1F synapse cell is about 31.86 µm² in our design. Fig. 3(b) shows the way that the 2T1F synapses are aligned in an array. For the 3T1C + 2PCM synapse, it is assumed that a MOS capacitor with the same capacitance is used for the LSW part. Therefore, the area of the MOS capacitor is the same.
A two-layer 400 × 200 × 10 MLP neural network is used to estimate the online training accuracy for the MNIST data set. The software baseline is 96% with 1M training images. Weight transfer is performed every 8000 images. Therefore, the total number of write operations through polarization switching for each 2T1F synapse is 125, which is only about 0.12%-1.2% of the typical 10^4-10^5 endurance of a FeFET [22].
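Two quick calculations help put these numbers in context: the charge that must be moved per LSW step for a 100-fF gate capacitance and a 20-mV increment, and the 2T1F cell area expressed in units of the squared feature size. The sketch assumes the feature size equals the 65-nm node name, and the helper names are hypothetical.

```python
def charge_per_step(c_gate_farad, delta_v_volt):
    """Charge moved onto the gate node per LSW step, Q = C * dV."""
    return c_gate_farad * delta_v_volt


def area_in_f2(cell_area_um2, node_nm):
    """Cell area normalized to the square of the feature size (assumed = node)."""
    f_um = node_nm / 1000.0
    return cell_area_um2 / (f_um ** 2)


q_step = charge_per_step(100e-15, 20e-3)   # about 2 fC per LSW step
cell_f2 = area_in_f2(31.86, 65)            # roughly 7.5e3 squared feature sizes
```

Under these assumptions the cell occupies several thousand squared feature sizes at the 65-nm node.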
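The endurance estimate above follows from simple arithmetic, sketched here with a hypothetical helper name:

```python
def endurance_usage(images, transfer_interval, endurance_cycles):
    """Return (number of HSW writes, fraction of endurance consumed)."""
    writes = images // transfer_interval
    return writes, writes / endurance_cycles


writes, frac_low = endurance_usage(1_000_000, 8_000, 10**4)
_, frac_high = endurance_usage(1_000_000, 8_000, 10**5)
```

This gives 125 writes per synapse, i.e., 1.25% of a 10^4-cycle endurance and 0.125% of a 10^5-cycle endurance, in line with the range quoted in the text.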
The device parameters used in the simulation are listed in Table 1. For the 3T1C + 2PCM synapse, the nonlinearity factor of the capacitor is assumed to be 0.2 and −0.2 for LTP and LTD, respectively, obtained from [15], and it is 0.105 (LTP only) for the PCM device, assumed from [23]. For the 2T1F synapse, the nonlinearity is assumed to be 0.5/0.5 according to [19]. It should be noted that the nonlinearity is defined for the entire 2T1F synapse because both the HSW and LSW are stored as the channel conductance of the FeFET.
More details about the definition of the nonlinearity factor and variation can be found in [20]. The technology node is 65 nm in our benchmark, before the scalability analysis discussed in Section IV-C. The pitch of the interconnect wires is assumed to be 200 nm in the simulation at the 65-nm node. For the periphery circuit, the ADC precision is assumed to be 8 bits so that it is not the bottleneck for learning accuracy. The impact of ADC precision on learning accuracy can be found in [10].
To make a fair comparison, the number of bits used in the training process for each hybrid precision synapse is set to 6. For the 3T1C + 2PCM synapse, the significance factor F is 4. The numbers of HSW and LSW bits are 4 (16 conductance states) and 5, respectively. Therefore, 3 bits overlap between the HSW and LSW, which makes the total number of bits 6. For the FeFET-based hybrid precision synapse, 2 bits are stored by the multilevel domain switching as the HSW, and its significance factor is 16.
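The bit accounting above can be checked with a small sketch. The helper name is hypothetical, and the LSW bit count used for the 2T1F case (4) is an assumption, since it is not stated here; it is simply the smallest value that fills a 6-bit range without overlap.

```python
def bit_layout(hsw_bits, lsw_bits, f):
    """Return (total_bits, overlapping_bits) for a hybrid precision weight.

    The HSW occupies bit positions log2(F) .. log2(F) + hsw_bits - 1,
    and the LSW occupies positions 0 .. lsw_bits - 1.
    """
    shift = f.bit_length() - 1                 # log2(F) for a power-of-2 F
    total_bits = max(lsw_bits, shift + hsw_bits)
    overlap = max(0, lsw_bits - shift)
    return total_bits, overlap


layout_pcm = bit_layout(hsw_bits=4, lsw_bits=5, f=4)     # 3T1C + 2PCM
layout_fefet = bit_layout(hsw_bits=2, lsw_bits=4, f=16)  # 2T1F, LSW bits assumed
```

For the 3T1C + 2PCM settings this reproduces the 6 total bits with 3 overlapping bits given in the text.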
B. BENCHMARK RESULTS AND DISCUSSION
The benchmark results for online training with 1M images are shown in Table 2. The performance of synapses based on GST PCM [23], HZO FeFET [18], and TaOx/HfOx RRAM [12] alone is also presented here for comparison.
In terms of online training accuracy, the two hybrid precision synapses show better accuracy than synapses based on PCM, FeFET, or RRAM alone because of their improved linearity and larger number of conductance states. The slightly lower online training accuracy of the 3T1C + 2PCM synapse can be explained by its smaller read margin, which results from the high interconnect resistance due to the large cell size and from the low ON-state resistance (R_ON) of PCM. However, the 3T1C + 2PCM device can flexibly adjust the number of bits stored in a synapse by changing the significance factor F and adjusting the number of conductance levels in the PCM cells. For the 2T1F-based synapse, it is hard to increase the number of synapse bits because of the limited range of dynamic gate voltage that keeps the FeFET in the linear region.
The area cost of the hybrid precision synapses is increased significantly because of the more complex cell structure, indicating that there is a design tradeoff between accuracy and area cost. RRAM and single FeFET-based synapses show a small area cost as they have a relatively larger ON-state resistance (R_ON) than PCM, and thus the peripheral analog mux area can be minimized. RRAM shows a slightly larger area cost than the FeFET-based synapse because of its lower R_ON.
For latency, in both the 3T1C + 2PCM and 2T1F synapses, weight transfer contributes significantly to the total latency. For 3T1C + 2PCM, both the weight increase and decrease for the HSW are conducted by set pulses at the G+ or G− cell, respectively. The relatively long set pulsewidth (6 µs) leads to a long weight transfer time. For the 2T1F synapse, the long programming pulsewidth (3 µs) for multi-domain polarization switching contributes to its long transfer time. However, compared with synapses based on PCM, FeFET, or RRAM alone, the total latency of the hybrid precision cells is reduced significantly because of the short programming pulsewidth (300 ps) of the capacitor.
For energy consumption, the 2T1F cell shows lower total energy consumption than the 3T1C + 2PCM cell because of its lower energy consumption during weight transfer. This can be explained by the fact that multi-domain polarization switching is driven by the electric field, so the programming current is extremely low. In contrast, PCM programming relies on a Joule heating-assisted crystalline phase change, which needs a high current to generate enough Joule heat to induce the phase change.
For leakage, both hybrid precision synapses show high leakage due to the subthreshold leakage at the PGs. However, the leakage power of the 2T1F synapse is higher due to the larger width of the PGs and access gates, which are I/O transistors.
C. PROSPECT OF SCALING
The area cost of a 2T1F synapse is high according to the benchmark results. The major limitation is the coupling between the drain capacitance of the pull-up pFET and pull-down nFET and the gate capacitance of the FeFET during charging/discharging. Since I/O transistors are used for the pFET and nFET to deliver the high programming voltage, the gate capacitance of the FeFET must be much larger than that of the I/O transistors to obtain effective charging/discharging on the FeFET (instead of on the I/O transistors). Therefore, to scale down the size of the 2T1F device, it is critical to reduce the programming voltage of the FeFET and thus reduce the size of the PGs and AGs. As a side effect, the dynamic gate voltage range will shrink, and thus the voltage step between consecutive LSW states needs to be reduced, e.g., to 10 mV.
Recently, a programming voltage as low as 1.8 V has been experimentally demonstrated for FeFETs by engineering the device structure and by adding a middle metal layer between the ferroelectric capacitor and the MOS capacitor [25]. Therefore, I/O transistors at the 32-nm node can be used to support the 1.8-V programming voltage. The gate length of I/O transistors at the 32-nm node is about 150 nm, and the width is about 300 nm. As C_gd of the I/O transistor is reduced, the size of the FeFET can be scaled down to L = 0.56 µm and W = 1.13 µm, assuming that the ratio between C_g and C_gd remains the same. If the programming voltage of the FeFET can be further scaled down to 1-1.2 V, I/O transistors at the 14-nm node could possibly be used.
The projections of synapse size in terms of feature size F and of total chip area are plotted in Fig. 5(a) and (b), respectively. For the 2T1F-based synapse, both the synapse size and the chip area scale down, while for the RRAM and FeFET synapses, only the chip area scales down as the cell size remains nearly the same in terms of F². For the RRAM device at the 14-nm node, the cell size increases slightly to 24F² to guarantee that the ON-resistance of the transistor is about 10% of the ON-state resistance (R_ON) of the RRAM. However, even at the 14-nm node, the cell size of the 2T1F synapse is still much larger than that of the RRAM or FeFET synapse.
The scaling trends for latency, energy consumption, and leakage are shown in Fig. 6. No significant latency scaling is observed for any of these three synapses as the technology node scales down. This can be attributed to the fact that latency is limited by the write pulsewidth of the memory cell rather than by the speed of the periphery circuits. There is no evidence that the memory switching speed will scale with the memory cell size, as FeFET switching is field driven and RRAM switching is filamentary.
The energy consumption for all of these three synapses shows significant reduction as the technology node scales from 65 to 14 nm because of less energy consumption in the periphery circuit and the scaling of write voltage for FeFET-based synapses. The leakage power for the 2T1F device is reduced significantly at the 14-nm node because of smaller I/O transistor width and lower VDD.
V. CONCLUSION
In this paper, the FeFET-based hybrid precision cell using multi-domain polarization switching is introduced, and its array design is presented. The array-level benchmark is conducted with the MLP + NeuroSim framework for both the 2T1F and 3T1C + 2PCM synapses. The results show that both synapses can improve the online learning accuracy because of their improved device linearity and extended dynamic range. However, the area cost is increased significantly because multilevel switching in FeFETs has so far been reported only at the micrometer scale and I/O transistors are needed to pass the high programming voltage for domain switching. It is expected that the area and performance of the 2T1F design could be further improved once multilevel switching at the sub-50-nm scale is proven. This is feasible as only a 2-bit HSW is needed from the multi-domain polarization switching in the ferroelectric gate material. Another critical factor for scaling is to reduce the programming voltage of the FeFET so that I/O transistors at smaller technology nodes can be used.
