The implementation of highly anticipated hardware neural networks (HNNs) hinges largely on the successful development of a low-power, high-density, and reliable analog electronic synaptic array. In this study, we demonstrate a two-layer Ta/TaOx/TiO2/Ti cross-point synaptic array that emulates the high-density three-dimensional network architecture of human brains. Excellent uniformity and reproducibility among intralayer and interlayer cells were realized. Moreover, at least 50 analog synaptic weight states could be precisely controlled with minimal drifting during a cycling endurance test of 5000 training pulses at an operating voltage of 3 V. We also propose a new state-independent bipolar-pulse-training scheme to improve the linearity of weight updates. The improved linearity considerably enhances the fault tolerance of HNNs, thus improving the training accuracy.
Introduction
Neuromorphic computing aims to develop a more efficient computing architecture that mimics the biological neural networks (NNs) present in the neocortex of the mammalian brain [1]. Inspired by the structure of biological brains, computer scientists have successfully built adaptive, massively parallel, fault-tolerant artificial NNs on high-performance supercomputers. Despite the great technological achievements in artificial intelligence systems such as IBM Watson [2], Facebook DeepFace [3], and Google DeepMind AlphaGo [4], software-based neuromorphic computing currently faces several fundamental limitations, including scalability in power consumption, form factor, and cost [5]; this is because such outstanding computing capabilities rely heavily on centralized, expensive data centers. Developing hardware NNs (HNNs) on integrated circuits with sufficient scalability in power, form factor, and cost would open up unprecedented applications of ubiquitous and distributed neuromorphic computing. Neural systems process and store information simultaneously by modulating the connection strength, termed synaptic weight, in a highly parallel network. The biological NN in the 4 mm thick neocortex of a typical human brain contains approximately 10 billion neurons and 100 trillion synapses [6] in a surface area of 0.24 m² [7]. On average, a neuron three-dimensionally connects to another 10^4 neurons [8] through two-terminal synapses. The synaptic weight is modulated by ions through a series of chemical reactions, triggered by either potentiating or depressing neural signals following appropriate biological processes such as spike-timing-dependent plasticity [9]. The lack of an ideal synaptic device in the present Si CMOS technology is the major obstacle to realizing HNNs, and this has inspired active research on low-power, analog electronic synapses that might emulate both the functionality and two-terminal structure of a biological synapse [5].
Among various electronic synapse candidates, resistive random access memory (RRAM) is one of the most promising. An RRAM-based synapse [10–17], resembling a biological synapse, is a two-terminal, nonvolatile resistor whose analog conductance can be modulated through complex ion-controlled mechanisms depending on neural signal strength levels. Ultralow pJ energy consumption levels per synaptic operation [12], as well as the possibility of realizing three-dimensional (3D) HNNs [14], have also been demonstrated recently. An RRAM-based HNN utilizes its crossbar array architecture to compute matrix-vector multiplication in parallel in learning algorithms such as convolutional NNs [18] and deep Boltzmann machines [19]. Furthermore, all intermediate values and weights in the network can be stored on chip in dense and compact arrays, which overcomes the input-output communication bottlenecks of large software-based NNs. Numerous neuromorphic applications such as pattern and audio recognition have been proposed [12, 13, 17]. Recently, we reported a series of experimental and theoretical studies of analog synaptic features in the Ta/TaOx/TiO2/Ti RRAM devices [20–25]. Because this bilayer device relies on a nonfilamentary conduction mechanism, which is drastically different from the conventional filamentary conduction mechanism [10–16], it can overcome several limitations of conventional filamentary synaptic devices and shows promising properties for HNNs, including concurrent inhibitory and excitatory synaptic plasticity, low synaptic conductance with minimal fluctuation, and ultralow energy consumption comparable to that of a biological synapse (<10 fJ/spike) [23]. A high-density and high-connectivity 3D cross-point synaptic array architecture was also proposed [23]. However, the characteristics of 3D synaptic arrays have yet to be evaluated.
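The parallel matrix-vector multiplication performed by a crossbar array, mentioned above, can be illustrated with a minimal sketch. It assumes an ideal array (no wire resistance, no sneak-path currents), and the conductance and voltage values are hypothetical, chosen only to fall near the 80 to 130 nS range discussed in this paper. The function name `crossbar_mvm` is likewise illustrative.

```python
import numpy as np

def crossbar_mvm(conductance, voltages):
    """Ideal crossbar read-out: column current I_j = sum_i G[i, j] * V[i].

    Each cell contributes G * V (Ohm's law) and the currents of all cells
    sharing a column line add up (Kirchhoff's current law).
    """
    return voltages @ conductance

# Hypothetical 3 x 2 array: rows are input lines, columns are output lines.
G = np.array([[80e-9, 100e-9],
              [120e-9, 90e-9],
              [130e-9, 85e-9]])   # conductances in siemens
V = np.array([0.1, 0.2, 0.0])     # input voltages in volts

I = crossbar_mvm(G, V)            # output currents in amperes
print(I)
```

With these assumed values the two column currents come out at about 32 nA and 28 nA, and every output is obtained in a single read step, which is the source of the parallelism.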
Another critical parameter of synaptic devices is the linearity of synaptic weight updates, which characterizes the linear response of cell conductance to consecutive potentiation or depression inputs. In software implementation, a linear relation is typically assumed, but experimental synaptic devices exhibit highly nonlinear characteristics [12, 14–17, 23], which have been demonstrated to impair the learning accuracy of RRAM-based HNNs [26–29].
To facilitate future high-density HNN applications, we fabricated a 3D two-layer cross-point synaptic array prototype based on an analog Ta/TaOx/TiO2/Ti device. The 3D array exhibits uniform and reproducible synaptic characteristics among intralayer and interlayer cells. At least 50 analog synaptic weight states could be precisely controlled with minimal drifting during a cycling endurance test of 5000 training pulses. We further examined the training voltage amplitude and pulse time dependence and noted an exponential voltage-time (V-t) dependence that leads to a design trade-off between power and speed. Moreover, to eliminate the undesirable nonlinear characteristics of the weight update associated with the conventional unipolar pulse-training scheme (UP-scheme), we propose a new bipolar pulse-training scheme (BP-scheme) that applies an additional offset pulse with opposite polarity to eliminate the rapid conductance change at the first training pulse. Compared with the UP-scheme, the BP-scheme demonstrated higher accuracy in our pattern training simulation because of the system's improved fault tolerance.
Experimental details
An optical microscope image of the fabricated 3D two-layer Ta/TaOx/TiO2/Ti cross-point synaptic array is shown in figure 1(a). This array resembles the 3D neural connection in the human brain, as illustrated in figure 1(b). Figure 1(c) illustrates a cross-sectional schematic of a 3D cross-point array involving two Ta/TaOx/TiO2/Ti synapses (top cell and bottom cell) located on the sidewall of a multilayer pillar structure, as indicated in figure 1(d). A four-mask process flow was utilized in the fabrication. First, a multilayer stack of SiO2/Ti/SiO2/Ti/SiO2 was deposited on a Si wafer. Each layer of SiO2 and Ti was 100 nm thick. SiO2 was deposited by plasma-enhanced chemical vapor deposition; Ti was deposited by dc magnetron sputtering using a Ti (99.99%) target. A pillar structure was patterned on the multilayer stack by photolithography and dry etching using HBr and CCl4. After the pillar formation, only the sidewalls of the Ti layers were exposed to serve as two separate horizontal electrodes that were clad with SiO2 isolation layers. Subsequently, a 30 nm thick TiO2 layer and a 10 nm thick TaOx layer were deposited sequentially by reactive dc magnetron sputtering using Ti (99.99%) and Ta (99.95%) targets, respectively. The ratios of Ar:O2 gas flow during sputtering were 2:1 for TiO2 and 20:3 for TaOx. A 250 nm thick Ta layer was deposited on the pillar sidewall by dc magnetron sputtering using a Ta (99.95%) target, and the Ta vertical electrodes were patterned through photolithography and liftoff processes. The effective device area at the cross point of the Ti horizontal electrode and the Ta vertical electrode at the pillar sidewall was 20 μm². Finally, the contacts to the bottom and top Ti horizontal electrodes were patterned and etched using two additional masks. Electrical measurements were performed using an Agilent B1530A waveform generator/fast measurement unit at room temperature.
A bias voltage was applied on Ta while the Ti horizontal electrode was grounded. Figures 2(a) and (b) show the synaptic potentiation and depression characteristics, determined by measuring the device conductance using a read pulse at −1.5 V for 1 ms, from the top and bottom cells, respectively. Starting from the same initial conductance of 80 nS, the device exhibited a total synaptic plasticity (i.e., range of conductance change) of approximately 50 nS within one cycle of synaptic operation. Here, one cycle of synaptic operation comprised 100 training pulses: 50 consecutive potentiation pulses (P-pulses) at 3 V for 5 ms followed by another 50 consecutive depression pulses (D-pulses) at −3 V for 5 ms, where the increment (or decrement) of the device conductance is referred to as potentiation (or depression). The results indicate that 50 analog synaptic weight states can be precisely controlled (trained) according to P/D-pulse inputs. The cell conductance returned to its initial value after equal numbers of P-pulses and D-pulses were received. This device shows concurrent inhibitory and excitatory synaptic plasticity with minimal fluctuation, which is inherently different from the conventional filamentary synapses [10–16]. The cell conductance of the filamentary synapses is modulated by connecting and rupturing a small number of defects in a localized nanoconstriction gap [30, 31]. The conductance modulation in our device can be explained by the homogeneous barrier modulation of TaOx induced by uniform oxygen ion migration under a bipolar electric field [20–24]. Because the current conduction is homogeneous, the device is less susceptible to the localized randomness of ion migration. The cell conductance scales proportionally with the device area [21, 22], which implies that a high-density synaptic array with low cell conductance (i.e., low power consumption) can be realized by scaling the device area.
Furthermore, similar synaptic characteristics were observed in both the top and bottom cells, suggesting that the density of the synaptic array can be significantly increased by stacking more Ti layers in the vertical direction. The memory industry has mastered a similar vertical integration scheme in 3D NAND Flash memory with more than 48 stacking layers [32]. Additionally, the device demonstrated excellent endurance during repeated synaptic operating cycles with a total of 5000 training pulses (figure 2(c)). An RRAM memory cell with a film stack similar to this one has been shown to endure more than 10^12 programming cycles [20–22], one of the highest endurance values ever reported for RRAM. Stable cycling endurance with minimal variation is essential for realizing reliable synaptic weight updates in an HNN.
Results and discussion
The long pulse width (measured in milliseconds) for weight updates in this synaptic device is comparable to that in biological NNs. The human brain achieves highly efficient, low-power computation through parallelism even with relatively slow, millisecond-long neural spikes and an average frequency of 10-20 Hz [5]. The low frequency of neural spikes is the outcome of evolution, and it is optimized for the finite metabolic rate that a living organism can support. For example, the human brain has a power budget of only 10 W. By contrast, the power constraint in artificial neuromorphic hardware is application dependent. Many neuromorphic applications with a power budget greater than 10 W may benefit from a shorter P/D-pulse for faster computation. In RRAM devices, a faster programming speed can be achieved using a higher programming voltage (i.e., higher power, assuming the same cell conductance); this is termed the voltage-time (V-t) dilemma [33, 34]. Figure 3 shows the exponential V-t dependence of the device, obtained by varying the pulse width and pulse amplitude while maintaining a synaptic plasticity similar to that in figure 2. A pulse width as short as 50 μs at 9 V or a pulse amplitude as low as 1.5 V for 100 ms can achieve comparable characteristics. The exponential V-t relation might be explained by ion migration over local potential wells [24, 33]; it serves as a useful design guideline for the trade-off between speed and power in HNNs. Notably, the energy consumption per synaptic spike (i.e., V² × t times the cell conductance) in our device decreases as the programming voltage increases because of the highly nonlinear exponential V-t relation. For ultimate hardware acceleration, future research should explore even more efficient synaptic devices with a fast programming speed of nanoseconds and a logic-compatible programming voltage of less than 3 V.
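The energy trend stated above can be checked with simple arithmetic on the three operating points quoted in the text (1.5 V/100 ms, 3 V/5 ms, and 9 V/50 μs). The sketch below is only an estimate: it assumes a constant cell conductance of 80 nS, which is the initial value read at −1.5 V, whereas the actual conductance at the programming voltage may differ. The names `G_CELL` and `spike_energy` are illustrative.

```python
# Operating points quoted in the text: (pulse amplitude in V, pulse width in s).
points = [(1.5, 100e-3), (3.0, 5e-3), (9.0, 50e-6)]

# Assumption: constant cell conductance, taken as the 80 nS initial read value.
G_CELL = 80e-9  # siemens

def spike_energy(voltage, width, conductance=G_CELL):
    """Energy per training pulse, E = V^2 * t * G, in joules."""
    return voltage ** 2 * width * conductance

energies = [spike_energy(v, t) for v, t in points]
for (v, t), e in zip(points, energies):
    print(f"{v:4.1f} V, {t * 1e3:8.3f} ms -> {e * 1e9:7.3f} nJ")
```

Under this constant-conductance assumption the estimates are roughly 18 nJ, 3.6 nJ, and 0.32 nJ, respectively: the pulse width shrinks much faster than V² grows, so the energy per pulse falls as the voltage rises.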
3D synaptic array and linearity tuning of weight updates
Figure 4(a) shows the normalized synaptic characteristics of 16 devices in a 4 × 4 3D cross-point array measured using the conventional UP-scheme. The conventional UP-scheme, identical to that used in figure 2, consists of identical square P-pulses of 3 V/5 ms or D-pulses of −3 V/5 ms. Although the synaptic responses of different devices in the array were reasonably uniform, the weight updates were highly nonlinear for both potentiation and depression. The change of cell conductance was more dramatic during the first few P/D-pulses and became saturated as the number of pulses increased. If every training pulse results in a different response in weight update depending on the current weight state, the cumulative effect on weight update does not follow the simple addition and subtraction rules that are typically assumed in software-based NNs. For example, as shown in figure 4(a), the synaptic weight of a cell undergoing 25 P-pulses was not the same as that of a cell undergoing 50 P-pulses plus 25 D-pulses; this is attributed to the nonlinearity (NL) of the weight update. In this study, NL was defined quantitatively as:
where G_P(n) and G_D(n) are the conductance values after the nth P-pulse and nth D-pulse, respectively. The values are normalized to the total plasticity and range from 0 to 1 during an update sequence comprising an equal number (N) of consecutive P-pulses and D-pulses. For a completely linear update, NL is equal to zero, as shown by the linear dashed line in figure 4(a). In our measurement, when N = 50, NL was 0.6-0.81 at n = 25. High NL values have also been reported in other RRAM-based synaptic devices [12, 14–17, 23], and have been successfully simulated by state-dependent oxygen ion migration [24]. At the first P-pulse, O²⁻ moves rapidly from the TaOx bulk toward the Ta/TaOx interface. The conductance change then becomes much more gradual because charge accumulation reduces the internal electric field and suppresses further O²⁻ migration toward the interface. A similar phenomenon can also be cited to explain the nonlinear depression curve. Because nonlinear characteristics degrade learning accuracy [26–29], various approaches for optimizing device designs and P/D-pulse-training schemes have been proposed to reduce NL [26]. One of the proposed P/D-pulse-training schemes applies nonidentical training pulses with a state-dependent pulse width and pulse amplitude to achieve linear weight updates [26]. However, such state-dependent nonidentical training pulses may considerably increase the complexity of peripheral circuit designs because the current state must be recorded in memory units outside the synaptic arrays or verified before every training pulse. Therefore, we propose a new BP-scheme (figure 4(b)) that uses only state-independent identical training pulses. In the BP-scheme, each main P/D-pulse of ±3 V/5 ms is followed by an additional weaker offset pulse of ∓2 V/2 ms with the opposite polarity.
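The following sketch illustrates how a nonlinearity figure of this kind can be computed from normalized curves. Because equation (1) is not reproduced in this text, the sketch assumes a midpoint measure, |G_P(N/2) − G_D(N/2)|, which is zero for a linear update and corresponds to the comparison of 25 P-pulses against 50 P-pulses plus 25 D-pulses described above; it should not be read as the exact definition used in this study. The saturating-exponential curve shape, the symmetric depression curve, and the time constant `tau` are likewise hypothetical.

```python
import numpy as np

def potentiation_curve(n_pulses, tau=None):
    """Normalized conductance after 0..n_pulses P-pulses, rising from 0 to 1.

    tau=None gives an ideal linear update; otherwise an assumed saturating
    exponential with time constant tau (in units of pulses).
    """
    n = np.arange(n_pulses + 1)
    if tau is None:
        return n / n_pulses
    return (1.0 - np.exp(-n / tau)) / (1.0 - np.exp(-n_pulses / tau))

def midpoint_nonlinearity(g_p, g_d):
    """Assumed NL measure: gap between the two curves at pulse number N/2."""
    mid = (len(g_p) - 1) // 2
    return abs(g_p[mid] - g_d[mid])

N = 50
results = {}
for label, tau in [("linear", None), ("tau=20", 20.0), ("tau=10", 10.0)]:
    g_p = potentiation_curve(N, tau)
    g_d = 1.0 - g_p   # assumed symmetric depression, falling from 1 to 0
    results[label] = midpoint_nonlinearity(g_p, g_d)
    print(label, round(float(results[label]), 3))
```

In this toy model a shorter `tau` (faster saturation) gives a larger value, which mirrors the qualitative behavior described for the UP-scheme.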
Compared with the conventional UP-scheme, the proposed BP-scheme demonstrated a considerably improved NL of 0.42-0.54 with comparable, if not superior, uniformity across different cells in the 4×4 3D cross-point array. Figure 5(a) illustrates the synaptic potentiation and depression characteristics of the device with different P/D-pulse numbers (N = 30, 50, and 70) measured using both the UP- and BP-schemes. The number of analog weight states and total plasticity can be increased by increasing the number of P/D-pulses in both schemes. The conventional UP-scheme showed a noticeable rapid change of conductance at the first P/D-pulse, which led to a high NL. By contrast, the BP-scheme effectively eliminated the rapid change through its additional offset pulse. Although the detailed mechanism is still under investigation, we presume that the weak pulse with opposite polarity offsets the initial rapid charge accumulation at the Ta/TaOx interface (TaOx bulk) during the first P-pulse (D-pulse) but still allows gradual charge build-up for a more linear weight update. The engineered offset pulse shifted the synaptic curve nearly in parallel through an offset value approximately equal to the amount of conductance change at the first P/D-pulse of the UP-scheme (figure 5(b)).
Figure 4. Normalized synaptic characteristics of 16 devices in a 4×4 3D cross-point array using (a) the conventional UP-scheme and (b) the proposed BP-scheme. The P-pulse and D-pulse waveforms are also illustrated. The conductance was read using a −1.5 V/1 ms read pulse, and the values are normalized to the total plasticity from 0 to 1. The dashed line represents a completely linear update. The extracted nonlinearity values were calculated according to equation (1). The nonlinear weight update can be significantly improved using the BP-scheme.
The evolution of the nonlinear weight update as a function of the training pulse number can be expressed as:
where G_initial is the initial conductance before training, G_offset is the offset value induced by the BP-scheme, and G_1, G_2, τ_1, and τ_2 are empirical fitting parameters. Figure 6(a) presents the measured potentiation curves for both schemes and the fitting results derived using equation (2) with different G_offset values, where the conventional UP-scheme was fitted using G_offset = 0 and the BP-scheme was fitted using G_offset = 27 nS, with the other parameters remaining the same. Figure 6(b) shows the normalized potentiation and depression characteristics associated with different offset values. In this study, we assumed a symmetric potentiation and depression behavior for simplicity. Figure 6(c) shows the NL value as a function of G_offset. Clearly, NL can be reduced by increasing G_offset through the proposed BP-scheme.
Simulation result
To investigate the impact of nonlinear weight updates on training accuracy, we simulated the training evolution of an 8×8 binary alphabetic 'B' pattern [35] by using the experimental synaptic characteristics presented in figure 4 . To evaluate the fault tolerance of the training process, the initial conductance values in 8×8 synapses were randomized, and 10% error bits were present in each 8×8 training pattern. Each synapse was either potentiated or depressed according to the 1 (black) or 0 (white) information in each training pattern. Ideally, the synaptic weights in the 8×8 synapses would be trained to learn the binary alphabetic 'B' pattern, regardless of the initial cell variability and input noise. 
where G_target(i) and G_trained(i) are the target and trained conductance, respectively, at the ith synapse in a system with a total of m synapses, and the conductance values are normalized to the total plasticity from 0 to 1. The root mean square in equation (3) calculates the average discrepancy between the ideal alphabetic 'B' pattern and the trained weight map. The calculated accuracy as a function of the training cycle is illustrated in figure 7(b). The ideal linear update scheme achieved nearly 100% training accuracy in 60 cycles. Notably, during the first few training cycles, the nonlinear weight update resulted in a higher initial training accuracy because of the rapid conductance change (i.e., the information was learned more quickly). However, it was also highly susceptible to input noise, which is always present in adaptive neuromorphic computing with imprecise input data. With a nonlinear update, the less frequent noise can easily overwrite the more frequent information, and the fault tolerance of the HNN can be compromised. Therefore, the BP-scheme with a lower NL exhibited a significant improvement in the final training accuracy, which peaked at approximately 90%, compared with the UP-scheme.
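The training procedure described above can be sketched in a few lines to show why a state-dependent update is less tolerant of input noise. This is a toy model under stated assumptions, not the simulation used in this study. The 8×8 bitmap `B_PATTERN` is hypothetical, and the nonlinear update is an assumed saturating exponential with a hypothetical time constant `tau` instead of the measured curves of figure 4. Since equation (3) is not reproduced in this text, accuracy is taken as one minus the root-mean-square error.

```python
import numpy as np

# Hypothetical 8x8 bitmap of the letter 'B' (1 = black, 0 = white).
B_PATTERN = np.array([[int(c) for c in row] for row in [
    "01111100", "01000010", "01000010", "01111100",
    "01000010", "01000010", "01111100", "00000000",
]], dtype=float)

def update(g, potentiate, n_states=50, tau=None):
    """Apply one P- or D-pulse to every synapse (normalized conductance 0..1)."""
    if tau is None:
        step = 1.0 / n_states                      # ideal linear update
        new_g = np.where(potentiate, g + step, g - step)
    else:
        # Assumed state-dependent update: large steps far from saturation.
        a = 1.0 / (1.0 - np.exp(-n_states / tau))  # asymptote; N pulses span 0..1
        k = 1.0 - np.exp(-1.0 / tau)
        new_g = np.where(potentiate, g + k * (a - g), g - k * (g - (1.0 - a)))
    return np.clip(new_g, 0.0, 1.0)

def train(tau, n_cycles=100, error_rate=0.1, seed=0):
    """Return the accuracy (assumed: 1 - RMS error) after each training cycle."""
    rng = np.random.default_rng(seed)
    g = rng.random(B_PATTERN.shape)                # randomized initial weights
    history = []
    for _ in range(n_cycles):
        flips = rng.random(B_PATTERN.shape) < error_rate   # ~10% error bits
        noisy = np.where(flips, 1.0 - B_PATTERN, B_PATTERN)
        g = update(g, noisy > 0.5, tau=tau)
        history.append(1.0 - np.sqrt(np.mean((B_PATTERN - g) ** 2)))
    return np.array(history)

linear = train(tau=None)
nonlinear = train(tau=10.0)
print(f"final accuracy: linear {linear[-1]:.3f}, nonlinear {nonlinear[-1]:.3f}")
```

Both runs share the same seed, so they see identical initial weights and identical noise. In the nonlinear case a single erroneous pulse applied to a nearly saturated synapse undoes the effect of many correct pulses, which lowers the final accuracy.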
Conclusions
We fabricated and characterized a 3D two-layer Ta/TaOx/TiO2/Ti cross-point synaptic array. The array emulates the high-density and high-connectivity NN in the neocortex of the human brain, and it demonstrates excellent uniformity and reproducibility among intralayer and interlayer cells. We also described several aspects of HNN design considerations by using our experimental results. The design trade-off between power and speed of training complied with an exponential V-t dependence. Nonlinear weight updates compromised the fault tolerance of neuromorphic computing in a training process that considered device variability and imprecise data input. We also proposed a BP-scheme that significantly improved the NL of weight updates, consequently improving the training accuracy.
Figure 7(b). Training accuracy, defined by equation (3), as a function of the training cycle. The training accuracy of the UP-scheme increased more rapidly at the initial training stage but saturated earlier because of poor immunity to input noise. The BP-scheme with lower nonlinearity showed a significant improvement in final training accuracy.
