A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging; it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update in learning algorithms. In this work, forming-free, silicon-process-compatible Ta/TaOx/TiO2/Ti synaptic devices are fabricated, in which >200 levels of conductance states can be continuously tuned by identical programming pulses. To demonstrate the advantages of parallelism of the cross-point array architecture, a novel fully parallel write scheme is designed and experimentally demonstrated in a small-scale crossbar array to accelerate the weight update in the training process, at a speed that is independent of the array size. Compared to the conventional row-by-row write scheme, it achieves >30× speed-up and >30× improvement in energy efficiency as projected in a large-scale array. If realistic synaptic device characteristics such as device variations are taken into account in an array-level simulation, the proposed array architecture is able to achieve ∼95% recognition accuracy of MNIST handwritten digits, which is close to the accuracy achieved by software using the ideal sparse coding algorithm.
Introduction
Conventional von Neumann digital computers suffer from two major issues: one is the physical scaling limits of the complementary metal-oxide-semiconductor (CMOS) technology; the other is the low efficiency as compared to biological systems in dealing with complex, real-world environment problems [1]. Alternative approaches such as neuro-inspired architecture, with distributed computing and localized storage in neural networks, have attracted great attention due to their tolerance of fault and error, and massively parallel computation [2]. The primary goal of neuro-inspired computing is to develop application-specific hardware for tasks such as visual/sensory data classification and autonomous robotics. In recent years, progress has been made in building large-scale neuromorphic hardware, such as SpiNNaker [3], FACETs [4], Caviar [5] and TrueNorth [6]. Conventional CPU/GPU-based systems are inadequate for fast online training with a huge dataset, having limitations in their power consumption and scalability. A custom-designed CMOS chip with SRAM arrays as the weight matrix has shown advantages over CPU/GPU [6], but it still has limitations such as binary bit storage and a sequential write/read process. To achieve further speed-up, one promising approach is to realize a fully parallel write/read using a cross-point array architecture [7, 8], where each cross-point is implemented with a resistive synaptic device.
A resistive synaptic device [9, 10] is a type of resistive random access memory (RRAM) [11, 12] or memristor [13] that exhibits analog memory functionality with multilevel tunable weights, which emulates the biological synapse in a neural network. Prior works have explored material systems based on oxides and chalcogenides for synaptic devices [14-26], and most of these works focused on characterization of the behavior of individual devices, such as multilevel capability, spike-time-dependent plasticity, and short-term and long-term potentiation. Recently, there has been significant progress in the fabrication of crossbars (or cross-point arrays) and their integration with CMOS circuits [27], including several array-level demonstrations of weight update for neuro-inspired computing [28-31]. However, in these demonstrations, updating the synaptic weights in a large-scale array is very time-consuming, since row-by-row, or even bit-by-bit, sequential operation was used. Moreover, the multilevel programming in these works generally depends on a feedback loop and application of voltage pulses that vary in amplitude and/or width, which is undesirable for online training and also adds significant complexity to the peripheral circuit design. Reference [29] demonstrated pattern classification with in situ training that realizes a parallel weight update; however, the 20 individual devices had to be manually wired in a 10×2 crossbar circuit because of the one-time forming process with high voltage. In this work, we fabricate forming-free synaptic devices based on a Ta/TaOx/TiO2/Ti stack that require no current compliance. Compared to our prior work [26], we reduce the oxide thickness in this work, which lowers the programming voltage from 9 V to 3 V. The conductance of these devices can be continuously tuned by varying only the number of identical write pulses.
Owing to the smooth change of the conductance state, feedback is not required to tune the device to the target state in the device's dynamic range. In addition to the improvement of the device characteristics, we also propose a novel programming scheme to fully parallelize the weight update process in one shot [32], aiming at a speed that is independent of the array size. As a case study, the weight update process implements the delta rule in the sparse coding algorithm [33, 34], which is a class of unsupervised learning algorithms to extract sparse features from dense input signals. We experimentally demonstrate the proposed scheme in a small-scale crossbar array with the Ta/TaOx/TiO2/Ti synaptic devices. Figure 1 shows a schematic diagram of the synaptic cross-point array that holds the dictionary values (D, weight matrix), and connects Z neuron nodes (sparse feature representation on outputs) and R neuron nodes (residual error of data representation on inputs). The delta rule requires that all the D values stored in the cross-point array update by an amount proportional to the product of Z and R (i.e., Z·R). We perform a proof-of-concept experiment on the fabricated 2×2 crossbar array to demonstrate that the entire cross-point array can be programmed in parallel, and thereby the programming speed is not limited by the array size. Besides the weight update (write) process, we also experimentally demonstrate the weighted sum (read) process on this array. As a further step, we consider the challenges to scaling up the array size, such as device variations, through theoretical modeling. Projection of the latency and energy consumption to a large-scale (100×300) network shows the great benefits of the proposed fully parallel scheme over the conventional row-by-row scheme.
Experiment
A Ta/TaOx/TiO2/Ti resistive synaptic device was built vertically at the sidewall with 20 μm² active switching area, as shown in the inset of figure 2(a). First, a Ti plane electrode (100 nm) and SiO2 isolation layers (100 nm) were deposited on 500 nm SiO2 substrates by using DC magnetron sputtering and plasma-enhanced chemical vapor deposition, respectively. Vertical pillar structures were patterned using a photolithography process, followed by dry etching until the SiO2 substrates were reached. Then, TaOx (10 nm)/TiO2 (30 nm) bilayers were sequentially deposited by reactive magnetron sputtering using Ta and Ti targets, respectively. Additional photolithography and dry etching steps were used to open the contacts to the Ti plane electrode. Finally, the Ta vertical electrodes were deposited by using DC magnetron sputtering, followed by a lift-off process. The vertical TaOx/TiO2-based cells are formed at the sidewall between the Ti pillar electrode and Ta plane electrode with an active switching area of 20 μm². All the processes were performed at room temperature and no noble metals were used, for silicon CMOS process compatibility. The fabrication procedure was close to that described in reference [26], with the major difference being a thinner oxide, which lowered the programming voltage from 9 V to 3 V. The experimental setup consisted of a Keithley 4200 semiconductor characterization system with four source-measure units (SMUs) and a Cascade Microtech 12000-series probe station. The 4200 SMUs were used for the quasi-DC I-V measurement, for creating arbitrary waveforms down to 5 ms pulse width, and for current sensing. The four SMUs were connected to the 2×2 array via the Cascade Microtech 12000-series probe station with four DCM 200-series precision probes. The Keithley User Library Tool (KULT) was used to create test modules as C programs, which were executed by the Keithley Interactive Test Environment (KITE).
Results and discussion

Individual device characteristics
Figure 2(a) shows the typical bipolar resistive switching I-V characteristics of the individual synaptic device. The as-fabricated device is highly resistive, and can be programmed without electrical forming or current compliance, which is probably due to the homogeneous (non-filamentary) barrier modulation mechanism [35]. Unlike typical RRAM devices [11, 12], which exhibit gradual resistance change only in the reset process (from a low resistance state to a high resistance state) while the set process (from a high resistance state to a low resistance state) is abrupt, here the conductance states can be gradually increased and decreased over the full dynamic range with no compliance current, which is crucial for implementing the weight update on-chip. Figure 2(b) shows that the incremental set transition is obtained by applying a set voltage of 3 V with ten consecutive sweeps. Similarly, an incremental reset transition is obtained by applying a reset voltage of −3 V with ten consecutive sweeps. The switching mechanism of the Ta/TaOx/TiO2/Ti resistive synaptic device is different from that of typical filamentary RRAM devices; an interface model was proposed to elucidate the observed gradual set/reset phenomenon involving a homogeneous barrier modulation effect with oxygen ion migration between the oxide and the oxide boundary [35].
In the following, we investigate the programming of our synaptic device in the pulse mode. In many reported synaptic devices [14, 16, 18, 24, 36], multilevel states were achieved by varying the pulse amplitude [16, 18, 24] or the pulse width [14], or by using a feedback scheme [36] that applies iterative write and read pulses to converge to a desired state, which adds significant complexity to the design of peripheral circuits [37]. Therefore, programming with consecutive identical pulses is preferred. In our device, synaptic weight potentiation is implemented by identical pulses of +3 V with 40 ms pulse width while synaptic weight depression is implemented by identical pulses of −3 V with 10 ms pulse width, and a −1.5 V read pulse is applied after each write pulse to check the states. Ideally, a smaller read voltage would be used, because a −1.5 V read pulse runs the risk of disturbing the device conductance. However, a small read voltage results in a much smaller dynamic range of conductance due to the I-V nonlinearity in our device. To maximize the dynamic range of conductance, we use −1.5 V as the read voltage in all the following testing. Figure 2(c) shows that the conductance can be continuously modulated by varying the number of write pulses, and 200 levels of states can be precisely tuned (see the zoom-in in the inset of figure 2(c)). Compared to the reported devices [14, 20], the nonlinearity and temporal variations of the weight update curve (conductance versus number of pulses) are significantly suppressed, which enables us to tune the conductance to any particular state without feedback. Furthermore, the reproducibility of this analog weight update was confirmed through an endurance cycling test. Figure 2(d) shows the reliable cycling capability after ten training epochs, where each epoch includes 105 identical positive pulses and 105 identical negative pulses.
Fully parallel write/read operation
Conventionally, programming a cross-point array is performed sequentially row by row or even bit by bit. This is a time-consuming process, especially on a large-scale array. To exploit the parallelism of the cross-point array, we designed a fully parallel programming scheme that speeds up the weight update in the entire array. The delta rule of weight update in many learning algorithms requires the synaptic weight update (ΔD) to be proportional to the value of the pre-synaptic Z neuron nodes and that of the post-synaptic R neuron nodes (ΔD ∝ Z·R). In our scheme, for a given programming time window, the Z nodes generate pulses with a duty cycle proportional to the Z strength, while the R nodes generate a pulse train (fixed pulse width) in which the number of pulses is proportional to the R strength. Whenever the R pulse (V_R) overlaps with the Z pulse (V_Z), it creates a net voltage drop (V_R − V_Z) across the device. Thus, the accumulated overlap time of these two pulses indicates the product Z·R. In the algorithms, R can be either positive or negative to increase or decrease the weight (Z is always positive). Therefore, the programming time window is divided into a positive period for R>0 and a negative period for R<0. Assuming that the programming voltage is V_dd, a requirement of the design is that V_dd/2 should not disturb the device conductance. In the positive period (R>0), the effective programming pulse of Z (V_Z) is 0 for an active time window proportional to Z while V_R is equal to V_dd. In this active time window, the voltage across the device is V_dd to increase the weight. After this active time window, V_Z is switched to V_dd/2, so the voltage across the device switches to V_dd/2 to prevent further programming. Similarly, in the negative period (R<0), the effective programming pulse of Z (V_Z) is V_dd for an active time window proportional to Z while V_R is equal to 0.
In this active time window, the voltage across the device is −V_dd to decrease the weight. After this active time window, V_Z is switched to V_dd/2, so the voltage across the device switches to −V_dd/2 to prevent further programming. For peripheral circuits that generate such Z and R pulses, see the design in [32]. Figure 3(a) shows an example of the waveform for Z_1=1, Z_2=0.5, |R_1|=16 and |R_2|=8. As a proof-of-concept demonstration, we built a small 2×2 array and used it for the fully parallel write/read measurement, as shown in figure 3(b). First, the four devices (D_1, D_2, D_3 and D_4) were randomly programmed to the initial states. For potentiation, pulses of either 0 or 1.5 V with different duty cycles that represent Z_1=1, Z_2=0.5 are simultaneously applied on the two rows, and a number of pulses of 3 V that represent R_1=16 and R_2=8 are simultaneously applied on the two columns; the generation of these pulses is synchronized when applied to the 2×2 array. The overlap of Z and R in D_1, D_2, D_3 and D_4 is Z_1·R_1=16, Z_1·R_2=8, Z_2·R_1=8 and Z_2·R_2=4, respectively. For depression, pulses of either 3 or 1.5 V with different duty cycles that represent Z_1=1, Z_2=0.5 are simultaneously applied on the two rows, and a number of pulses of 0 V that represent R_1=−16 and R_2=−8 are simultaneously applied on the two columns; the overlap of Z and R in D_1, D_2, D_3 and D_4 is Z_1·R_1=−16, Z_1·R_2=−8, Z_2·R_1=−8 and Z_2·R_2=−4, respectively. For each R pulse, a fixed pulse width (i.e., 100 ms pulse width for potentiation, 30 ms pulse width for depression) was used. A monitor pulse of −1.5 V was applied immediately after each write pulse to read the conductance. Figure 3(c) shows that the conductances of D_1, D_2, D_3 and D_4 have been successfully updated in parallel, with ΔD_1 ≈ 2ΔD_2 ≈ 2ΔD_3 ≈ 4ΔD_4; the slight deviation from the ideal target conductance is probably due to the nonlinear weight update. The decay (green parts in figure 3(c)) in D_2 and D_4 is due to the disturbance from the monitor pulses, and is effectively suppressed when they are removed, as shown by the brown circles in figure 3(c). Nevertheless, further device engineering such as optimizing the oxide stack thickness is required to improve the device's retention against the V_dd/2 disturbance effect. It is also noted that the proposed write scheme is more suitable for stochastic update rules (i.e., online training). For batch-mode update (i.e., offline training), the weights are pre-trained on a software-based network and then loaded sequentially into the array. Batch mode is generally superior in terms of learning accuracy, but it requires a large-scale dataset to be available for training before runtime. Offline training does not require a smooth weight update characteristic from the device, and an iterative programming scheme with feedback control (such as that proposed in [38]) can be used because the programming speed is not critical.
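The pulse-overlap idea described above can be checked with a small time-stepped simulation. This is only a sketch under stated assumptions, not the authors' peripheral circuit: each R pulse slot is divided into a hypothetical number of sub-steps (`steps`), a Z row is taken to be active for the first Z·steps sub-steps of every slot, and an idle R column is assumed to sit at V_dd/2 (the text does not state the idle column level). The function and parameter names are illustrative.

```python
V_DD = 3.0  # programming voltage (V), as quoted in the text


def z_voltage(z, sub, steps, positive):
    """Row voltage: active for the first z*steps sub-steps of every pulse slot."""
    if sub >= round(z * steps):
        return V_DD / 2  # inactive row is held at V_dd/2
    return 0.0 if positive else V_DD


def r_voltage(r, slot, positive):
    """Column voltage: |r| pulses, fired only in the period matching the sign of r."""
    pulsing = r != 0 and (r > 0) == positive and slot < abs(r)
    if not pulsing:
        return V_DD / 2  # assumed idle level (not specified in the text)
    return V_DD if positive else 0.0


def parallel_update(z_values, r_values, steps=10):
    """Signed time (in units of one R pulse width) each device spends at full |V_dd|."""
    n_slots = max((abs(r) for r in r_values), default=0)
    counts = [[0] * len(r_values) for _ in z_values]
    for positive in (True, False):  # positive period, then negative period
        for slot in range(n_slots):
            for sub in range(steps):
                for i, z in enumerate(z_values):
                    v_z = z_voltage(z, sub, steps, positive)
                    for j, r in enumerate(r_values):
                        v = r_voltage(r, slot, positive) - v_z  # V_R - V_Z
                        if v == V_DD:
                            counts[i][j] += 1  # full +V_dd: potentiation
                        elif v == -V_DD:
                            counts[i][j] -= 1  # full -V_dd: depression
    return [[c / steps for c in row] for row in counts]


print(parallel_update([1, 0.5], [16, 8]))
print(parallel_update([1, 0.5], [-16, -8]))
```

With Z = (1, 0.5) and R = (16, 8), the accumulated full-voltage time of the four devices comes out in the ratio 16 : 8 : 8 : 4, matching the Z·R products listed in the text; every other bias combination leaves at most V_dd/2 across a device.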
Besides the weight update, the weighted sum is one of the most frequent operations in neural networks. The weighted sum (matrix-vector multiplication, D·Z) is done by a parallel read operation of the cross-point array. We performed the parallel read operations in the 2×2 array. First, the four devices are programmed to a specific conductance state using the aforementioned fully parallel programming scheme. Then a read voltage of −1.5 V is applied in parallel to each row for every non-zero element of the Z vector, as shown in figure 1. At each cross-point, the read voltage is multiplied by the conductance (D) of the synaptic device, and the currents summed on each column give the weighted sum. The programmed conductance of each device in the 2×2 array and the measurement results of the weighted sum are shown in table 1. The average error between the measured and theoretical values is <2%. It is worth noting that this scheme is immune to the sneak path problem [39] of the conventional cross-point memory array. A memory array reads stored data by bit or by row, while the weighted sum operation reads the entire array in parallel. Thus, the readout is not affected by a sneak path through unselected cells; the synaptic array in fact relies on Kirchhoff's law to do the analog computation.
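The parallel read described above amounts to Ohm's law at each cross-point followed by current summation on each column. A minimal sketch follows; the conductance values are hypothetical placeholders (they are not the measured values of table 1), and the function name is illustrative.

```python
def weighted_sum(conductances, row_voltages):
    """Column currents I_j = sum_i V_i * G_ij for an ideal cross-point array.

    conductances: list of rows, conductances[i][j] in siemens-like units
    row_voltages: read voltage applied to each row (0 for rows with Z = 0)
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(row_voltages, conductances))
        for j in range(n_cols)
    ]


# Hypothetical 2x2 conductances in microsiemens (not the values of table 1).
G = [[1.0, 2.0],
     [3.0, 4.0]]
V_READ = -1.5  # read voltage (V), as in the text

# Both Z elements non-zero: both rows receive the read pulse.
print(weighted_sum(G, [V_READ, V_READ]))  # currents in microamperes
# Only Z_1 non-zero: the second row is left at 0 V.
print(weighted_sum(G, [V_READ, 0.0]))
```

The sketch ignores wire resistance and the I-V nonlinearity of the devices, both of which affect a real array.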
Device non-ideality and array scaling-up trend
Figure 3(c). Experimental results of the fully parallel weight (D) update using the product Z·R. D increases in the first half of the write period (R>0) and decreases in the second half (R<0). The values of Z_1, Z_2, R_1 and R_2 are 1, 0.5, ±16 and ±8, respectively, as shown in (a). Thus, ΔD_1 ≈ 2ΔD_2 ≈ 2ΔD_3 ≈ 4ΔD_4. The decay (green parts) in D_2 and D_4 is due to the disturbance from monitor pulses, which can be improved after the monitor pulses are removed (see data shown as brown circles).
We have demonstrated a promising Ta/TaOx/TiO2/Ti synaptic device and 2×2 array; however, the device characteristics are not perfect as presumed in software-based learning algorithms, and challenges may arise when scaling up the array size. From the learning algorithm's point of view, ideal behavior of a synaptic device typically assumes a linear update of the weight with the input stimulus, e.g. the number of pulses. However, in reality a nonlinear behavior of potentiation and depression exists in many synaptic devices [14, 20], including ours. Figure 4 shows the realistic Ta/TaOx/TiO2/Ti synaptic weight update used for the algorithm, where the data are extracted from figure 2(d) with the minimum conductance representing zero weight value. Applying positive pulses to the device increases the synaptic weight (potentiation), and applying negative pulses decreases it (depression); both curves are visibly nonlinear. These nonlinear potentiation and depression curves may differ from device to device, which is referred to as the device-to-device variation and is defined as the variation from the baseline of the nonlinear curve. Moreover, we also consider the limited precision of the synaptic weight for on-chip implementation. It can be seen in figure 4 that the synaptic weight is tuned by a discrete number of input voltage pulses, indicating that the precision of the synaptic weight is limited. To evaluate the neural network's robustness to these realistic device properties, we modeled the weight update curves of our devices and incorporated them into sparse coding algorithms [33]. Detailed discussions of the sparse coding learning algorithm can be found in [40]. We can use a general behavior of weight potentiation (W_P) and weight depression (W_D) with the number of pulses (P) that is described with the following equations:
where W_max, W_min and P_max can be directly extracted from the experimental data and represent the maximum weight, minimum weight and the maximum number of pulses required to switch the device between the minimum and maximum weights. A is the parameter that controls the nonlinear behavior of the weight update, and B is simply a function of A that fits the functions within the range of W_max, W_min and P_max. As shown in figure 4, the fitting curves for potentiation and depression have W_max=1, W_min=0.04 and P_max=110, and A is also fitted to be 110. The device-to-device variation is measured to be around 9%. Compared to the software approach, an array with realistic synaptic devices shows only ∼2% degradation in the accuracy of recognizing the MNIST handwritten digits [41], as shown in figure 5(a). It is known that, with supervised multilayer deep learning, the state-of-the-art recognition accuracy of the MNIST dataset is >99% [42]. In our case, it is a single-layer unsupervised network, and also considering the non-ideality of the device properties, ∼95% recognition accuracy of the MNIST dataset with our fully parallel write/read scheme is reasonable. Finally, we further benchmarked the performance of the proposed fully parallel write scheme against the conventional row-by-row write scheme as the array size is scaled up. There are typically two schemes for synaptic weight update. The first one is the fully parallel weight update scheme that is proposed in this work. The other one is the conventional row-by-row weight update scheme, where the synaptic weight matrix is updated sequentially from one row to another. To compare the performance of these two update schemes, we consider the latency, instant power and the energy consumption of the entire weight update process as the performance metrics. The evaluation is based on the sparse coding algorithm.
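To make the roles of W_max, W_min, P_max, A and B concrete, the following sketch implements one saturating-exponential weight-update model that is consistent with the parameter definitions above. The functional form is an assumption made for illustration (a commonly used shape for such curves), not a reproduction of the fitted equations; the helper `pulses_for_target` is likewise hypothetical and only illustrates how a smooth update curve permits open-loop (feedback-free) programming.

```python
import math

# Fitted parameters quoted in the text.
W_MAX, W_MIN, P_MAX, A = 1.0, 0.04, 110, 110.0

# Assumed normalisation: B is chosen so the curves span W_MIN..W_MAX over P_MAX pulses.
B = (W_MAX - W_MIN) / (1.0 - math.exp(-P_MAX / A))


def potentiation(p):
    """Weight after p identical potentiation pulses, starting from W_MIN (assumed form)."""
    return B * (1.0 - math.exp(-p / A)) + W_MIN


def depression(n):
    """Weight after n identical depression pulses, starting from W_MAX (assumed mirror form)."""
    return W_MAX - B * (1.0 - math.exp(-n / A))


def pulses_for_target(w):
    """Open-loop pulse count that brings the weight from W_MIN closest to the target w."""
    return round(-A * math.log(1.0 - (w - W_MIN) / B))


n = pulses_for_target(0.5)
print(n, potentiation(n))
```

A smaller A makes the curves saturate earlier (stronger nonlinearity), while a very large A approaches a linear update between W_min and W_max.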
3.3.1. Fully parallel weight update scheme. As shown in figure 3(a), the latency is just T_D + T_P, where T_D and T_P represent the entire duration for the R>0 and R<0 phase, respectively. The energy consumption for a write cycle can be calculated as in equation (4), where N_c (N_r) is the number of columns (rows) in the array, V is the voltage pulse amplitude, and R_V and R_HV are the resistances of synaptic devices biased at V and V/2, respectively, which differ due to nonlinearity. t_D and t_P are single pulse widths for depression and potentiation, respectively. Z_D and Z_P are the active Z durations in the R>0 phase and in the R<0 phase, respectively. ρ_P and ρ_D are the average percentages of R<0 and R>0 elements in the Z ≠ 0 rows. The sum of ρ_P and ρ_D should be 1. SP is the sparsity of the Z vector, which represents the percentage of Z=0 rows. On the right-hand side of equation (4), the first term in the bracket is the energy
where the max() picks whichever is larger for the number of cells on average that undergo potentiation and depression for Z ≠ 0 rows, and the second term is the instant power of Z=0 rows. It happens at the moment when the cells at multiple Z ≠ 0 rows are being updated.
3.3.2. The row-by-row weight update scheme. In the row-by-row weight update scheme, only one row is selected at a time, leaving other rows unselected. The Z inputs of those unselected rows are biased at V/2. In this scheme the latency is increased by N_r(1 − SP) times, which is N_r(1 − SP)(T_P + T_D), because those Z=0 rows can be skipped. The energy consumption will also increase, which can be expressed as
The performance benchmark of the above two schemes is shown in figure 5(b), where we use V=3 V, R_V=2.15 MΩ, R_HV=8.57 MΩ, ρ_P=ρ_D=0.5, SP=0.9, t_D=2, t_P=10, T_P=176, T_D=48, Z_P=11 and Z_D=3. The results of each performance metric are normalized by the performance value of the parallel one at N_r=N_c=10. Despite a slightly higher instant power, figure 5(b) shows that the parallel scheme exhibits >30× reduction in latency and energy compared with the row-by-row scheme when the array size is 100×300.
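The two latency expressions stated above, T_P + T_D for the fully parallel scheme and N_r(1 − SP)(T_P + T_D) for the row-by-row scheme, can be compared directly. The sketch below uses the T_P, T_D and SP values quoted in the text (their time unit is not stated there, so the results are in the same unspecified unit); the row counts are illustrative choices, and the function names are hypothetical. The energy expressions are not included.

```python
def parallel_latency(t_p, t_d):
    """Latency of one fully parallel weight update: independent of array size."""
    return t_p + t_d


def row_by_row_latency(n_rows, sparsity, t_p, t_d):
    """Latency when only the non-zero-Z rows are written, one row at a time."""
    return n_rows * (1 - sparsity) * (t_p + t_d)


T_P, T_D, SP = 176, 48, 0.9  # values quoted in the text (time unit unspecified)

for n_rows in (10, 100, 300):  # illustrative row counts
    ratio = row_by_row_latency(n_rows, SP, T_P, T_D) / parallel_latency(T_P, T_D)
    print(n_rows, round(ratio, 2))
```

The ratio reduces to N_r(1 − SP), so with 90% sparsity the row-by-row scheme becomes slower than the parallel one as soon as the array has more than ten rows, and the gap grows linearly with the number of rows.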
Conclusion
In this work, we report a resistive synaptic device based on a forming-free, silicon CMOS-compatible Ta/TaOx/TiO2/Ti structure, with 200 levels of conductance states continuously tunable by identical pulses. Then, we propose a fully parallel write and read scheme for the weight update and weighted sum operations for accelerating learning algorithms on-chip. The proposed schemes are experimentally demonstrated in a small-scale crossbar array. Compared to the conventional row-by-row write scheme, it achieves significant speed-up and improvement in energy efficiency. In addition, the learning algorithm shows a promising learning accuracy that is resilient to the non-ideality of the device. This work demonstrates the parallelism and robustness of the cross-point array architecture for implementing neuro-inspired algorithms on-chip.
Figure 5. (a) Recognition accuracy of MNIST handwritten digits with the ideal and realistic device behaviors calibrated with our samples such as device-to-device variations, nonlinear weight update and limited precision, leading to only ∼2% accuracy loss. (b) Performance benchmark of the fully parallel and the row-by-row write schemes. Despite a slightly larger instant power, the parallel scheme exhibits >30× reduction in latency and energy when array size is 100×300.
