Abstract-The metal-insulator-transition (MIT) phenomenon in strongly correlated oxides, such as NbO2, has shown oscillation behavior in recent experiments. In this work, an MIT-based two-terminal device is proposed as a compact oscillation neuron for the parallel read operation from the resistive synaptic array. The weighted sum is represented by the frequency of the oscillation neuron. Compared to the complex CMOS integrate-and-fire neuron with tens of transistors, the oscillation neuron achieves significant area reduction, thereby alleviating the column pitch matching problem of the peripheral circuitry in resistive memories. Firstly, the impact of MIT device characteristics on the weighted sum accuracy is investigated when the oscillation neuron is connected to a single resistive synaptic device. Secondly, the array-level performance is explored when the oscillation neurons are connected to the resistive synaptic array. To address the interference of oscillation between columns in simple cross-point arrays, a 2-transistor-1-resistor (2T1R) array architecture is proposed at a negligible increase in array area. Finally, a circuit-level benchmark of the proposed oscillation neuron against the CMOS neuron is performed. At the single neuron node level, the oscillation neuron shows >12.5X reduction of area. At the 128×128 array level, the oscillation neuron shows a reduction of ~4% in total area, >30% in latency, ~5X in energy and ~40X in leakage power, demonstrating its advantage of being integrated into the resistive synaptic array for neuro-inspired computing.
I. INTRODUCTION
Implementing brain-inspired neural networks on conventional CPU/GPU platforms based on the sequential von Neumann architecture is computationally expensive and power hungry. Although several custom CMOS ASIC accelerators have been developed to implement these networks (e.g. IBM's TrueNorth [1]), the SRAM-based synaptic arrays still occupy most of the silicon area. The SRAM's row-by-row operations are essentially sequential, further degrading the performance. Therefore, to improve speed and power efficiency, it is attractive to explore emerging nano-device technologies for the synapse devices, such as the resistive memory (RRAM) [2]. The resistive cross-point array has been proposed to perform the weighted sum and weight update operations in a neural network [3] [4] [5], which are the most time-consuming steps in learning and classification algorithms. The weighted sum (or vector-matrix multiplication) operation is illustrated in Fig. 1(a). When an input vector (of voltages) is fed into the cross-point array, the weighted sum current (modulated by the weight or conductance of each RRAM synapse) sinks to the neuron node at the end of the column. Typically, arrays communicate via spikes or digital bits. A CMOS integrate-and-fire neuron circuit is needed to convert the analog current to spikes (essentially acting as an analog-to-digital converter). A counter then counts the number of spikes and converts them into digital bits. However, today's CMOS integrate-and-fire neuron typically requires tens of transistors. Fig. 1(b) shows a design example [6]. Such a complex CMOS neuron causes the column pitch matching problem: multiple columns have to share one neuron, which reduces the parallelism because time-multiplexing is needed to sequentially read out all the weighted sums from the array.
In this work, we study the feasibility of a compact oscillation neuron using the metal-insulator-transition (MIT) device to replace the CMOS neuron. Prior RRAM designs [3, 4, 7] mostly focused on the synaptic array core instead of the peripheral neuron node. A recent experimental work demonstrated the oscillation neuron with a small-scale synaptic array [8]; however, how to design a large-scale synaptic array with oscillation neurons remains unexplored. The contributions of this work include:
• Analyze the impact of MIT device characteristics and RRAM weights on the weighted sum accuracy with both simulation and analytical approaches.
• Develop the weighted sum operation scheme on the cross-point array and discuss its vulnerability to the interference (crosstalk) problem between columns, for which the 2T1R array is proposed as a solution.
• Perform calibration of the read cycle time for accuracy improvement, and Monte Carlo simulation of weighted sum operations showing the tradeoff between accuracy and read latency.
• Benchmark the total area, latency, energy and leakage power of the oscillation neuron against the CMOS neuron at both sub-circuit and array level, showing significant advantages of the proposed oscillation neuron.
ICCAD '16, November 07-10, 2016, Austin, TX, USA
II. METAL-INSULATOR-TRANSITION PHENOMENON
The metal-insulator-transition (MIT) phenomenon occurs in strongly correlated oxides, where the oxides switch between a metallic state and an insulating state under certain external excitation, thermal or electrical [9]. The MIT device shows a threshold switching I-V characteristic with hysteresis and, theoretically, an ON/OFF ratio of 2-5 orders of magnitude. For the Mott transition in strongly correlated oxides, the bandgap collapses when the carrier density in the material exceeds the critical carrier density nc, resulting in the insulator-to-metal transition. The carrier density can be increased by either thermal injection or electric injection. Therefore, the threshold switching has a critical temperature (TC) or a critical threshold voltage (VTH). Among all the Mott oxides, the literature has focused extensively on VO2 as the representative material system for studying the physical mechanism. However, VO2 is not suitable for on-chip integration because its TC of ~67 °C [10] is relatively low, and the threshold switching behavior disappears above TC. What makes the circuit design more challenging is that the VTH of VO2 depends strongly on the ambient temperature even below TC. For this reason, we select an emerging material, NbO2, whose extremely high TC of ~808 °C [10] gives it superior thermal stability. Recent experiments have shown the on-chip integration of NbO2 with the CMOS platform [11].
The MIT device has been listed as an emerging device candidate in the ITRS roadmap for logic switches [12], but it still lacks demonstrations showing it to be competitive in practical applications. For example, the steep-slope field-effect transistor with strongly correlated oxides as the channel material suffers from low carrier mobility [13]. The recent revival of the MIT device is owing to its capability to serve as a two-terminal selector device that suppresses the sneak paths in the cross-point memory array [11]. Different from these works, we propose to use the MIT device as the oscillation neuron in neuromorphic computing. The use of coupled oscillators with phase encoding for computationally hard optimization problems has been proposed [14] [15] [16]; here, however, we take a different approach: we utilize the oscillation as an integrate-and-fire neuron's output waveform. Fig. 2(a) shows the hysteretic I-V characteristic of a typical MIT device [9]. We have built a Verilog-A behavioral model to capture the switching characteristics with parameters such as the resistance in the ON/OFF state (RON/ROFF), the threshold voltage (VTH), and the hold voltage (VHOLD). The MIT device is initially in the OFF state, and it switches to the ON state once the applied voltage exceeds VTH. When the voltage across the MIT device falls below VHOLD, it switches back to the OFF state. Therefore, the resistive switching in the MIT device is essentially "volatile", unlike the "non-volatile" resistive switching in the RRAM.
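The threshold-switching behavior described above can be sketched as a two-state model with hysteresis. The following Python snippet is only an illustration of that idea and is not the Verilog-A model used in this work; the class name `MITDevice`, the helper `sweep`, and the default parameter values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MITDevice:
    """Hysteretic threshold switch: OFF -> ON above v_th, ON -> OFF below v_hold."""
    r_on: float = 1e3      # ON-state resistance in ohms (illustrative value)
    r_off: float = 1e6     # OFF-state resistance in ohms (illustrative value)
    v_th: float = 0.7      # threshold voltage in volts (illustrative value)
    v_hold: float = 0.5    # hold voltage in volts (illustrative value)
    is_on: bool = False    # the device starts in the OFF state

    def update(self, v: float) -> float:
        """Apply voltage v across the device and return its resistance."""
        if not self.is_on and v > self.v_th:
            self.is_on = True
        elif self.is_on and v < self.v_hold:
            self.is_on = False
        return self.r_on if self.is_on else self.r_off


def sweep(device: MITDevice, voltages):
    """Return the ON/OFF state after each voltage step of a sweep."""
    states = []
    for v in voltages:
        device.update(v)
        states.append(device.is_on)
    return states


# Sweep the voltage up and then back down; 0.6 V is visited in both directions.
up_then_down = [0.0, 0.4, 0.6, 0.8, 0.6, 0.4, 0.0]
states = sweep(MITDevice(), up_then_down)
print(states)
```

In this sketch the device is OFF at 0.6 V on the way up and ON at 0.6 V on the way down, which is the hysteresis window between VHOLD and VTH. The state is lost once the voltage drops below VHOLD, reflecting the volatile nature of the switching.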
The intrinsic transition time in the MIT device is defined as the time required to switch between RON and ROFF. To make the neuron node oscillate, we have to connect a resistor (e.g. an RRAM synapse) in series with the MIT device, as shown in Fig. 2(b). We assume the RRAM resistance (R) is between the MIT device's RON and ROFF, and there is a parasitic capacitance C at the neuron node. Initially, when the voltage VDD is applied, the capacitor at the neuron node charges up because most of the voltage drops across the MIT device (ROFF>R). Once the node voltage exceeds VTH, the MIT device switches to RON, and the capacitor starts discharging since the voltage drop on the MIT device becomes small (RON<R). Once the node voltage decreases below VHOLD, the MIT device switches back to ROFF. This charging and discharging process repeats, so the voltage of the neuron node oscillates between VHOLD and VTH. Fig. 2(c) shows a SPICE simulation waveform for the circuit configuration in Fig. 2(b). As the charging is through the RRAM and the discharging is through the MIT device at RON, the RC delay of the charging is larger than that of the discharging, which makes the voltage oscillation a triangular waveform. The oscillation of the MIT device in such circuit configurations has been widely observed in various experiments [8, [17] [18] [19] [20], showing its feasibility as the oscillation neuron. By applying Kirchhoff's current law to the neuron node of Fig. 2(b), the analytical solution of the charging time trise from VHOLD to VTH can be obtained, which is expressed as

$$t_{rise} = R_r C \ln\left(\frac{V_{DD} R_r / R - V_{HOLD}}{V_{DD} R_r / R - V_{TH}}\right) \tag{1}$$

where R_r = (R||R_OFF). Similarly, the discharging time tfall from VTH to VHOLD can be calculated as:

$$t_{fall} = R_f C \ln\left(\frac{V_{TH} - V_{DD} R_f / R}{V_{HOLD} - V_{DD} R_f / R}\right) \tag{2}$$

where R_f = (R||R_ON). If we assume ROFF>>R>>RON, then R_r ≈ R and R_f ≈ R_ON, which makes trise proportional to the RRAM resistance and tfall a constant much smaller than trise. We can obtain the ideal oscillation frequency f by using Eq. (1):

$$f \approx \frac{1}{t_{rise}} \approx \frac{W}{C \ln\left(\frac{V_{DD} - V_{HOLD}}{V_{DD} - V_{TH}}\right)} \tag{3}$$

where W=1/R is the weight of the RRAM synapse. f is then proportional to the RRAM weight. Therefore, the oscillation frequency represents a weighted sum if the MIT device connects to all the RRAM weights in one column.
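As a numerical illustration of the charging and discharging analysis, the sketch below evaluates first-order RC expressions for a node that charges toward VDD·ROFF/(R+ROFF) with time constant (R||ROFF)·C and discharges toward VDD·RON/(R+RON) with time constant (R||RON)·C. These expressions are a reconstruction from the circuit description rather than a quotation, and the function names and default values (VDD=1.2V, VTH=0.7V, VHOLD=0.5V, RON=1kΩ, ROFF=1MΩ, C=100fF) are assumptions for the example.

```python
import math


def oscillation_timing(r, vdd=1.2, v_th=0.7, v_hold=0.5,
                       r_on=1e3, r_off=1e6, c=100e-15):
    """Return (t_rise, t_fall, frequency) for a series resistance r in ohms."""
    # Charging phase: MIT device is OFF, node settles toward vdd*r_off/(r+r_off).
    r_r = r * r_off / (r + r_off)
    v_inf_off = vdd * r_off / (r + r_off)
    # Discharging phase: MIT device is ON, node settles toward vdd*r_on/(r+r_on).
    r_f = r * r_on / (r + r_on)
    v_inf_on = vdd * r_on / (r + r_on)
    if v_inf_off <= v_th or v_inf_on >= v_hold:
        raise ValueError("no oscillation: node cannot reach V_TH or fall to V_HOLD")
    t_rise = r_r * c * math.log((v_inf_off - v_hold) / (v_inf_off - v_th))
    t_fall = r_f * c * math.log((v_th - v_inf_on) / (v_hold - v_inf_on))
    return t_rise, t_fall, 1.0 / (t_rise + t_fall)


def ideal_frequency(w, vdd=1.2, v_th=0.7, v_hold=0.5, c=100e-15):
    """Ideal limit R_OFF >> R >> R_ON: frequency proportional to weight w = 1/R."""
    return w / (c * math.log((vdd - v_hold) / (vdd - v_th)))


for w in (10e-6, 100e-6):
    t_rise, t_fall, f = oscillation_timing(1.0 / w)
    print(f"W = {w * 1e6:.0f} uS: t_rise = {t_rise * 1e9:.3f} ns, "
          f"t_fall = {t_fall * 1e9:.3f} ns, f = {f / 1e6:.0f} MHz, "
          f"ideal f = {ideal_frequency(w) / 1e6:.0f} MHz")
```

With these assumed parameters, a tenfold increase in weight yields close to a tenfold increase in frequency, which is the proportionality the weighted sum relies on.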
III. DESIGN FOR ACCURATE WEIGHTED SUM
In this section, we will set up appropriate MIT device parameters, and then discuss the dependence of the oscillation frequency on applied voltage (VDD) and RRAM weight. The simulation is based on the circuit configuration of Fig. 2(b).
A. Setup of MIT Device Parameters
A recent experimental study has shown that VHOLD depends on the electrode work function and can be as low as 0.5V, while VTH can be reduced to 1V with smaller oxide thickness [20]. In this case, VDD is preferred to be 0.5V+1V=1.5V to make the voltage swing of the oscillation centered at half VDD. However, this may disturb the RRAM resistance, as the voltage across the RRAM can reach 1V. In this work, we use a VDD of 1.2V, assuming that VTH can be further scaled down to 0.7V by device engineering towards smaller oxide thickness. We also assume a resistance ON/OFF ratio of 1000 can be achieved with RON=1kΩ and ROFF=1MΩ to support a wide range of RRAM weights in large-scale arrays, where the parasitic capacitance of one column can be at least several tens of fF; here we use 100fF as a default parameter. It is noted that the ON/OFF ratio of MIT devices reported today is typically ~100, while theory predicts that in the single-crystalline phase it can be up to 10^5 [9], or 10^6 if a new material, e.g. SiTe, is used [21].
B. Effect of Intrinsic Transition Time
As discussed earlier, the weighted sum will be proportional to the oscillation frequency if trise is much larger than tfall. However, this statement assumes that the MIT device's intrinsic transition time is negligible. To investigate the impact of the transition time, we simulate the oscillation frequency as a function of transition time at two different weights, 10µS and 100µS, as shown in Fig. 3(a). Compared to the analytical results obtained by using Eq. (1) and (2), the deviation becomes more noticeable as the transition time increases beyond 10ps. Even when the oscillation frequency is low (<300 MHz) at the smaller RRAM weight of 10µS, the need for a fast transition of ~10ps is not relaxed. The reason can be attributed to the voltage undershoot below VHOLD that leads to a larger trise, as shown in Fig. 3(b). If the transition time is comparable to the RC delay in the discharging phase, the discharging does not stop until the MIT device switches back to a resistance that is high enough to start charging the neuron node. Therefore, the transition time has to be smaller than the discharging RC time to avoid this undershoot issue. It has been reported that the oscillation frequency of MIT devices with the circuit configuration in Fig. 2(b) can reach several tens to hundreds of MHz [20, 22]. It is highly probable that the reported frequency is limited by the parasitic RC delay in the off-chip electrical measurement setup. Fortunately, it has been reported that the intrinsic transition time in the MIT device can be as fast as picoseconds or even in the femtosecond range, as suggested by optical laser pump-probe methods [10].
C. Effect of Applied Voltage Change
Fig. 4(a) shows the oscillation frequency as a function of VDD for different weights. It can be seen that the onset of oscillation happens at VDD=VTH=0.7V. The frequency is roughly proportional to VDD beyond ~1V. This simulation result can be directly verified by using Eq. (1) and (2).
Varying VDD seems to be useful as an encoding scheme of the input vector for the weighted sum operation. However, this might not work in an array because there will be current leakage from one row to another when the row voltages are different. Moreover, VDD should not be so large (~1.5V) as to disturb the RRAM device, as mentioned earlier. Within this limited range from 1V to 1.5V, it is difficult to split VDD into multiple levels due to noise considerations and practical bias circuit design constraints. Therefore, it is preferred that the input vector be represented by digital pulses with the same VDD to avoid these issues. We will discuss this in the next section, where the oscillation neurons are integrated with the cross-point array to perform array-level operations.
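The roughly linear dependence of the frequency on VDD can be made plausible with a short expansion in the ideal limit ROFF>>R>>RON, in which the node charges toward VDD through R alone and tfall is neglected. This is a sketch under those assumptions; the symbols V_M and ΔV are introduced here only for convenience.

```latex
\begin{align*}
f &\approx \frac{W}{C \ln\!\left(\dfrac{V_{DD}-V_{HOLD}}{V_{DD}-V_{TH}}\right)},
\qquad V_M = \tfrac{1}{2}\,(V_{TH}+V_{HOLD}),\quad \Delta V = V_{TH}-V_{HOLD},\\[4pt]
\ln\!\left(\frac{V_{DD}-V_{HOLD}}{V_{DD}-V_{TH}}\right)
 &= \ln\!\left(\frac{1+u}{1-u}\right)
 = 2\,\operatorname{artanh}(u)
 \approx 2u,
\qquad u = \frac{\Delta V}{2\,(V_{DD}-V_M)},\\[4pt]
f &\approx \frac{W\,(V_{DD}-V_M)}{C\,\Delta V}.
\end{align*}
```

The approximation holds when u is small, i.e. when VDD is well above the midpoint of the oscillation window. For example, taking VTH=0.7V and assuming VHOLD=0.5V, u=0.25 at VDD=1V and the linear form differs from the logarithmic one by only a few percent, so the frequency grows linearly with VDD, offset by V_M.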
D. Effect of Weight Change
The general criterion for the RRAM weight is that its resistance should be within the range of the MIT device resistance (from RON to ROFF) to make the neuron node oscillate. It is also preferred that the resistance can satisfy the condition ROFF>>R>>RON to ensure an accurate weighted sum. Fig. 4(b) shows the frequency as a function of the RRAM weight. Since RON=1kΩ and ROFF=1MΩ, the oscillation would fail when the weight is approaching 1µS and 1mS. The linear region is located at weight values from ~10µS to 100µS. This can be explained by the following: For small weights (large RRAM resistance), the RRAM resistance cannot be ignored compared to the large ROFF, thus the voltage drop on the MIT device is smaller than expected, leading to larger trise and lower oscillation frequency. For large weights (small RRAM resistance), the RRAM resistance cannot be ignored compared to the small RON, thus tfall becomes noticeable and oscillation is slowed down. In addition, the intrinsic transition time serves as a hard limit for the oscillation frequency, which will also have insignificant impact on large weights as the frequency is approaching this limit.
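The weight range over which the neuron node can oscillate follows from two voltage-divider conditions: with the MIT device OFF the node must be able to charge past VTH, and with the device ON it must be able to discharge below VHOLD. The sketch below checks these conditions under an idealized instantaneous-switching model; the function names are hypothetical and VHOLD=0.5V is an assumed value.

```python
def oscillates(w, vdd=1.2, v_th=0.7, v_hold=0.5, r_on=1e3, r_off=1e6):
    """True if a total weight w (siemens) can sustain oscillation (ideal model)."""
    r = 1.0 / w
    charges_past_vth = vdd * r_off / (r + r_off) > v_th        # OFF-state divider
    discharges_below_vhold = vdd * r_on / (r + r_on) < v_hold  # ON-state divider
    return charges_past_vth and discharges_below_vhold


def weight_window(vdd=1.2, v_th=0.7, v_hold=0.5, r_on=1e3, r_off=1e6):
    """Return (w_min, w_max) in siemens bounding the oscillating range."""
    r_max = r_off * (vdd - v_th) / v_th      # solves vdd*r_off/(r+r_off) = v_th
    r_min = r_on * (vdd - v_hold) / v_hold   # solves vdd*r_on/(r+r_on) = v_hold
    return 1.0 / r_max, 1.0 / r_min


w_min, w_max = weight_window()
print(f"oscillation window: {w_min * 1e6:.2f} uS to {w_max * 1e6:.0f} uS")
for w in (1e-6, 10e-6, 100e-6, 1e-3):
    print(f"W = {w * 1e6:g} uS -> oscillates: {oscillates(w)}")
```

Under these assumptions the window sits just inside the 1µS to 1mS range set by ROFF and RON, and the linear region from ~10µS to 100µS lies well within it.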
IV. ARRAY IMPLEMENTATION FOR WEIGHTED SUM OPERATION
A. Cross-point Array Architecture
The resistive cross-point array architecture with synaptic devices has been proposed to perform the weighted sum operation in a neural network [3, 4, 7] , where the cross-point array represents the weight matrix, with the algorithm weight values mapped to the RRAM device weight range. In this work, we assume the algorithm weight values are normalized between 0 and 1, corresponding to the RRAM minimum and maximum weight, respectively. Fig. 5 shows the weighted sum operation in the cross-point array architecture. The input vector is encoded into a digital number of pulses, which controls the transmission gates at each word line (WL) row. Each row will be connected to a fixed voltage if it is selected (Si=VS), otherwise the transmission gate is turned off and the row becomes floating (unselected). Then, the total weight of a column is the sum of weights at the selected rows, where the equivalent circuit of a column becomes the configuration in Fig. 2(b) . With the MIT device connected to the bit line (BL) column, each column can oscillate at different frequencies based on the total weight of the column. The inverter at each column helps restore the oscillation waveform to the rail-to-rail rectangle pulses (VDD to 0), and the ripple counter can convert the number of pulses into a digital value (in binary fashion). However, typically the resistance of a synaptic RRAM device with continuous weight tuning has a limited ON/OFF ratio <10 [23, 24] , which makes the minimum RRAM weight not small enough thus it cannot represent a 0 value in the algorithm. To solve this problem, we add a dummy column with all the cells at the minimum weight to eliminate this weight offset. Eq. (3) shows that ideally the oscillation frequency is proportional to the weight, thus we can subtract the output value of the dummy column from that of the array column to obtain the accurate weighted sum. 
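The offset removal with the dummy column can be illustrated at the conductance level. The sketch below assumes a linear mapping from an algorithm weight in [0, 1] to a device conductance between the minimum and maximum RRAM weight (0.4µS and 2µS are used as example values), and assumes the ideal case in which the pulse count is proportional to the total column conductance. The function names are hypothetical.

```python
def column_conductance(inputs, weights, g_min=0.4e-6, g_max=2e-6):
    """Total conductance (siemens) of the selected rows in one column.

    inputs: 0/1 row-select bits; weights: algorithm weights in [0, 1].
    """
    return sum(s * (g_min + w * (g_max - g_min)) for s, w in zip(inputs, weights))


def weighted_sum_from_conductance(inputs, weights, g_min=0.4e-6, g_max=2e-6):
    """Recover sum(s_i * w_i) by subtracting an all-g_min dummy column."""
    g_col = column_conductance(inputs, weights, g_min, g_max)
    g_dummy = column_conductance(inputs, [0.0] * len(inputs), g_min, g_max)
    return (g_col - g_dummy) / (g_max - g_min)


inputs = [1, 0, 1, 1]
weights = [0.5, 1.0, 0.0, 0.25]
recovered = weighted_sum_from_conductance(inputs, weights)
expected = sum(s * w for s, w in zip(inputs, weights))
print(recovered, expected)
```

Because the dummy column sees the same selected rows as the array column, the offset term (minimum weight times the number of selected rows) cancels exactly in this idealized picture.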
Finally, since the input vector is formed with digitized pulses using a binary representation, completing the entire weighted sum task requires shifting and adding the weighted sum values obtained at the different input vector cycles to get the final weighted sum. Although the cross-point array has a simple structure for performing the weighted sum operation, it has the commonly known sneak path problem that causes interference (or crosstalk) between cells. This problem also arises with the oscillation neuron. When the unselected rows are floating, they become leakage paths between columns that oscillate at different frequencies, so the frequency of each column can be disturbed by the other columns. The worst case is when one column has a total weight W1 and each of the other columns has the same total weight W2. Then, the voltage oscillation at the W1 column may be significantly affected by the group oscillation behavior of all the W2 columns.
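Returning to the shift-and-add step mentioned at the start of this paragraph, the sketch below shows how per-cycle weighted sums for a bit-serial binary input can be combined. It is an algorithm-level illustration only and ignores all circuit effects, including the interference just discussed; the function names and the 4-bit example values are assumptions.

```python
def bit_plane(values, bit):
    """Extract one bit (0 = LSB) from each non-negative integer input."""
    return [(v >> bit) & 1 for v in values]


def bit_serial_weighted_sum(inputs, weights, n_bits=4):
    """Accumulate per-cycle column sums, weighting cycle b by 2**b."""
    total = 0.0
    for bit in range(n_bits):
        selected = bit_plane(inputs, bit)                          # rows driven this cycle
        cycle_sum = sum(s * w for s, w in zip(selected, weights))  # one array read
        total += cycle_sum * (1 << bit)                            # shift and add
    return total


inputs = [5, 3, 12, 0]             # 4-bit input values
weights = [0.5, 1.0, 0.25, 0.75]   # algorithm weights in [0, 1]
result = bit_serial_weighted_sum(inputs, weights)
direct = sum(x * w for x, w in zip(inputs, weights))
print(result, direct)
```

The bit-serial result matches the direct dot product, which is why each read cycle only needs rows to be either selected or unselected at a single fixed voltage.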
To conduct the array-level SPICE simulation, we set the array size to be 128×128, and the minimum and maximum value of a single RRAM weight are 0.4µS and 2µS (ON/OFF ratio=5), respectively. In this case, the total weight of a column can be easily added up to several 10's to 100's µS, which is within the resistance range of the MIT device from the earlier setup. We then simulate all the possible worst cases in the array with different values of W1 and W2 at the linear weight range to analyze how much interference can occur between columns, as shown in Fig. 6(a) . The value of W2 is taken as n×W1, where n is from 1/5 to 5 because the RRAM weight ON/OFF ratio is 5. The weight difference between columns is at most 5 times with the same number of rows activated. We measure the number of pulses after the counter within 30 ns, and the results in Fig. 6(a) suggest that the deviation from the ideal number of output pulses is generally large at many combinations of W1 and W2. There are even extreme cases where no oscillation occurs at low W1 with W2>W1. Low W1 could have more floating rows, leading to larger interference from W2 columns. In addition, if W2>W1, the faster oscillation of W2 can constantly interrupt the oscillation behavior of W1. An oscillation waveform of a failure case with W1=20µS and W2=80µS is shown in Fig. 6(b) , where the MIT device never switches and the voltage just fluctuates with a small magnitude. 
B. 2-Transistor-1-Resistor (2T1R) Array Architecture
To eliminate the sneak path current that causes interference between columns in the cross-point array, a transistor can be added in series with the RRAM device, as in the conventional 1-transistor-1-resistor (1T1R) array architecture for memory applications. The 1T1R array architecture has been used for performing the weighted sum operation with a modification of the BL direction, making the BL the input row as in the cross-point array [25]. Similarly, the WL is in parallel with the BL and controls all the transistors on a row, so there is no interference if the transistors on the entire row are turned off. However, in the 1T1R array, different numbers of selected rows will affect the total parasitic capacitance on the source line (SL) column, which may hamper the weighted sum accuracy according to Eq. (3). This capacitance variation is due to the transistor drain capacitance: it is isolated from the SL column if the transistor is turned off, and otherwise contributes to the parasitic capacitance of the SL column. To alleviate this effect, we extend the 1T1R array by adding one more transistor adjacent to the existing transistor, constructing a 2-transistor-1-resistor (2T1R) array architecture, as shown in Fig. 7. The additional transistor is controlled by the inverted WL signal with its drain floating. In this way, the additional transistor serves as a complementary parasitic capacitance for the SL column. Each cell contributes one drain and two source parasitic capacitances independent of the WL signal, as one of the transistors is turned on while the other is turned off. With a 2T1R array size of 128×128, the total SL column capacitance is measured to be ~125fF based on the transistors in a 65nm CMOS technology. Following the same simulation setup as in the previous section, we have simulated the deviation of the number of output pulses across the wide range of weight values, and the results show that the maximum deviation is only ~2%, which is a significant improvement over the results in Fig. 6(a). Although the 2T1R architecture seems to have a larger overhead in the synapse array area compared to the simple cross-point architecture, it should be noted that the array area is determined by the pitch of the peripheral circuits in the logic design rule. For example, the array cell height should be aligned with the standard cell height of the WL driver, which is basically the height of two transistors. Therefore, the array area overhead of the 2T1R array can be considered negligible.
C. Simulation of Weighted Sum Operation in Array
As the accuracy deviation due to the array architecture is largely resolved, we have to revisit the effect of RRAM weight change to optimize the weighted sum accuracy. Fig. 8(a) shows the oscillation frequency as a function of weight, similar to Fig. 4(b), but with a parasitic capacitance of 125fF as in the 2T1R array. From the algorithm perspective, it is expected that the weighted sum of one column in a 128×128 array should have a maximum value of 128 if all the inputs are 1 (Si=VS) and all the algorithm weight values are also 1. On the circuit side, we have to determine the read cycle time of the input vector that translates the oscillation frequency to the desired number of output pulses to match the value from the algorithm. Due to the nonlinearity in Fig. 8(a), the read cycle time has to be calibrated in the linear weight region with the corresponding algorithm value to prevent overestimation, since the actual frequency slightly decreases outside of the linear region. For the array implementation, the calibration should be done with both the actual column and the dummy column. Therefore, a better approach is to measure the deviation between the slopes of the two curves (in log-log scale) in Fig. 8(a), as shown in Fig. 8(b). We select two weights with the same deviation so that they cancel each other out, and measure the read cycle time required for the corresponding algorithm weighted sum value. In this case, since the weight of the real column (70µS) is 5× larger than the weight of the dummy column (14µS), we need to calibrate the read cycle time to give 70µS/2µS=35 pulses, and it is measured to be ~30ns. The linear region is centered at W~30µS. To improve weighted sum accuracy, the mapping from the algorithm to the real weighted sum result should be calibrated in a case where the slope deviations of the array column and the dummy column cancel out. The 5× factor is the maximum weight difference between columns.
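The calibration arithmetic can be sketched with a first-order RC model of the oscillator in place of the SPICE simulation. The snippet below estimates the read cycle time for which the pulse count of a 70µS column minus that of a 14µS dummy column equals 35. The analytical frequency model, the function names, and the value VHOLD=0.5V are assumptions for this example; the other defaults follow the parameters stated in the text.

```python
import math


def frequency(w, c=125e-15, vdd=1.2, v_th=0.7, v_hold=0.5, r_on=1e3, r_off=1e6):
    """First-order RC estimate of the oscillation frequency (Hz) for weight w (S)."""
    r = 1.0 / w
    r_r, v_off = r * r_off / (r + r_off), vdd * r_off / (r + r_off)
    r_f, v_on = r * r_on / (r + r_on), vdd * r_on / (r + r_on)
    t_rise = r_r * c * math.log((v_off - v_hold) / (v_off - v_th))
    t_fall = r_f * c * math.log((v_th - v_on) / (v_hold - v_on))
    return 1.0 / (t_rise + t_fall)


def calibrated_read_time(w_col, w_dummy, target_pulses):
    """Read time for which (column pulses - dummy pulses) equals target_pulses."""
    return target_pulses / (frequency(w_col) - frequency(w_dummy))


t_read = calibrated_read_time(w_col=70e-6, w_dummy=14e-6, target_pulses=35)
print(f"estimated read cycle time = {t_read * 1e9:.1f} ns")
```

Under these assumptions the estimate lands within a few nanoseconds of the ~30ns quoted above. It treats the pulse count as a continuous quantity and ignores the finite transition time, so it should be read as a consistency check rather than a substitute for the simulated calibration.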
Then, we run a Monte Carlo simulation with 12,800 weighted sum tasks in a 128×128 2T1R array based on the calibrated read cycle time. We assume both the input vector and the weights are 4-bit values with a uniform distribution. As shown in Fig. 9, the weighted sum tasks with the calibrated read cycle time of ~30ns have only a small weighted sum accuracy deviation (the average is ~2.5%). However, if the application can tolerate more accuracy deviation than this, we can accelerate the read process by using a shorter read cycle. If the read cycle is reduced by a factor of 2^n, then the final weighted sum result needs to be shifted left by n bits to match the algorithm weighted sum range. Fig. 9 shows a clear tradeoff between the accuracy and the read cycle time. We also simulated the weighted sum tasks with a doubled read cycle time (~60ns); however, it does not show a noticeable accuracy improvement over the 30ns case.
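The tradeoff between read cycle time and accuracy can be illustrated with a simple counting model: a shorter window captures fewer whole pulses, and the left shift restores the scale but not the lost resolution. The sketch below assumes an ideal counter that counts complete oscillation periods; the column frequency of 1.52GHz is a hypothetical value chosen for the example.

```python
def count_pulses(freq_hz, read_time_s):
    """Ideal counter output: number of complete oscillation periods in the window."""
    return int(freq_hz * read_time_s)


def fast_read(freq_hz, full_read_time_s, n):
    """Read with a 2**n shorter window, then shift left by n bits to rescale."""
    short_count = count_pulses(freq_hz, full_read_time_s / (1 << n))
    return short_count << n


freq = 1.52e9         # hypothetical column frequency in Hz
t_full = 30e-9        # calibrated read cycle time from the text
full = count_pulses(freq, t_full)
for n in range(4):
    approx = fast_read(freq, t_full, n)
    print(f"n = {n}: result = {approx}, error = {full - approx}")
```

In this model the error introduced by shortening the window by 2^n is bounded by 2^n counts, which gives an intuition for why the accuracy deviation grows as the read cycle shrinks.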
Finally, the performance of the proposed oscillation neuron is benchmarked with that of the CMOS neuron [6] at the 65nm technology node. Table I shows the sub-circuit level benchmark results. To make a fair comparison, we follow the same simulation setup as [6] . The performance is evaluated within 8 integrate-and-fire cycles with RRAM weight to be ~53µS. Despite a ~40% increase in latency, the compact oscillation neuron circuit achieves tremendous reduction in area, energy and leakage power. Table II shows the array level benchmark results. The synaptic array size is set to be 128×128 and there are 4 pulse cycles for the input vector. In practical array design, multiple columns usually share one neuron to improve the area efficiency. From the array's point of view, the oscillation neuron does not gain much benefit in total area (synapse array area + peripheral neuron area) because the total area is still dominated by the array core. However, the oscillation neuron eventually outperforms the CMOS neuron in latency. As the oscillation neuron is more compact, the number of columns shared by one neuron can be reduced from 8 to 4, thereby increasing the parallelism. 
V. CONCLUSION
The MIT device has been proposed as an oscillation neuron for the parallel weighted sum operation in the RRAM synaptic array. In this work, we studied the impact of MIT device parameters and provided design guidelines for future MIT device engineering. To enable the weighted sum in large-scale arrays, an MIT device with a large ON/OFF resistance ratio is desired. The feasibility of the RRAM synaptic array with oscillation neurons is also studied. To prevent oscillation interference between array columns, the 2T1R array architecture is preferred over the cross-point architecture at a negligible expense of array area. The read cycle is calibrated in the array design to improve the weighted sum accuracy. Monte Carlo simulation of weighted sum tasks shows the tradeoff between the weighted sum accuracy and the read latency. Compared to the CMOS neuron [6], the oscillation neuron shows >12.5X reduction of area at the single neuron node level, and shows a reduction of ~4% in total area, >30% in latency, ~5X in energy and ~40X in leakage power at the 128×128 array level, demonstrating its advantage for neuro-inspired computing. The impact of variations on the weighted sum accuracy, including the variation of the RRAM weight and of MIT device characteristics such as RON, ROFF, VHOLD and VTH, will be studied in our future work.
