Abstract: Recently, the 2-D cross-point array of resistive random access memory (RRAM) has been proposed for implementing the weighted sum and weight update operations to accelerate neuro-inspired learning algorithms on chip. This paper aims to extend the 2-D cross-point array to a 3-D vertical array for storing and computing the large-scale weight matrices of a neural network. Considering that the fabrication and 3-D integration of analog synapses (i.e., multilevel RRAM devices) are still immature, we propose using today's available digital (binary) RRAM devices to implement a ternary neural network, which aggressively reduces the weight precision to ternary levels (+1, 0, −1) for the weighted sum in both feedforward and backward inference, while the multiple 3-D layers serve to accumulate the small errors in a higher precision format for weight update. Compared to the 2-D implementation, the proposed 3-D vertical implementation shows a larger read/write margin for weighted sum/weight update, and lower latency and energy consumption for weight update. This paper demonstrates the attractiveness of building a monolithic 3-D neuromorphic hardware platform.
neural networks. In recent years, "analog" synapses based on emerging nonvolatile memories such as phase-change memory (PCM) [2] and resistive random access memory (RRAM) [3], [4], which exploit multilevel states, have been demonstrated at the single-device level. Furthermore, there are a few experimental implementations of simple neural networks on small-scale arrays (e.g., a 12 × 12 TiOx/Al2O3 crossbar [5]) to medium-scale arrays (e.g., a 256 × 256 PCM array [6]). However, recent device-algorithm co-simulations [7], [8] have identified significant design challenges in using such analog synapses. Although neural networks can tolerate random effects such as device variations or noise to a certain degree, systematic effects, particularly the nonlinearity of the weight update process (conductance versus number of programming pulses), may remarkably degrade the learning accuracy. Unfortunately, almost all the reported analog synapses [2]-[8] suffer from this nonlinear weight update. Alternatively, in this paper, we propose using the more mature digital RRAM as "binary" synapses to avoid the nonlinear weight update problem [9], through the co-design of algorithm compression and 3-D array architecture.
Neuro-inspired algorithms generally utilize large weight matrices in multiple layers (i.e., hundreds by hundreds or even thousands by thousands). To load all the weight elements on chip and eliminate off-chip memory access, it is very attractive to extend the 2-D cross-point array to a 3-D array to improve the integration density. Among the various monolithic 3-D integration approaches for RRAM [10]-[14], the 3-D vertical RRAM (V-RRAM), especially the word-plane type of cross-point array in which the RRAM cell is sandwiched between the pillar electrode and the multilayer plane electrodes, is very attractive owing to its bit-cost scalability, with only one critical lithography step and a smaller number of necessary interconnection lines [15]. Therefore, this paper aims to explore 3-D V-RRAM arrays for implementing the weighted sum and weight update operations to accelerate neuro-inspired learning algorithms. A comparative study between the 3-D V-RRAM array and the 2-D cross-point array implementations is performed on a case study algorithm named the ternary neural network (TNN). The TNN is a multilayer perceptron (MLP) algorithm that aggressively reduces the weight precision to ternary levels (+1, 0, −1), inspired by the recent trend of network pruning and parameter compression in the deep learning community [16], [17].
Different from other papers on the neuromorphic applications of 3-D V-RRAM, which exploit the stochastic switching behavior [18], [19] or the spiking behavior [4], [20], this paper focuses on the more mature binary RRAM with good fabrication yield. Moreover, the prior works [18], [19] utilize all the pillar electrodes as the input vector and the plane electrodes as the weighted-sum outputs. Such an operation scheme limits the number of output neurons to the number of vertical layers. To overcome this limitation, in this paper we propose a novel operation scheme that combines the select lines (SLs) and the word-plane electrodes as the input vector, while all the bit lines (BLs) serve as the weighted-sum outputs.
The remaining sections are organized as follows. In Section II, we introduce a TNN case study in which we compress the multilayer perceptron (MLP) to (+1, 0, −1) ternary weights for feedforward/backward inference, and then describe how we map such a neural network to the 2-D RRAM array architecture. Section III explains the 3-D V-RRAM architecture and the proposed operation scheme to implement the TNN. Section IV shows the comparison results between the 2-D and 3-D implementations, including the design space exploration on write/read margin and the performance evaluation on area/latency/energy consumption. Finally, the conclusions are drawn in Section V.
II. TERNARY NEURAL NETWORK

A. MLP Neural Network Setup
As a case study, we used an MLP neural network with the MNIST handwritten digits [21] as the training and testing data set to implement online training and classification. The network topology and the procedure of the MLP algorithm are shown in Fig. 1. The proposed 400-200-10 network has three layers: an input layer of 400 neurons corresponding to the 20 × 20 pixels (we cropped the edges of the original 28 × 28 images), a hidden layer of 200 neurons, and an output layer of 10 neurons corresponding to the 10 classes of digits (0-9). Therefore, it has two synaptic weight matrices (W^H_IH: 400 × 200 and W^H_HO: 200 × 10). Here the superscript H means high precision format, the subscript IH means input to hidden layer, and HO means hidden layer to output. To support low precision computation, the MNIST input images are converted to black and white (1-b data). All weight and neuron values are bounded between −1 and 1 and converted to a binary representation. Commonly, a high precision format of 6-b (including 1-b for sign) is needed for the synaptic weights for the MNIST data set [9]. The reason is that back-propagation passes small training errors from the output layer to the input layer; if the precision is insufficient, such small errors will not be accumulated in the weight update. In the feedforward inference stage, first we make use of the two's complement coding scheme illustrated in Table I
where f denotes the sigmoid activation function [22] truncated to a limited precision of s bits. When s equals one, the function simplifies to the Heaviside step function.
In the back-propagation stage, we first need to calculate the training errors from the output layer to the input layer by the stochastic gradient descent method. When we compute the activation gradient of each layer, we keep using the ternarized weight matrices W^L_IH and W^L_HO. However, when we perform the weight update to accumulate the training errors, the higher precision weight matrices W^H_IH and W^H_HO are used.
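The flow described above, in which a ternarized copy of the weights is used for inference while a higher precision copy accumulates the small errors, can be sketched as follows. This is a minimal single-layer illustration, not the authors' implementation: the function names, the 0.5 ternarization threshold, the learning rate, and the simplified delta-rule update (standing in for full back-propagation) are all assumptions introduced here.

```python
def ternarize(w_high, threshold=0.5):
    """Map high-precision weights in [-1, 1] to ternary levels (+1, 0, -1).

    The threshold is an assumption for illustration; the paper instead
    truncates a two's complement code.
    """
    return [[1 if w >= threshold else -1 if w <= -threshold else 0 for w in row]
            for row in w_high]


def step(x):
    """1-b neuron: Heaviside step activation."""
    return 1 if x > 0 else 0


def forward(inputs, w_low):
    """Weighted sum with ternary weights followed by the 1-b activation."""
    n_out = len(w_low[0])
    sums = [sum(inputs[i] * w_low[i][j] for i in range(len(inputs)))
            for j in range(n_out)]
    return [step(s) for s in sums]


def update(w_high, inputs, errors, lr=0.1):
    """Accumulate small errors in the high-precision weights, clipped to [-1, 1]."""
    return [[max(-1.0, min(1.0, w_high[i][j] + lr * inputs[i] * errors[j]))
             for j in range(len(errors))]
            for i in range(len(inputs))]


w_high = [[0.45, -0.7], [0.9, 0.1]]   # hypothetical high-precision weights
x = [1, 1]                            # 1-b inputs
target = [1, 1]

w_low = ternarize(w_high)             # ternary copy used for inference
y = forward(x, w_low)
err = [t - o for t, o in zip(target, y)]
w_high = update(w_high, x, err)       # small errors accumulate here
```

In this toy run the update is too small to change the ternary copy immediately, which illustrates why the higher precision matrix is kept: repeated small updates can eventually move a weight across the ternarization boundary.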
B. Adaptation to the 2-D Cross-Point Array
As mentioned above, one of the most important procedures in the algorithm is the weighted sum (or matrix-vector multiplication). In the hardware implementation of learning algorithms, the weighted sum can be performed using the conventional 2-D cross-point architecture, as shown in Fig. 2. The conductance of each RRAM device represents a matrix element or synaptic weight. Since in this paper we adopt binary RRAM cells, multiple RRAM cells need to be grouped together to represent a high precision weight (for weight update), from the sign bit and the MSB to the LSB. In this paper, we suppose the weights have 8-b precision (including 1-b for sign) for the MNIST data set (a larger data set could require higher precision), which means the 2-D cross-point architecture must be at least 400 (WLs) × 1600 (BLs) to map the largest weight matrix W^H_IH of the TNN. For feedforward inference, as shown in Fig. 2(a), the analog input vector values are represented by the number of pulses applied to the rows, and the output vector values are represented by the column currents, which allows all the read, multiplication, and accumulation operations to occur in parallel. The weighted sum relationship between the output vector and the input vector can be derived
where I_1, I_2, ..., I_n are the output currents of the BLs, which form the output vector, and S_1, S_2, ..., S_m are the digitized (0/1) inputs that control the bias voltage (0 or V_r) of the WLs. W is the conductance matrix of the 2-D cross-point array.
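For reference, a relation consistent with the definitions above can be written as follows. It is a sketch reconstructed from those definitions under the assumption of ideal (zero-resistance) wires and BLs held at ground; the index convention W_{ij} (WL i to BL j) is also an assumption.

```latex
\begin{equation*}
\begin{bmatrix} I_1 & I_2 & \cdots & I_n \end{bmatrix}
= V_r \begin{bmatrix} S_1 & S_2 & \cdots & S_m \end{bmatrix} W,
\qquad
I_j = V_r \sum_{i=1}^{m} S_i \, W_{ij}.
\end{equation*}
```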
For weight updates, illustrated in Fig. 2(b) and (c), we use the read-before-write and row-by-row write schemes to update the conductance of the RRAM cells. For the read operation that calculates the weight update amount (ΔW), the WLs connected to the selected cells are biased to V_r, while all the other WLs and BLs are biased to 0; the currents of the selected cells are read out by the current sense amplifiers, and ΔW is calculated by the peripheral neuron circuits. For the write operation, the WLs and BLs connected to the selected cell are biased to V_W and 0, respectively, while all the unselected lines are biased to V_W/2. Then the new W is written back to the array.
The number of stacked layers k is determined by the feature size (F), the etching aspect ratio (AR), the thickness t_i of the isolation layer, and the thickness t_m of the plane electrode, and can be calculated as k = (F × AR)/(t_i + t_m). The state-of-the-art 3-D V-RRAM experimental prototype [19] has four layers. Here, F is defined as the diameter of the pillar electrode (d) plus twice the RRAM oxide thickness (t_ox), which is also the half pitch between the centers of neighboring pillar electrodes. Another constraint in the V-RRAM architecture is the drivability of the vertical transistor at the bottom of the pillar, which decreases as the pillar diameter decreases. Based on the scaling trend shown in [24], we adopt 100 μA as the saturation current for F = 30 nm.
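The stack-height relation k = (F × AR)/(t_i + t_m) can be evaluated with a short sketch. Only F = 30 nm comes from the text; the aspect ratio and layer thicknesses below are hypothetical values chosen for illustration, and rounding down to a whole number of layers is an assumption.

```python
import math


def max_stacked_layers(feature_size_nm, aspect_ratio, t_isolation_nm, t_metal_nm):
    """k = (F x AR) / (t_i + t_m), rounded down to a whole number of layers."""
    return math.floor(feature_size_nm * aspect_ratio / (t_isolation_nm + t_metal_nm))


# F = 30 nm is taken from the text; AR = 16 and t_i = t_m = 30 nm are
# hypothetical values used only to exercise the formula.
k = max_stacked_layers(feature_size_nm=30, aspect_ratio=16,
                       t_isolation_nm=30, t_metal_nm=30)
print(k)
```

With these illustrative numbers the etch depth F × AR is 480 nm and each layer pair consumes 60 nm, so thinner plane or isolation layers, or a higher aspect ratio, directly raise the achievable layer count.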
B. Adaptation to the TNN
To adapt the architecture to neuromorphic computing, we use the vertical dimension of the 3-D V-RRAM to represent the precision of the weights. Therefore, eight layers of 3-D V-RRAM are needed to satisfy the 8-b weight representation requirement. Essentially, the multiple RRAM cells along the same pillar electrode form one synapse. Here, we denote the RRAM cells in the first layer as the sign bit of each synapse, and the second to the kth layers as the MSB to the LSB, respectively. Similar to the 2-D cross-point array, the operation scheme of the 3-D V-RRAM can be divided into two modes: the feedforward/backward inference (read) mode and the weight update (write) mode. In the inference mode, a new read scheme (different from that of the conventional memory application) is proposed. As discussed in Section II-A, only 2 b (1 b plus the sign bit) are needed; therefore, only the bottom two layers are activated simultaneously. The WL voltage of the first layer is biased to V_r and the WL voltage of the second layer is biased to −V_r, and all the neuron inputs are applied to control the SLs, while the other WLs and the BLs are grounded, as shown in Fig. 4(a). In this way, the weighted-sum relationship between the output vector and the input vector can be derived
where I_1, I_2, ..., I_n are the output currents of the BLs, which form the output vector, and S_1, S_2, ..., S_m are the digitized (0/1) inputs that control the SLs. P_1 and P_2 are the conductance matrices of the first and second layers, respectively, which store the ternary truncated information of the full synaptic weights. In this way, we realize the dot-product function of m inputs and n outputs in parallel.
In the weight update mode, the read-before-write scheme is used and the 3-D V-RRAM operates as a traditional memory. For the read operation, we follow the "read-in-a-row" scheme: the WL voltage of the selected layer is biased to V_r, while the other WLs and the BLs are grounded, as shown in Fig. 4(b). In this way, multiple cells in the same row are read out, and the current through each BL is sensed by a current-mode sense amplifier. The current-mode sense amplifier is typically used in today's RRAM macro designs because of its smaller latency compared with the voltage-mode sense amplifier. In this paper, we assume that a current-mode sense amplifier is designed to sense the readout current from the BL, and the minimum I = 80 nA is used as the read margin criterion for a fast sensing latency within tens of ns [25].
After the read operation, the weight update amount (ΔW) is calculated in the peripheral neuron circuits, and the new W is then written back to the 3-D array. For the write operation, we follow the "V/2" write scheme: the WL and BL of the selected cells are biased to the write voltage (V_W) and 0, respectively, while the other BLs and WLs are biased to the half write voltage V_W/2 to prevent unintentional writes, as shown in Fig. 4(c). The worst case cells are those located farthest from the WL and BL inputs, with all other cells in the low resistance state.
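The inference-mode read described above, with the first layer biased to V_r, the second layer biased to −V_r, and the 0/1 inputs gating the SLs, can be sketched numerically. The sketch assumes ideal wires and selectors and the sign convention that first-layer current counts as positive and second-layer current as negative, so that the BL currents follow V_r · S · (P_1 − P_2); the function name and the conductance values are hypothetical.

```python
def vrram_weighted_sum(s, p1, p2, v_r):
    """BL currents when layer 1 is biased to +V_r and layer 2 to -V_r.

    s  : list of m digital inputs (0/1) driving the select lines
    p1 : m x n conductance matrix of the first layer (siemens)
    p2 : m x n conductance matrix of the second layer (siemens)
    Ideal wires are assumed, so IR drop and sneak paths are ignored.
    """
    m, n = len(p1), len(p1[0])
    return [sum(s[i] * v_r * (p1[i][j] - p2[i][j]) for i in range(m))
            for j in range(n)]


# Illustrative on/off conductances (hypothetical values).
G_ON, G_OFF = 2e-6, 2e-8

s = [1, 0, 1]                                    # the second input is off
p1 = [[G_ON, G_OFF], [G_ON, G_ON], [G_OFF, G_OFF]]
p2 = [[G_OFF, G_ON], [G_OFF, G_OFF], [G_OFF, G_ON]]

currents = vrram_weighted_sum(s, p1, p2, v_r=1.5)
```

Because the two layers are read differentially on the same BL, a cell pair in the same state contributes no net current, while opposite states contribute currents of opposite sign.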
IV. PERFORMANCE ANALYSIS
A. TNN Sensitivity Analysis
As shown in Fig. 5(a), we explored the accuracy of the MLP online training with different weight and neuron precisions. The results indicate that the MLP achieves the best accuracy when trained with the highest precision (32-b floating-point) for both the synaptic weights and the neuron activation functions. When we truncate the 400-200-10 MLP to ternary weights and 8-b neurons, the accuracy drops slightly, by ∼1%. When we further truncate to ternary weights and 1-b neurons, i.e., the TNN, the accuracy drops by another <1%. Such slight accuracy degradation is within expectation [16], [17].
Considering the cycle-to-cycle and device-to-device variation of the RRAM conductance, we further explore the accuracy of the TNN online training under different levels of synaptic weight variation, as shown in Fig. 5(b). In the simulation, we assume that both the high resistance state and the low resistance state of the RRAM follow a Gaussian distribution with standard deviation σ. The results demonstrate that the TNN tolerates weight variation up to 20% well, but larger variation requires more time to converge and results in an accuracy drop.
B. RRAM Simulation Parameters
In the remainder of this section, we perform the design space exploration of the 2-D array and the 3-D array for implementing the weight update operation [including the read scheme in Fig. 4(b) and the write scheme in Fig. 4(c)]. To set appropriate values for V_W and the write margin criterion, the switching voltage distribution in an array should be considered, as RRAM devices are well known for their switching parameter variability [26]. For typical RRAM devices with an average switching voltage of 2 V, a possible switching range of 1.7-2.3 V is assumed. In this paper, the write pass criterion is set to 2.5 V to ensure a safe write operation, and V_W is set to 3 V to leave a 0.5 V tolerance for the access voltage drop on the interconnect resistance. Meanwhile, V_W/2 = 1.5 V is lower than 1.7 V, which avoids disturbing the half-selected cells.
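The voltage budget above can be checked with a few lines of arithmetic. The numbers are those quoted in the text; the function and variable names are introduced here for illustration only.

```python
V_SWITCH_MIN, V_SWITCH_MAX = 1.7, 2.3   # assumed switching range (V)
V_WRITE = 3.0                            # write voltage V_W (V)
WRITE_PASS = 2.5                         # write pass criterion (V)


def write_scheme_ok(v_write, v_pass, v_min, v_max):
    """Check the two conditions of the V/2 write scheme."""
    # A selected cell that still sees the pass criterion must switch
    # even if it sits at the top of the switching range.
    selected_ok = v_pass > v_max
    # A half-selected cell sees V_W/2 and must stay below the lowest
    # switching voltage to avoid write disturbance.
    half_selected_ok = v_write / 2 < v_min
    return selected_ok and half_selected_ok


# Voltage that may be lost on the interconnect before the write fails.
ir_drop_budget = V_WRITE - WRITE_PASS

print(write_scheme_ok(V_WRITE, WRITE_PASS, V_SWITCH_MIN, V_SWITCH_MAX))
```

Raising V_W to gain more IR drop tolerance is bounded by the second condition: in this sketch any V_W of 3.4 V or more would put half-selected cells at or above the 1.7 V lower edge of the switching range.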
A summary of the parameters used in the HSPICE simulation is listed in Table II; they are adapted from the previous works [27], [28]. These parameters are also close to those of the state-of-the-art four-layer 3-D V-RRAM experimental prototype [19]. Here R_ON and R_OFF refer to the resistances at V_W. The I-V nonlinearity of the RRAM device, which can be increased by using series selector devices or tuned by band structure engineering of heterojunction oxide materials, is defined as the ratio of the current at V_W to that at V_W/2
where R(V_W) and R(V_W/2) denote the resistance of the RRAM cell at V_W and at V_W/2, respectively. It is noted that the RC delay of the 3-D V-RRAM is negligible (sub-nanosecond) according to a previous HSPICE simulation [29]; therefore, in this paper we only conduct dc simulations to evaluate the IR drop and the sneak paths.
We first explore the design space of the 2-D and 3-D structures with different R_ON (I_w) and nonlinearity ratio K_w configurations under the worst case data pattern [27], [30], as shown in Fig. 6. It shows that a high R_ON or a high K_w tends to decrease the read current of the selected cell, which results in read failure, while a high I_w or a low K_w tends to increase the write current through the unselected cells and enlarge the interconnect IR drop, which results in write failure. The interconnect IR drop effect is more severe for the 2-D structure than for the 3-D structure, which can be attributed to the fact that the WL in the 2-D structure is eight times longer, whereas the WL in the 3-D structure is essentially a plane.
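The nonlinearity definition introduced before the design space discussion can be written out explicitly. The following is a sketch under the assumption that the resistance at a bias V is the static resistance V/I(V); the factor of 2 follows from that assumption and may differ from the convention used in [27], [28].

```latex
\begin{equation*}
K_w = \frac{I(V_W)}{I(V_W/2)}
    = \frac{V_W / R(V_W)}{(V_W/2) / R(V_W/2)}
    = \frac{2\, R(V_W/2)}{R(V_W)}.
\end{equation*}
```

Under this reading, K_w = 2 corresponds to a linear (ohmic) cell, and a larger K_w means that half-selected cells biased at V_W/2 conduct proportionally less sneak current.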
We further investigate the multibit write for both structures. Theoretically, the entire row of the selected WL can be read or written in parallel. In practice, only a small number of bits are written at the same time in a planar cross-point structure. The primary reason is that the total current on the selected WL increases dramatically as the number of selected cells increases, which degrades the write margin and incurs a high area overhead for the WL drivers. Fig. 7 plots the write margin and the write power per bit as a function of the number of bits (N_b) written in parallel in the 3-D V-RRAM array and the 2-D array. In Fig. 7(a), the write margin of the 3-D V-RRAM degrades only slightly as N_b increases, suggesting that the multibit write operation is feasible in 3-D V-RRAM. In contrast, the write margin of the 2-D array decreases dramatically as N_b increases, and even for 2 b it is below the 2.5 V criterion. This is because, in the 2-D array, more cells written in parallel means a larger current flowing in the selected word line (WL), resulting in a larger IR drop. The 3-D word plane, however, has a much lower resistance than the WL; thus, even when more cells are written in parallel, the IR drop effect is insignificant. The array's write power per bit as a function of N_b is shown in Fig. 7(b).
We also estimate the latency, area, and energy performance of the 2-D and 3-D implementations. According to the design space in Fig. 6, we assume an RRAM on-resistance of 500 kΩ, an off-resistance of 50 MΩ, and a nonlinearity of 20× for the analysis. For online training, the RRAM programming condition is assumed to be 3 V/50 ns, and the read condition is assumed to be 1.5 V/20 ns. The read voltage is boosted to compensate for the nonlinearity of the low resistance state. The performance comparison (for one step of inference and weight update) is shown in Table III. First, the 3-D structure reduces the area by eight times if only the array core is considered.
For the feedforward inference, both the 2-D and 3-D RRAM arrays perform the matrix multiplication in parallel in just one step, and the energy consumption of the 2-D cross-point array is larger than that of the 3-D V-RRAM because the energy dissipated in the line resistance of the 2-D structure is larger than that dissipated in the plane electrode resistance of the 3-D structure. For online training, we estimate the latency and energy required to update all the cells in the array. According to Fig. 7(a), a row-by-row write scheme (200 b in parallel) can be used in the 3-D structure, but only a 1-b write scheme is allowed in the 2-D structure to meet the safe write criterion. The large latency and energy of the 2-D structure are mainly caused by the sequential, bit-by-bit RRAM write operation. Overall, the 3-D V-RRAM shows much better performance than the 2-D cross-point structure in terms of latency (for weight update) and energy (for both inference and weight update).
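The effect of write parallelism on update latency can be illustrated with a back-of-the-envelope sketch. It counts only the 50 ns programming pulses needed to touch every cell of the 400 × 200 weight matrix stored at 8-b precision, and ignores the read-before-write step and all peripheral delays; it is therefore an assumption-laden estimate and is not meant to reproduce the values in Table III. The function name is hypothetical.

```python
def write_latency_s(total_bits, bits_per_write, pulse_s):
    """Time to program every cell once, counting only the write pulses."""
    n_writes = -(-total_bits // bits_per_write)   # ceiling division
    return n_writes * pulse_s


TOTAL_BITS = 400 * 200 * 8      # largest weight matrix stored with 8-b weights
PULSE = 50e-9                   # 50 ns programming pulse from the text

t_2d = write_latency_s(TOTAL_BITS, 1, PULSE)     # bit-by-bit write (2-D)
t_3d = write_latency_s(TOTAL_BITS, 200, PULSE)   # 200 b in parallel (3-D)

print(f"2-D: {t_2d * 1e3:.1f} ms, 3-D: {t_3d * 1e6:.0f} us")
```

Under these simplifications the latency ratio is simply the write parallelism, which is consistent with the qualitative conclusion that sequential bit-by-bit writes dominate the 2-D update cost.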
V. CONCLUSION
Through a comparative study of the TNN, which aggressively reduces the weight precision to ternary levels (+1, 0, −1), this paper extends the exploration of neuromorphic computing accelerators from the 2-D cross-point structure to 3-D vertical integration. On the one hand, the TNN enables implementation with currently available binary RRAM devices, which overcomes the nonlinear weight update problem of the still immature "analog" synapses. On the other hand, compared to the 2-D implementation, the proposed 3-D V-RRAM implementation shows a larger read/write margin for weighted sum/weight update, and lower latency and energy consumption for weight update. This paper demonstrates the attractiveness of building a monolithic 3-D neuromorphic hardware platform.
