Abstract-In this paper, we study the 1-selector-1-resistor (1S1R) cross-point resistive random access memory (ReRAM) array because of its high density, fast access time, and ultralow standby power. Specifically, we focus on an access scheme where a data line is accessed in parallel from multiple subarrays, with multiple bits accessed per subarray. A direct implementation of such a scheme has high energy efficiency but lower reliability than a baseline scheme that accesses a single bit per subarray. This paper therefore proposes a low-cost multilayer approach that improves the energy efficiency of the multibit-per-access scheme without compromising reliability. At the cell level, we show how proper choices of bit-line and source-line voltages and SET recovery help reduce the error rate by ten times. At the system level, we propose a new rotated multiarray access scheme where the average error rate of every accessed data line is one order of magnitude lower than the worst case, making it possible to achieve a block failure rate of $10^{-10}$ with a simple Bose, Chaudhuri, and Hocquenghem (BCH) t = 4 code. We show that for a 1-GB 1S1R ReRAM, the proposed approach can reduce energy by 41% with 2% extra area while maintaining latency and reliability compared with the baseline system.
Index Terms-1-selector-1-resistor (1S1R) ReRAM, energy, multibit per read/write, multilayer approach, reliability.
ReRAM can be organized into the 1-transistor-1-resistor (1T1R) or 1-selector-1-resistor (1S1R) array architecture. Of the two types of ReRAM array architectures, the cross-point 1S1R array architecture has higher integration density compared with the 1T1R architecture [1] and is hence considered in this paper. In the cross-point architecture, the bit-line (BL) and the word-line (WL) are perpendicular to each other and memory cells are sandwiched in between. Such a structure has a cell area of $4F^2$, where F is the lithography technology node. Unfortunately, the cross-point array suffers from sneak paths and IR drop, resulting in lower reliability [8]. To reduce the effect of sneak paths during memory cell operation, a highly nonlinear, bidirectional selector device (1S) is serially connected with each bipolar resistor (1R) in a 1S1R cell configuration [9]. 1S1R has almost the same area as the cross-point structure (= $4F^2$) since the selector device is vertically stacked with the ReRAM cell.
Most of the prior work on ReRAM cross-point array focused on device and circuit issues [10] [11] [12] [13] [14] [15] [16] . These include selector and ReRAM cell level designs that improve read/write margins [10] [11] [12] [13] [14] [15] . There has also been work on cross-point array organization as well as array size evaluation with respect to energy consumption and reliability. However, most of the previous work was based on "single bit per read/write" per subarray [10] [11] [12] [13] [14] , a scheme which incurred large power consumption since multiple subarrays have to be activated at a time to meet the I/O bandwidth.
In this paper, we propose a 1S1R cross-point array system with "multibit per access" per subarray that achieves high energy efficiency and good reliability. To the best of our knowledge, this is the first work that considers energy, latency, and reliability of such an architecture. It analyzes the effect of cell-level as well as array-level variation sources on error rates and proposes a low-cost scheme to maintain reliability and latency with low energy consumption.
At the cell level, we first show the effect of spatial variations and temporal variations on the resistance distribution of an ReRAM cell. We find that the errors due to temporal variation are dominated by SET failures, which can be significantly reduced by a second SET operation [SET recovery (SR)]. At the array level, we show that multibit access per read/write consumes less energy (compared with the conventional single bit access) but at the price of area overhead and lower reliability. We study the resistance distributions due to different variation sources and evaluate the corresponding bit error rate (BER). We find that the multibit group that is farthest away from the driver has the highest error rate due to IR drop.
To address the higher error rates caused by the multibit per read/write scheme, we propose a rotated multiarray access (RMA) scheme, where the multibit groups in a data line are retrieved from different locations in each subarray. This guarantees that the error characteristics of all data lines are the same and that the BER is one order of magnitude lower than that of the naïve multibit access scheme. Simulation results using NVSim [17] show that the RMA scheme with a simple Bose, Chaudhuri, and Hocquenghem (BCH) code with t = 4 helps to achieve a block failure rate (BFR) of $10^{-10}$, which corresponds to a lifetime of 10 years. We show that the proposed system reduces energy consumption by 41% with only 2% extra area overhead compared with the baseline system.
The rest of this paper is organized as follows. In Section II, we review the ReRAM basics, including reliability characteristics. In Section III, we analyze the effect of device-to-device (D2D) and cycle-to-cycle (C2C) variations on the resistance values at the cell level and show how an appropriate choice of BL and SL voltages can help improve reliability. In Section IV, we show how different variation sources, namely, D2D, C2C, as well as IR drop, affect the resistance distributions in an array. In Section V, we describe how the proposed RMA scheme can be used to relax the ECC requirement. This is followed by system-level evaluation of the proposed ReRAM system with respect to area, performance, energy, and reliability. We summarize the related work in Section VI before concluding this paper in Section VII.
II. ReRAM BACKGROUND

A. ReRAM Cell
The ReRAM device is a two-terminal variable resistor with two possible states-high-resistance state (HRS) or OFF-state and a low-resistance state (LRS) or ON-state. It is composed of insulating material sandwiched between two metal electrodes (MIM) [1] . The physical mechanism behind ReRAM cell switching is based on formation (ON-state) and rupture (OFF-state) of conductive filaments (CFs) in the oxides (or other insulating material) between the two electrodes. The switching from OFF-state to ON-state is called SET, while the switching from ON-state to OFF-state is called RESET.
B. Cross-Point ReRAM Array Architecture
There are two types of ReRAM array architectures: the 1T1R structure and the cross-point structure. In a 1T1R array, each memory cell is in series with a cell selection transistor [8]. As the size of the transistor is typically much larger than the size of the ReRAM cell, the total area of the memory array is dominated by the transistors rather than the ReRAM cells. In contrast, the cross-point architecture has a cell area of $4F^2$ and hence is more area-efficient than the 1T1R structure [8]. However, the cross-point architecture suffers from interference among cells, commonly known as the sneak path problem, which limits the array size, increases the power consumption, and degrades the reliability [8]. A two-terminal selector device is typically added in series with the ReRAM cell at each cross-point. The resulting 1S1R structure enables the design of a large-scale cross-point array by cutting off the sneak path current of the half-selected and unselected cells [8]. 1S1R has the same area as the cross-point structure (= $4F^2$) since the selector device is vertically stacked with the ReRAM cell.
1) Reliability Issues:
The cross-point array suffers from two well-known problems. The first is IR drop along the interconnect wires. The IR drop problem becomes significant when the WL and BL wire width scales to the sub-50-nm regime, where the interconnect resistivity increases drastically due to electron surface scattering [8]. During a write operation, the cell farthest from the driver sees an insufficient voltage drop, resulting in an unsuccessful write. The second is the sneak path problem through the half-selected and unselected cells. The half-selected cells along the selected WL and BL conduct leakage current and form sneak paths during the read/write operation. The sneak paths contribute current to the IR drop and further degrade the read/write margin.
C. Cross-Point ReRAM System Organization
The cross-point ReRAM system organization that is supported in NVSim is shown in Fig. 1 [17]. A 1-GB bank consists of 64 × 64 mats, where each mat consists of 2 × 4 subarrays and each subarray consists of a cell array with 512 × 512 1S1R cells (512 rows with 512 bits per row) as well as peripheral circuitry with row decoders, column multiplexers, sense amplifiers, and output drivers. A subset of mats and a subset of subarrays within each mat can be activated simultaneously. Activating multiple mats and multiple subarrays per mat improves the timing performance at the expense of higher energy. While similar timing performance can be achieved by activating multiple (say K) subarrays in one mat versus K mats with one subarray per mat, the energy consumption of activating multiple mats is higher, as shown in Section IV.
1) Baseline Cross-Point ReRAM System: The conventional cross-point ReRAM system accesses a single bit per subarray for read/write, and so we choose this as the baseline system. If the I/O width is 64 bits, then for better performance, 64 subarrays (8 mats with 8 subarrays per mat) are activated on every access. Such a scheme has high energy overhead because 64 subarrays are activated per access. In Section IV, we study a scheme that accesses multiple bits per read/write to reduce the number of subarrays that must be activated per access, resulting in higher energy efficiency.
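The organization above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, assuming one bit per 1S1R cell and taking "1 GB" to mean $2^{30}$ bytes; the function and constant names are illustrative and are not part of NVSim.

```python
# Bank organization as described in the text (illustrative names, not NVSim's API).
MATS_PER_BANK = 64 * 64          # 64 x 64 mats per bank
SUBARRAYS_PER_MAT = 2 * 4        # 2 x 4 subarrays per mat
CELLS_PER_SUBARRAY = 512 * 512   # 512 rows x 512 bits per row


def bank_capacity_bits() -> int:
    """Total number of cells in the bank, assuming one bit per cell."""
    return MATS_PER_BANK * SUBARRAYS_PER_MAT * CELLS_PER_SUBARRAY


def active_subarrays(io_width_bits: int, bits_per_subarray: int) -> int:
    """Number of subarrays that must be active to fill one I/O beat."""
    if io_width_bits % bits_per_subarray != 0:
        raise ValueError("I/O width must be a multiple of the bits accessed per subarray")
    return io_width_bits // bits_per_subarray


if __name__ == "__main__":
    print("capacity in GiB:", bank_capacity_bits() / 8 / 2**30)
    print("baseline active subarrays:", active_subarrays(64, 1))
```

Under these assumptions the bank holds $2^{33}$ bits, and the single-bit baseline needs 64 active subarrays for a 64-bit I/O width, matching the 8 mats with 8 subarrays per mat stated above.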
D. Simulation Settings for ReRAM Cell and Array
All SPICE results presented in this paper are based on an ReRAM device compact model [20] calibrated by IMEC's HfO2 ReRAM (1R) [18] and the field-assisted superlinear threshold (FAST) [19] selector model in the 22-nm technology node. The CF of HfO2 ReRAM (which is our case) is composed of oxygen vacancies as in [18] and [37]. Here, both the ON and OFF states are assumed to have the same nonlinearity of 10, defined as the ratio of the current at $V_{WRITE}$ to that at $V_{WRITE}/2$ [8]. The threshold voltage ($V_{TH}$) of FAST is set at 1.2 V. $\Delta V_{TH}$, the tolerance for $V_{TH}$ variation in selectors, is set at 0.1 V. During the read operation for a single cell, $V_{READ}$ (= 1.35 V) is set to be larger than $V_{TH,MAX} = V_{TH} + \Delta V_{TH}$ (= 1.3 V) to ensure that there is enough readout current to sense the status of the selected cells. In order to guarantee that all the half-selected and unselected cells remain OFF during the write operation, $0.5 \times V_{WRITE}$ (= 0.975 V) is set to be less than $V_{TH,MIN} = V_{TH} - \Delta V_{TH}$ (= 1.1 V). The FAST selector increases the 1S1R's nonlinearity to $10^6$ [19]. The sense amplifier is based on current mode and has a sensing speed of 10 ns [23].
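The two voltage constraints in this paragraph define a bias window around the selector threshold. The sketch below simply checks that window for the values quoted in the text; the function name is hypothetical, and the model ignores IR drop (which the boosted supply voltages compensate for).

```python
# Values quoted in the text, in volts.
V_TH = 1.2         # nominal FAST selector threshold
DELTA_V_TH = 0.1   # tolerance on the threshold
V_READ = 1.35      # read voltage across a selected cell
V_WRITE = 1.95     # write voltage across a selected cell


def bias_window_ok(v_read: float, v_write: float, v_th: float, delta: float) -> bool:
    """Check both selector constraints for the V/2 bias scheme.

    1) The read voltage must exceed the largest possible threshold so a
       selected cell always turns on.
    2) Half of the write voltage must stay below the smallest possible
       threshold so half-selected cells always stay off.
    """
    v_th_max = v_th + delta
    v_th_min = v_th - delta
    read_ok = v_read > v_th_max
    half_select_ok = 0.5 * v_write < v_th_min
    return read_ok and half_select_ok


if __name__ == "__main__":
    print(bias_window_ok(V_READ, V_WRITE, V_TH, DELTA_V_TH))
```

With the quoted values both conditions hold (1.35 V > 1.3 V and 0.975 V < 1.1 V); raising the write voltage above 2.2 V, for example, would violate the half-select condition.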
Parameter settings of the ReRAM cell, the selector, and array configurations are summarized in Table I. To guarantee a successful write operation in the cross-point array, the read and write voltages have to be boosted above the actual voltage drop on the ReRAM cell to compensate for the IR drop [8]. For an array size of 512 × 512, $V_{DD}$ is boosted from 1.35 to 2 V for read and from ±1.95 to ±3 V for write so that the cell farthest from the driver can be accessed successfully.
III. EFFECT OF VARIATIONS ON ReRAM CELL RESISTANCE
In this section, we show the effect of spatial variations or D2D variations (described in Section III-A) and temporal variations or C2C variations (described in Section III-B) on the resistance distribution of an ReRAM cell. 
A. Effect of D2D Variation on Resistance Distribution at Cell Level
We present LRS and HRS resistance distributions due to D2D variations for the HfO2 ReRAM device [18], as shown in Fig. 2. We run $10^6$ Monte Carlo simulations in MATLAB by varying the parameters of the compact device model [20] according to Table II. The variation parameters are chosen to match the experimental resistance distribution data in [12].
When the number of programming cycles (NPC) increases, the OFF/ON ratio (defined as $R_{HRS}/R_{LRS}$) shrinks, resulting in reliability degradation. We represent the OFF/ON ratio in terms of the mean OFF/ON ratio, which is the ratio of mean $R_{HRS}$ to mean $R_{LRS}$, and the tail-to-tail OFF/ON ratio, which is the ratio of the lowest $R_{HRS}$ to the largest $R_{LRS}$. We target an NPC of $10^6$, which is the lifetime of ReRAM that most previous papers have reported [1], [8]. For an NPC of $10^6$, the tail-to-tail OFF/ON ratio is chosen to be 3 based on the experimental data presented in [12]. The mean OFF/ON ratio depends on the SET and RESET pulse strengths and varies from 10 to 30 according to previous work [34] [35] [36]. Therefore, we set the mean OFF/ON ratio to 15, which is close to the average in log scale, $(10 \times 30)^{1/2}$.
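The two OFF/ON metrics defined above can be computed directly from resistance samples. The following minimal sketch assumes the samples are given as plain lists of resistances; the function name and the example values are hypothetical and are not taken from the measured data.

```python
from statistics import mean


def off_on_ratios(r_hrs, r_lrs):
    """Return (mean OFF/ON ratio, tail-to-tail OFF/ON ratio).

    mean ratio         = mean(R_HRS) / mean(R_LRS)
    tail-to-tail ratio = lowest R_HRS / largest R_LRS
    """
    mean_ratio = mean(r_hrs) / mean(r_lrs)
    tail_ratio = min(r_hrs) / max(r_lrs)
    return mean_ratio, tail_ratio


if __name__ == "__main__":
    # Hypothetical resistance samples in ohms, for illustration only.
    hrs = [150e3, 300e3, 450e3]
    lrs = [10e3, 20e3, 30e3]
    print(off_on_ratios(hrs, lrs))
```

A tail-to-tail ratio below 1 means the two distributions overlap, which is the condition that produces read errors later in the paper.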
B. Effect of C2C Variation on Resistance Distribution at Cell Level
The C2C variation is attributed to the stochastic nature of the oxygen vacancies/ions. Due to the randomness of oxygen vacancy generation and ion migration at the nanoscale, the shape of the CF varies from cycle to cycle even under the same programming condition [8].
In order to evaluate the effect of C2C variation on the ReRAM resistance distribution, we vary the parameters in Table II and simulate $10^6$ consecutive cycles, where each cycle consists of a SET followed by a RESET. We found that the errors due to C2C variations are dominated by SET failures and that these failures increase with NPC, as shown in Fig. 3. SET failures can be caused by a weak SET pulse or by a strong RESET pulse in the previous cycle (marked by the black dashed circle). SET failures due to a weak SET pulse can be recovered by a second SET operation. The remaining SET failures, after a second SET operation, are due to a strong RESET pulse.
We run Monte Carlo simulations and evaluate the BER due to continuous cycling of the ReRAM cell under different SET and RESET programming conditions. From Fig. 4, we see that a stronger SET voltage can be used to significantly reduce the SET failures. However, the reduction in BER comes at the expense of higher energy consumption because of the increased SET voltage. We pick a SET voltage of 1.95 V in this paper since a SET voltage larger than 1.95 V does not significantly reduce BER and yet incurs large energy consumption. In the rest of this paper, we use the following settings: $V_{SET}$ = 1.95 V and $\tau_{SET}$ = 5 ns for SET and
IV. ACCESS SCHEME WITH MULTIBIT PER READ/WRITE
Accessing multiple bits is possible by using the V/2 bias scheme [8]. Consider the N × N array shown in Fig. 5, where N is both the number of WLs and the number of BLs. We choose the V/2 bias scheme [8] because of its lower read/write energy consumption compared with the V/3 bias scheme [8] and the full scheme [8]. In the V/2 bias scheme, for the SET operation, all the selected WLs and BLs are set to "$V_{WRITE}$" and "0," respectively. For the RESET operation, the bias conditions on WL and BL are reversed to "0" and "$V_{WRITE}$" to enable bipolar switching. In both the SET and RESET operations, all the unselected WLs and BLs are set to "$V_{WRITE}/2$." In this way, the access voltage on the selected cell is "$V_{WRITE}$," the half-selected cells have a voltage drop of "$V_{WRITE}/2$," and the unselected cells ideally have no voltage drop. The bias condition for the read operation is similar to that for the SET operation, with $V_{READ}$ instead of $V_{WRITE}$.
Define a "group" as NB consecutive bits in a subarray, as shown in Fig. 5. An NB-bit group can be read simultaneously by using the V/2 bias scheme [8]. However, an NB-bit write takes two steps: all the 1s are simultaneously written into a subset of cells first, and then all the 0s are simultaneously written into the remaining cells in the group.
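The bias conditions and the two-step write can be illustrated with an idealized model that ignores IR drop and sneak currents, so that the voltage across a cell is simply the WL voltage minus the BL voltage. The sketch below is written under that assumption; all function names are hypothetical.

```python
def v2_bias(n, selected_wl, selected_bls, v, operation="SET"):
    """Return (wl_voltages, bl_voltages) for an n x n array under V/2 bias.

    SET and READ drive the selected WL to v and the selected BLs to 0;
    RESET reverses the polarity. All unselected lines sit at v / 2.
    """
    if operation in ("SET", "READ"):
        wl_sel, bl_sel = v, 0.0
    elif operation == "RESET":
        wl_sel, bl_sel = 0.0, v
    else:
        raise ValueError(f"unknown operation: {operation}")
    wl = [v / 2] * n
    bl = [v / 2] * n
    wl[selected_wl] = wl_sel
    for col in selected_bls:
        bl[col] = bl_sel
    return wl, bl


def cell_voltage(wl, bl, row, col):
    """Ideal voltage across the cell at (row, col), ignoring IR drop."""
    return wl[row] - bl[col]


def two_step_write(bits):
    """Split a group write into the columns written in step 1 (the 1s)
    and the columns written in step 2 (the 0s)."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    zeros = [i for i, b in enumerate(bits) if b == 0]
    return ones, zeros


if __name__ == "__main__":
    wl, bl = v2_bias(4, selected_wl=1, selected_bls=[0, 1], v=2.0)
    print(cell_voltage(wl, bl, 1, 0))  # selected cell
    print(cell_voltage(wl, bl, 1, 2))  # half-selected cell on the selected WL
    print(cell_voltage(wl, bl, 0, 2))  # unselected cell
    print(two_step_write([1, 0, 1, 1]))
```

In this idealized model a selected cell sees the full voltage, a half-selected cell sees half of it, and an unselected cell sees none, as stated in the text. Each of the two write steps would apply its own bias pattern to the column subset returned by `two_step_write`.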
In this section, we evaluate the ReRAM memory system using multibit per read/write scheme with respect to timing, energy-efficiency, and area overhead in Section IV-A. We analyze the effect of IR drop in Section IV-B. We evaluate the reliability and BER in Sections IV-C and IV-D, respectively.
A. Latency and Energy Evaluation
We evaluate a memory system with an I/O width of 64 bits in terms of area, energy consumption, and latency. We consider NB values of 1, 4, 8, 16, and 32. NB = 1 corresponds to the baseline system. Table III describes the area, read/write energy, and read/write latency for different values of NB. The number of active mats and the number of subarrays per mat are chosen such that the read/write latencies are comparable. In order to support multibit per read/write, the driver has to be larger than in the baseline case. Also, more sense amplifiers are required [16]. The driver size is obtained by setting the current constraint to 15 μA during SET for the cell that is farthest from the driver. The driver, based on the 22-nm PTM [24] transistor model, is a two-stage buffer [17]. The first stage has W/L = 1 for nMOS and pMOS. The W/L of the second stage for NB = 1, 4, 8, 16, and 32 is set to 2, 3, 4, 10, and 24, respectively.
From Table III , we see that energy saving is obtained by activating fewer mats and fewer subarrays per mat. First, for a given NB, the system with smaller number of active mats consumes lower energy; these are marked in bold in Table III . To better understand the reason behind this choice, consider the case when NB = 16. Since the maximum number of subarrays per mat is 8 [17] , we can choose between 1 mat with 4 subarrays or 2 mats with 2 subarrays per mat or 4 mats with 1 subarray per mat. The system with one active mat has 26.5% lower read energy consumption compared with the system with four active mats. Similarly, for NB = 8, the system with one active mat has 35.9% lower energy compared with the system with eight active mats. Therefore, we always choose the memory configuration with the smallest number of active mats. The number of active mats is 8, 2, 1, 1, 1 for NB = 1, 4, 8, 16, and 32, respectively.
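The choice of configuration described above can be reproduced by enumerating every way of spreading the required subarrays over mats. This is a minimal sketch assuming a 64-bit I/O width, at most 8 subarrays per mat, and NB values that divide 64; the function names are illustrative.

```python
IO_WIDTH = 64              # bits delivered per access
MAX_SUBARRAYS_PER_MAT = 8  # limit stated in the text


def mat_configs(nb):
    """List (active mats, subarrays per mat) pairs that deliver IO_WIDTH bits
    when each active subarray supplies nb bits."""
    total = IO_WIDTH // nb  # subarrays that must be active
    return [
        (mats, total // mats)
        for mats in range(1, total + 1)
        if total % mats == 0 and total // mats <= MAX_SUBARRAYS_PER_MAT
    ]


def fewest_mats(nb):
    """Number of active mats in the configuration preferred in the text."""
    return min(mat_configs(nb))[0]


if __name__ == "__main__":
    print(mat_configs(16))
    print([fewest_mats(nb) for nb in (1, 4, 8, 16, 32)])
```

For NB = 16 the enumeration yields the three options discussed in the text, and picking the fewest mats for each NB gives the sequence 8, 2, 1, 1, 1.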
Second, a system with smaller NB has to activate more subarrays at a time (to match the I/O width), resulting in higher energy. For example, the system with NB = 8 has 37%/31% lower read/write energy and the system with NB = 16 has 57%/50% lower read/write energy compared with the baseline system. Table III also shows that the area increases slightly with increasing NB. While the driver is larger and more sense amplifiers are used, the cell array dominates the total area, so the increase is small. Finally, all systems have comparable read/write latency (within 2% difference) as per the design requirements. The access latency increases slightly with increasing NB due to a slight increase in H-tree routing delay.
From this study, we conclude that while all systems have comparable timing performance, systems with smaller NB consume more energy. The system with NB = 32 has the lowest energy but unfortunately the largest area. In Section IV-B, we will show that the system with NB = 32 also suffers from severe reliability issues, making it an impractical choice for memory design.
B. IR Drop Analysis
During read/write operations, access voltage across the selected cell decreases with increasing distance from the driver. Fig. 6 shows the write voltage drop on every cell (in HRS) along the row. For an array size of 512 × 512, with NB = 1, the write voltage drop on the farthest cell from the driver is 99.5% of voltage drop on the nearest cell from the driver; only 0.5% voltage drop occurs in the interconnection wires. For the case when there are more bits per write, the voltage loss in the interconnection wires is larger. For instance, for NB = 32, the voltage loss in wires is 12%, incurring poor reliability for the cells far away from the driver. The voltage loss for NB = 4, 8, and 16 is less than 5%, which is acceptable. So in the rest of this paper, we focus on the lowest energy configurations for NB = 1, 4, 8, and 16.
Next, we show the voltage drop as a function of the location of the selected cell for write and read operations. Fig. 7 shows how the access voltage drop on the selected cells for NB = 8 and 16 decreases with increasing distance from the driver. For simplicity, we show the voltage drops of HRS and LRS for NB = 8 and 16 only; the trend is the same for other values of NB. From Fig. 7, we can see that: 1) larger NB results in larger voltage loss for both read and write in HRS as well as LRS cells and 2) for a given NB, the voltage loss after write is larger than that after read. This is because the write voltage is larger than the read voltage, and hence the voltage loss in the interconnect is higher.
C. Reliability Analysis
In order to evaluate the reliability of the memory system, we first derive the resistance distributions by considering the effect of the different variation sources, namely, D2D, C2C, and IR drop. To analyze the effect of D2D variation, C2C variation, and IR drop, we run $10^6$ Monte Carlo simulations in MATLAB and SPICE. To obtain the resistance distributions due to D2D and C2C variations, we use the variation parameters in Table I and run the simulations. We assume that all groups have the same D2D and C2C variations since these variations do not depend on the location of the device. To calculate the effect of only IR drop, we consider the mean value of resistance. To derive the combined effect of D2D, C2C, and IR drop, the resistance values are picked from the resistance distributions obtained using D2D and C2C variations, and the voltage drops at every location along the row of a 512 × 512 1S1R array are calculated using SPICE. The voltage drops are used to calculate the net resistance values, and these values are then used to derive the resistance distributions of each group.
Table IV first lists the effect of the different variations, namely, D2D, C2C, IR drop after write, and IR drop after read, one by one. All groups have the same mean OFF/ON ratio of 15 and tail-to-tail OFF/ON ratio of 3 due to D2D variations. The mean OFF/ON ratio and tail-to-tail OFF/ON ratio reduce to 6 and 1.5, respectively, due to consecutive cycling. IR drop causes the group farthest away from the driver to suffer from a significant reduction in mean OFF/ON ratio. Note that we list only the mean OFF/ON ratio since we only consider the mean value of $R_{LRS}$ and $R_{HRS}$ for each group. The last entry in Table IV lists the combined effect of D2D, C2C, and IR drop after write and read.
1) Resistance Distributions After Write: Fig. 8(a) shows resistance distributions of HRS and LRS caused by D2D, C2C, and IR drop after the write operation in an NB = 16 system. The group closest to the driver, i.e., Group 0, is marked in blue for LRS and red for HRS, and the group farthest from the driver, i.e., Group 31, is marked in green for LRS and yellow for HRS. From Fig. 8(a), we observe the following. First, the mean OFF/ON ratio of Group 31 shrinks from 6.3 to 4.5, and the tail-to-tail OFF/ON ratio shrinks from 1.5 to 1.2. This is because the voltage drop across the cells in Group 31 is small, and so these cells cannot switch to the correct resistance value like the cells in Group 0. Second, compared with the $R_{HRS}$ distribution, the $R_{LRS}$ distribution has a long tail; this is caused by C2C variation. Note that the long tail crossing into the neighboring state results in an error. Third, Group 31 has wider resistance distributions than Group 0 for both $R_{LRS}$ and $R_{HRS}$. The intragroup voltage loss of Group 31 is larger, resulting in larger BER due to C2C variations.
2) Resistance Distributions After Read: Fig. 8(b) shows resistance distributions of HRS and LRS of Groups 0 and 31 caused by D2D, C2C, and IR drop for an NB = 16 system after read. We find the following. First, mean $R_{LRS}$ increases by 29% while mean $R_{HRS}$ decreases by 12%. This is because during the read operation, there is less voltage drop on $R_{LRS}$ than on $R_{HRS}$, resulting in a larger shift of the LRS distribution due to the nonlinearity of the ReRAM. Second, the mean OFF/ON ratio of Group 31 shrinks from 6.3 to 5, and the tail-to-tail OFF/ON ratio shrinks from 1.5 to 1.4. However, the tail-to-tail OFF/ON ratio of Group 31 in Fig. 8(b) is larger than that in Fig. 8(a). This is because the cells in Group 31 suffer from larger IR drop after write (compared with after read) since the write voltage is larger and hence there is higher voltage loss in the interconnect after write than after read.
3) Resistance Distributions After Write and Read: Fig. 8 (c) shows resistance distributions of HRS and LRS of Groups 0 and 31 caused by D2D, C2C, and IR drop after write and read for an NB = 16 system. This corresponds to the last entry in Table IV . We find that compared with the distributions of Group 0, the mean OFF/ON ratio of Group 31 shrinks from 6 to 3 and tail-to-tail OFF/ON ratio of Group 31 is less than 1, resulting in errors. Therefore, Group 31 is highly prone to errors. 
D. Bit Error Rate Evaluation
We used MATLAB to build a simulation environment for calculating the BER of different read groups. The BER is calculated as the ratio of the number of failures to the total number of Monte Carlo simulations. There are two types of failures: SET failure and RESET failure. In our case, SET failures dominate since LRS distributions shift more than HRS distributions (as shown in Fig. 8). Let SET failure be defined by $R_{LRS} > R_{th}$, where $R_{th}$ is $10^5$.
A group consists of NB bits, and the nth read/write group consists of bits NB · n to NB · (n + 1) − 1, where n varies from 0 to 512/NB − 1. We present the error performance in terms of group BER, defined as the highest BER of the NB consecutive bits that form a group. For example, for NB = 8, the group BER of Group 63 is $1.5 \times 10^{-6}$, which is also the BER of the cell farthest from the driver. The BERs of the 64 groups with NB = 8 are shown in Fig. 9(a), and the BERs of the 32 groups with NB = 16 are shown in Fig. 9(b). We see that BER increases as the group number increases, as expected. For NB = 8, the BER of Group 63 is the highest and is 100 times higher than that of Group 0. For larger NB, the variation in BER across the groups is larger. This is because a system with larger NB suffers from higher IR drop than a system with smaller NB. For instance, for NB = 16, the BER of Group 31 is 2000 times higher than that of Group 0. Thus, an ECC scheme that is designed to handle errors in Group 31 is overkill for groups that are closer to the driver, such as Group 0. Also note that with SR, the BER is one order of magnitude lower than that of the naïve multibit access scheme for both NB = 8 and 16, thereby lowering the ECC requirement.
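The group indexing and the two BER definitions used in this subsection can be written down compactly. The sketch below assumes a 512-bit row and that the per-bit BER values or resistance samples are already available from simulation; the function names are hypothetical, and the default threshold of $10^5$ is the $R_{th}$ value quoted above.

```python
ROW_BITS = 512  # bits per row of a 512 x 512 subarray


def group_bits(n, nb):
    """Bit positions NB*n .. NB*(n+1)-1 that form read/write group n."""
    num_groups = ROW_BITS // nb
    if not 0 <= n < num_groups:
        raise IndexError(f"group {n} out of range for NB = {nb}")
    return range(nb * n, nb * (n + 1))


def group_ber(per_bit_ber, n, nb):
    """Group BER: the highest BER among the NB bits of group n."""
    return max(per_bit_ber[i] for i in group_bits(n, nb))


def set_failure_ber(r_lrs_samples, r_th=1e5):
    """Monte Carlo BER estimate: fraction of LRS samples with R_LRS > R_th."""
    failures = sum(1 for r in r_lrs_samples if r > r_th)
    return failures / len(r_lrs_samples)


if __name__ == "__main__":
    print(list(group_bits(63, 8)))
    # Hypothetical LRS resistance samples, for illustration only.
    print(set_failure_ber([1e4, 2e5, 5e4, 3e5]))
```

For NB = 8, Group 63 covers bits 504 through 511, so its group BER is set by the bit farthest from the driver, consistent with the example in the text.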
E. Write Disturbance and Read Disturbance
In this paper, we do not consider write disturbance. The voltage drops on half-selected and unselected cells are ideally V /2 and 0, which are smaller than the threshold of FAST selector. The OFF leakage (∼fA) of FAST selector [19] is so small that voltage drop on ReRAM can be ignored, resulting in immunity to write disturbance.
As for read disturbance, the cell with the highest read disturbance is the one that is closest to the driver. We find that these cells would suffer from read disturbance (BER = $10^{-5}$) only after $10^5$ consecutive read operations.
Thus, read disturbance is unlikely to happen since the read/write ratio in memory applications is often around 10, and so new data are written into a cell long before any read disturbance can occur. So in the rest of this paper, we do not take write disturbance and read disturbance into consideration.
V. ROTATED MULTIARRAY ACCESS: A SYSTEM-LEVEL APPROACH
From Section IV, we see that multibit groups that are farther away from the driver have a higher loss in voltage, resulting in incomplete read/write operations and hence poor reliability. Thus, if the data are striped across multiple subarrays, then the worst case scenario occurs when, in each subarray, the group that is farthest away from the driver is read. While the errors can be corrected by a strong BCH scheme, the area overhead due to larger parity storage is significant. To reduce the cost of ECC, we propose a new RMA scheme where the multibit groups are located in different positions in each subarray.
A. ECC schemes
In order to make the cross-point ReRAM system reliable, ECC is always designed for the worst case (such as Group 63 for NB = 8 or Group 31 for NB = 16), resulting in overdesign for the rest of the groups. Here, we use BFR as the reliability metric and set a constraint of BFR = $10^{-10}$, which corresponds to a lifetime of 10 years [21]. We derive the BFR from BER by using the following equation [26]:
$$\mathrm{BFR} = \sum_{i=t+1}^{n} \binom{n}{i}\, \mathrm{BER}^{i}\, (1 - \mathrm{BER})^{n-i} \qquad (1)$$
where BER is the input to the ECC, t is the correction strength of the BCH code, and n is the block size, which includes the 512-bit information and 10t-bit parity. For instance, if the number of information bits is 512 and t = 7, then n = 512 + 7 × 10 = 582 bits. We employ a BCH code in this paper since BCH has a lower code rate (= parity bits/codeword bits) compared with a Reed-Solomon (RS) code for the same BFR. For example, if BER is $3.1 \times 10^{-4}$, to obtain a BFR of $10^{-10}$, a BCH t = 7 code with a rate of 70/582 = 12% is required compared with an RS t = 6 code with a rate of 96/608 = 16%.
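The relation between BER, correction strength, and BFR can be explored numerically. The sketch below assumes independent bit errors, so that a block fails when more than t of its n = 512 + 10t bits are in error (a binomial tail); the function names are hypothetical, and the search over t is an illustration rather than the authors' design flow.

```python
from math import comb

INFO_BITS = 512         # information bits per block
PARITY_BITS_PER_T = 10  # parity bits added per unit of correction strength


def block_failure_rate(ber, t, info_bits=INFO_BITS):
    """Probability that more than t of the n block bits are in error,
    assuming independent bit errors with probability ber."""
    n = info_bits + PARITY_BITS_PER_T * t
    # Sum the tail directly to avoid the cancellation in 1 - P(at most t errors).
    return sum(
        comb(n, i) * ber**i * (1.0 - ber) ** (n - i)
        for i in range(t + 1, n + 1)
    )


def min_correction_strength(ber, target_bfr=1e-10, t_max=64):
    """Smallest t whose block failure rate meets the target."""
    for t in range(t_max + 1):
        if block_failure_rate(ber, t) <= target_bfr:
            return t
    raise ValueError("no t up to t_max meets the target BFR")


if __name__ == "__main__":
    for ber in (6.4e-6, 3.1e-4, 2.21e-3):
        print(ber, min_correction_strength(ber))
```

Under this assumption, the search returns t = 7 for a BER of $3.1 \times 10^{-4}$, in agreement with the example above.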
B. Rotated Multiarray Access Scheme
In a memory system where the I/O width is 64 bits, a data line of 512 bits is read in 512/64 = 8 beats. Each beat here is defined as one clock tick, as in commodity DRAM systems. So in each beat, 64/NB groups from 64/NB subarrays are accessed (1 group per subarray), and in each subarray, 8 groups are accessed in 8 beats. In a conventional scheme, groups at the same location in different subarrays are read. The worst case scenario corresponds to the case when the same set of 8 groups that are farthest away from the driver is read from all subarrays over 8 beats. For example, for NB = 16, the worst case is when Groups 24 through 31 are read from all subarrays. For such a case, the BER is $2.21 \times 10^{-3}$ and a strong ECC (BCH with t = 14) is required to guarantee a BFR of $10^{-10}$. The best case scenario corresponds to the case when Groups 0 through 7 are read from all subarrays. Since the BER is only $6.4 \times 10^{-6}$ for this case, BCH with t = 3 would have been sufficient.
Since the data line size is 512 bits and the I/O width is 64 bits, a total of NG groups, where NG = 512/NB, are accessed in 512/64 = 8 beats to obtain 512-bit data. In every beat, M groups are read out from M subarrays to obtain 64-bit data, where M = 64/NB. Note that these M subarrays could be activated in one mat (when NB ≥ 8, so that M ≤ 8) or in multiple mats (when NB ≤ 4), since a mat contains at most 8 subarrays.
To avoid the larger BER difference between the best case and worst case scenarios, we propose to access the NG groups located in NG different positions across the M subarrays. We refer to this scheme as RMA scheme. An important feature of this access scheme is that all data accessed from multiple subarrays have the same error characteristics. Moreover, the resulting BER is lower than the conventional multibit access scheme. Thus, a lower cost BCH code can be used to achieve the same level of reliability resulting in lower area and energy overhead.
A high-level diagram of the RMA scheme is shown in Fig. 10. In the kth beat, one group from each subarray is read out, namely, Group j mod NG from subarray 0, Group (j + 1) mod NG from subarray 1, Group (j + 2) mod NG from subarray 2, and so on up to Group (j + M − 1) mod NG from subarray M − 1, where 0 ≤ j ≤ NG − 1 and k is the beat number that goes from 0 to 7. Thus, after 8 beats, NG groups (Group 0 to Group NG − 1) are read out from different physical locations in the M subarrays. The BER for the 512 bits that were read out in this way is $3.1 \times 10^{-4}$, which is almost one order of magnitude lower than that of the naïve scheme.
An alternate scheme that also reads from different groups residing in different physical locations across M subarrays accesses Groups j mod NG through (j + 7) mod NG from subarray 0, Groups (j + 8) mod NG through (j + 15) mod NG from subarray 1, Groups (j + 16) mod NG through (j + 23) mod NG from subarray 2, and so on up to Groups (j + 8M − 8) mod NG through (j + 8M − 1) mod NG from subarray M − 1. Both schemes have the same BER characteristics and comparable routing overhead. Finally, for the case when consecutive BLs share a sense amplifier, bit-interleaving can be employed on top of RMA, resulting in lower routing complexity.
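Both group-to-subarray mappings described above can be sketched as schedules that list, for each beat, which group position is read from which subarray. The text does not spell out how the offset j advances from beat to beat; the sketch assumes j = base + k · M in the kth beat for the first scheme, which is one choice that makes the 8 beats cover all NG group positions. The function names and the `base` parameter are hypothetical.

```python
IO_WIDTH = 64                   # bits per beat
LINE_BITS = 512                 # bits per data line
BEATS = LINE_BITS // IO_WIDTH   # 8 beats per data line


def rma_schedule(nb, base=0):
    """Per-beat list of (subarray, group) pairs for the rotated scheme.

    Assumption: the offset j advances by M every beat (j = base + k * M).
    """
    ng = LINE_BITS // nb  # number of group positions per row
    m = IO_WIDTH // nb    # subarrays read in each beat
    schedule = []
    for k in range(BEATS):
        j = base + k * m
        schedule.append([(s, (j + s) % ng) for s in range(m)])
    return schedule


def rma_alternate_schedule(nb, base=0):
    """Alternate scheme: subarray s supplies 8 consecutive group positions,
    base + 8s .. base + 8s + 7 (mod NG), one per beat."""
    ng = LINE_BITS // nb
    m = IO_WIDTH // nb
    return [
        [(s, (base + BEATS * s + k) % ng) for s in range(m)]
        for k in range(BEATS)
    ]


if __name__ == "__main__":
    for beat, reads in enumerate(rma_schedule(16)):
        print(beat, reads)
```

In both schedules every one of the NG group positions is visited exactly once per data line, which is why each data line sees the average BER over all positions rather than the BER of the farthest groups.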
C. Evaluation
Table V compares the area, read/write energy, and latency for the different configurations. It also lists the BER and the BCH code that is required to guarantee a BFR of 10⁻¹⁰. The BER for different groups is obtained by Monte Carlo simulations in MATLAB and is shown in Fig. 9. The conventional system with NB bits per access does not implement SR or the RMA scheme. The baseline system is the conventional system with NB = 1. The BER for the baseline system is the BER of the rightmost bit. The BER for conventional systems with NB > 1 is the average BER among the eight rightmost groups, Groups NG − 8 to NG − 1. The system with SR has one order of magnitude lower BER than the conventional system (see Fig. 9). The BER for the proposed system with the RMA scheme is calculated by taking the average BER among all groups and is thus an order of magnitude lower.
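The two averaging rules can be written down directly. The sketch below assumes a list of per-group BER values ordered from Group 0 to Group NG − 1. The profile used in the example is a placeholder with an arbitrary shape, not the Fig. 9 data, and all names are illustrative.

```python
def conventional_ber(group_ber):
    """Average BER of the eight rightmost groups, Groups NG-8 to NG-1."""
    worst = group_ber[-8:]
    return sum(worst) / len(worst)


def rma_ber(group_ber):
    """Average BER over all NG groups, as used for the RMA scheme."""
    return sum(group_ber) / len(group_ber)


# PLACEHOLDER profile: BER growing with group index. These values are made up
# purely to exercise the functions and do not come from the paper.
NG = 32
placeholder_profile = [1e-5 * 1.2 ** g for g in range(NG)]

ratio = conventional_ber(placeholder_profile) / rma_ber(placeholder_profile)
print(f"worst-eight average / all-group average = {ratio:.2f}")
```

Whenever the BER rises toward the rightmost groups, the all-group average is lower than the worst-eight average; the size of the gap depends entirely on the measured profile.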
Table V also lists the required BCH code for each system calculated by (1) and the corresponding area overhead and decoding latency of the ECC unit obtained from [27]. The area and delay of a BCH implementation depend on the value of t. For instance, BCH codes with t = 4, 7, and 14 have decoding circuit areas of 0.06, 0.08, and 0.1 mm² and delays of 2.3, 3.4, and 7.7 ns, respectively. Thus, the decoding circuit area is quite small (<0.5% of total area) and can be ignored. Use of a BCH code with small t results in low parity storage. For instance, the baseline system requires a BCH t = 4 code and has parity storage of 7.2%. In contrast, the conventional NB = 16 system requires BCH t = 14 and has parity storage of 21.5%.
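A minimal sketch of how the code strength and parity storage might be related to the BER follows. Equation (1) is not reproduced in this section, so the sketch assumes independent bit errors (a binomial tail beyond t errors) and 10 parity bits per corrected error, that is, a shortened BCH code over GF(2¹⁰) protecting 512 data bits. Both are assumptions, chosen because they reproduce the 7.2% and 21.5% parity figures quoted above; the function names are illustrative.

```python
import math

DATA_BITS = 512
PARITY_BITS_PER_ERROR = 10  # ASSUMPTION: BCH over GF(2^10), 10 parity bits per unit of t


def block_failure_rate(ber, t):
    """P(more than t bit errors in one codeword), assuming independent errors."""
    n = DATA_BITS + PARITY_BITS_PER_ERROR * t
    log_p, log_q = math.log(ber), math.log1p(-ber)
    total = 0.0
    for i in range(t + 1, n + 1):
        # Binomial term evaluated in the log domain to avoid overflow.
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total


def min_bch_t(ber, target=1e-10, t_max=51):
    """Smallest t whose block failure rate meets the target (n stays <= 1023)."""
    for t in range(1, t_max + 1):
        if block_failure_rate(ber, t) <= target:
            return t
    raise ValueError("no t up to t_max meets the target")


def parity_fraction(t):
    """Parity bits as a fraction of the whole codeword."""
    parity = PARITY_BITS_PER_ERROR * t
    return parity / (DATA_BITS + parity)


print(min_bch_t(3.1e-4))
print(f"{parity_fraction(4):.1%}, {parity_fraction(14):.1%}")
```

Under these assumptions, a BER of 3.1 × 10⁻⁴ needs t = 7 to reach a BFR of 10⁻¹⁰, and the parity fractions for t = 4 and t = 14 come out at 7.2% and 21.5%.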
The total memory area includes the area of cell array, peripheral circuits, parity storage, and ECC unit. For the proposed system with NB = 16, the breakdown is cell array area of 17.2 mm², peripheral circuits area of 1.05 mm², parity storage area of 1.35 mm², and ECC area of 0.08 mm². Energy consumption and latency are estimated by NVSim. These correspond to read/write of 512-bit data. The read latency here includes the latency of the syndrome calculation (0.5 ns), which is very small compared with the data read latency. The write latency does not include the encoding latency since it can always be hidden in the pipeline.
All systems have comparable timing performance, which depends on read latency. Note that write latency has little effect on timing performance since it can be hidden by the use of multilevel caches [28]. We evaluate all systems by weighing two metrics: area overhead and energy consumption. To achieve the same lifetime (BFR of 10⁻¹⁰) across the different systems, different strengths of ECC are employed. Conventional systems with larger NB suffer from reliability issues and hence require stronger ECC, thereby incurring larger parity storage and higher memory area. Compared with the baseline, the conventional scheme with NB = 8 improves energy-efficiency for read (write) by about 34% (27%) at the price of 2% area overhead. In contrast, although the conventional system with NB = 16 has lower read (write) energy by 46.6% (47.8%) compared with the baseline scheme, it has 25.3% extra area overhead, which is unacceptable.
For NB = 16, either circuit-level optimization (SR) or the system-level RMA scheme relaxes the ECC requirement from BCH t = 14 to BCH t = 7. The system with SR has higher energy and lower performance than the system with the RMA scheme, so a system with SR alone is not considered further. The candidate system with SR at the circuit level and the RMA scheme at the system level requires a BCH t = 4 code instead of a BCH t = 14 code. Use of a smaller code helps to reduce the area and read/write energy due to lower parity storage compared with the conventional NB = 16 system.
Fig. 11 shows the memory area and energy of the different systems based on a read/write ratio of 10. As shown in Fig. 11(a), the memory area increases with increasing NB values. This is because the system with larger NB has lower reliability and hence requires stronger ECC to maintain a BFR of 10⁻¹⁰. The area differs from system to system due to additional parity storage and peripheral circuits. For example, compared with the baseline system, the conventional system with NB = 16 has 25.3% higher area due to the use of BCH t = 14 ECC. Circuit-level optimization (SR) and the system-level RMA scheme help systems with multibit per access (NB > 1) to maintain the same reliability with little additional area. For example, compared with the baseline system, the proposed system with NB = 16 has only a 2% area penalty.
Fig. 11(b) compares the energy consumption of the different systems. The energy decreases with increasing NB. With the multilayer techniques, the energy consumption reduces only slightly for systems with NB = 4 and 8, whereas for NB = 16 it reduces by 59%. After weighing the two metrics, area and energy-efficiency, the proposed ReRAM system (NB = 16) with the multilayer technique is the best option. It has the lowest energy consumption, only 41% of that of the baseline system, with only a 2% area penalty.
VI. RELATED WORK
Existing work on 1S1R cross-point memory focuses mostly on the selector design to achieve significant reduction in the half-write current [9]–[15] or on increasing the nonlinearity of the ReRAM cell to minimize the IR drop and the effect of sneak paths [10]–[13], [19]. At the array level, strategies to partition large arrays into multiple smaller subarrays to increase the overall read/write performance have been proposed in [29] and [30]. Multilevel design of ReRAM spanning array, bank, and chip levels is proposed in [29]. The reliability study in [30] is based on read noise margin of sense amplifier and does not take into account errors in the ReRAM cell. Also, work in [29] and [30] evaluates the reliability based on the worst case scenario which is dictated by the cell located farthest away from the driver. However, in their evaluation, the variability sources, such as those due to D2D only, C2C, and IR drop, have not been considered, resulting in inaccurate estimation of reliability.
Also, most existing 1S1R array systems are based on a single bit per read/write per subarray [10]–[14]. In order to reduce latency, multiple subarrays have to be activated, resulting in high energy consumption. A multibit per access scheme has been suggested to improve the energy-efficiency in [15] and [16]. It has been shown that the driving current requirement and corresponding area overhead for each word line in a multibit per access scheme are much larger than those of a single-bit per access scheme. However, the focus has mostly been on the design of the peripheral circuits such as drivers and sense amplifiers to support multibit per access; reliability issues due to a multibit per access scheme have not been considered. In contrast, this paper is a comprehensive study of energy, latency, and reliability of a 1S1R cross-point array architecture with multibit per access.
Another competitive ReRAM technology is based on 1T1R. The 1T1R ReRAM cell has the same density as a 1T1C DRAM cell, featuring 6F² cell area (where F is the lithography technology node), and does not have the sneak path current problem of the cross-point array. At the cell level, prior work for 1T1R focuses on fabrication procedure as well as retention and endurance [31]–[33]. At the circuit level, related work [34]–[36] shows the effect of different programming conditions on endurance. At the system level, our previous work shows how the voltage settings (pulse amplitude and pulse width) of the WL, source-line, and BL can be used to lower latency, lower power, and improve reliability [21], [28].
VII. CONCLUSION
In this paper, we propose a multilayer technique to improve energy-efficiency and reliability of ReRAM cross-point systems with minimum area and latency overhead. At the cell level, we find that the errors due to temporal variations are dominated by SET failures, which can be significantly reduced by SR. In contrast to existing systems which are based on single bit per read/write, we propose to use multibit per read/write. At the array level, we show that the system with multibit per read/write has very high energy-efficiency but lower reliability due to voltage loss in interconnect wires. We study the resistance distributions due to different variation sources and evaluate the corresponding BER. Since the BER of a multibit group that is far away from the driver is much higher than that of a group near the driver, the ECC has to be designed for the worst case scenario when the data access only includes groups that are far away from the driver. We therefore propose the RMA scheme, a new data access scheme where the data are striped across multiple arrays such that the constituent multibit groups are located in different positions in each subarray. We show that if the group size is 16 bit, then the RMA scheme-based system can reach a BFR of 10⁻¹⁰ by using a BCH t = 4 code instead of the BCH t = 14 code that is needed for the naïve multibit access scheme. Simulation results using NVSim show that the proposed scheme for a ReRAM system with multibit per read/write outperforms a system with single bit per read/write in terms of energy while maintaining latency and reliability with only a small area overhead.
