Abstract-Transistor-based memories are rapidly approaching their maximum density per unit area. Resistive crossbar arrays enable denser memory due to the small size of switching devices. However, due to the resistive nature of these memories, they suffer from sneak current paths, which complicate the readout procedure. In this paper, we propose a row readout technique with circuitry that can be used to read selector-less resistive crossbar based memories. High throughput reading and writing techniques are needed to overcome the memory-wall bottleneck and to enable the near-memory computing paradigm. The proposed technique can read an entire row of a dense crossbar array in one cycle, unlike previously published techniques. The requirements for the readout circuitry are discussed and satisfied in the proposed circuit. Additionally, an approximate expression for the power consumed while reading the array is derived. A figure of merit is defined and used to compare the proposed approach with existing reading techniques. Finally, a quantitative analysis of the effect of biasing mismatch on the array size is presented.
I. INTRODUCTION
Over the last decade, emerging nonvolatile memories (NVMs), such as phase change memory (PCRAM), ferroelectric memory (FeRAM), spin transfer torque magnetic memory (STT-MRAM), and resistive memory (RRAM), have shown high potential as alternatives to floating-gate-based nonvolatile memories [1]. RRAMs are considered the best candidate for the next generation of nonvolatile memory due to their high reliability, fast access speed, multilevel capabilities, and stackability, which enables 3D memory architectures [2]. To achieve higher density, the switching devices alone are sandwiched between the crossbar metal layers, without access devices such as transistors, diodes, or selectors [3]. In some cases, the switching device itself has an exponential characteristic (the selector is embedded inside the switching device), as in the FAST selector [4]. The main drawback of selector-less (gate-less) crossbar-based memories is the sneak path effect, which limits the readability of the array. Conventional reading approaches for selector-less crossbars suffer from sneak path loading, which makes reading the data very difficult and at times impossible. Writing, on the other hand, is done by accessing one device at a time using one of two bias schemes: the 1/2 bias scheme or the 1/3 bias scheme [5].
The sneak path problem arises because there are many paths from the inputs to the outputs. Figure 1a shows the sneak path in a 2 × 2 crossbar array. The sneak path current is added to the main path current, which disturbs the cell reading. Figure 1b shows the cumulative probability of the readout current for a 512 × 512 crossbar array with LRS = 1 MΩ, HRS = 1 GΩ, and 10 Ω line resistance. The readout currents corresponding to LRS and HRS overlap completely. As a result, it is impossible to find a threshold that differentiates between the two states, even with very large switching resistance values. Recently, different readout techniques have been proposed to address this problem in high-density arrays using two main procedures: (a) reading and writing the stored data multiple times, as in [6], [7], which requires an analog-to-digital converter (ADC) as well as registers and comparators; and (b) dispersing predefined dummy cells in the array to assist in reading the data, as in [8], [9]. These dummy cells must be initialized first, so reading a given cell takes more than one cycle and relies on a locality property, an ADC, and comparators, all of which limit the applicability of the technique. Other techniques read the data in parallel by adding sensing resistors to the bitlines and sensing the voltage across them [10], which loads the crossbar and does not mitigate the sneak path effects. Yet another approach is to keep the bitlines floating and sense the bitline voltages, as discussed in [11]. The floating-bitline scheme can read array sizes up to 128 × 128 without errors. Both the resistive-load and floating-bitline reading schemes are suitable only for small arrays.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
In this brief, a sneak path mitigation readout technique for high density 3D resistive memories is proposed in addition to the required peripheral readout circuitry.
II. PROPOSED READOUT TECHNIQUE
Our target is to design circuitry that can be used for reading stacked resistive crossbar arrays. As is clear from previously published techniques, the sneak path problem appears due to the existence of many paths between the inputs and the outputs. To avoid this problem, we present a simple solution that mitigates the sneak path problem and maximizes throughput. Consider the case where all the crossbar array input and output ports are biased to a certain bias voltage, V B , as shown in Fig. 2a. For simplicity, assume that the input and output ports are grounded. Consequently, no current flows across the crossbar array. However, when a V DD signal is applied to one of the input rows, current flows from this input to all the outputs, and no current flows through the other rows because the voltage drop across their devices is zero, as shown in Fig. 2b. The current is absorbed by the sensing circuit. This current is inversely proportional to the resistance of the switching device. In this way, the sneak current paths are eliminated. By reading this current, it is easy to distinguish between the low resistance and high resistance states. The architecture is inherently parallel: all the data in a row can be accessed in parallel, enabling high throughput applications.
Memory Architecture: To read the (i, j) cell, a V DD signal is applied to the i th row and the current of output port number j is sensed, where 1 ≤ i ≤ M and 1 ≤ j ≤ N . This path can be modeled as a resistor, R m , between the input and output ports, which is either LRS or HRS. The maximum and minimum resistance can be used to determine the sensitivity of the sensing circuit, as will be discussed later. It is worth noting that this technique can be generalized to any crossbar size since the sneak current resulting from multiple paths is mitigated by this reading technique. Figure 3 shows the full schematic of the single-layer crossbar array.
The crossbar inputs are connected to the read decoder which selects only one input according to the input address. The selected line is biased by V DD and unselected lines are biased to V B . Grounding the lines is a special case of the general bias voltage V B . Furthermore, if smaller word-lengths are needed, banking can be applied as shown in the figure, where each bank has size of M × n and the total number of banks is R = N/n. Consequently, one row per bank is read each clock cycle which means n readout circuits are needed. In order to choose between the banks, an analog mux is necessary which can easily be constructed using switches. Note that the bitlines of all unselected banks should be biased to V B to guarantee the aforementioned scenario. The outputs of the analog mux are connected directly to the reading circuits which bias the selected bank to V B and sense the current.
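The biasing scheme described above can be illustrated with a minimal sketch of an ideal crossbar, assuming zero wire resistance and linear devices. The function names, the 8 × 8 array size, the default voltages, the geometric-mean threshold, and the convention that a stored 1 corresponds to LRS are all assumptions made for illustration; the LRS and HRS values are the ones quoted for Fig. 1b.

```python
import math
import random

LRS, HRS = 1e6, 1e9  # ohms, the values quoted for Fig. 1b


def read_row(resistance, row, v_dd=1.0, v_b=0.0):
    """Bitline currents of an ideal crossbar (no wire resistance).

    The selected wordline is driven to v_dd, every other wordline and
    every bitline is held at v_b, so only devices on the selected row
    see a nonzero voltage drop.
    """
    n_cols = len(resistance[0])
    currents = [0.0] * n_cols
    for i, devices in enumerate(resistance):
        v_word = v_dd if i == row else v_b
        for j, r in enumerate(devices):
            currents[j] += (v_word - v_b) / r
    return currents


def decode(currents, v_dd=1.0, v_b=0.0):
    """Threshold at the geometric mean of the LRS and HRS currents (assumed)."""
    threshold = (v_dd - v_b) / math.sqrt(LRS * HRS)
    return [1 if i > threshold else 0 for i in currents]


random.seed(0)
bits = [[random.randint(0, 1) for _ in range(8)] for _ in range(8)]
# Assumed convention: a stored 1 is LRS, a stored 0 is HRS.
resistance = [[LRS if b else HRS for b in row] for row in bits]

row_currents = read_row(resistance, row=3)
decoded = decode(row_currents)
print(decoded == bits[3])
```

In this idealized model the unselected rows contribute exactly zero current, so each bitline current depends only on the selected cell. The wire-resistance effects discussed next are not captured by this sketch.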
Effect of Wire Resistance: Wire resistance is inevitable in such crossbar arrays. It affects the reading technique by making the voltage across the selected devices less than V DD − V B by a factor that is a function of the stored data and cell location. Due to the random nature of the data, it is hard to estimate this factor analytically. Figure 4 shows the sensed current density of each cell in a 512 × 512 array, in addition to the histogram of the sensed current for both linear and nonlinear devices with 10 Ω wire resistance and 10% resistance variations in both states. Generally, the wire resistance creates leakage paths from the selected wordline to the unselected wordlines. However, even with these input-to-input leakage paths, the technique is still able to distinguish between LRS and HRS with a wide current range for linear devices. On the other hand, the nonlinear devices have an exponential voltage dependency modeled as I = k × sinh(aV ), where V is the voltage across the resistive device and k and a are the fitting parameters [8]. Using such devices improves the sensing current range due to the high resistance facing the leakage currents in the input ports.
III. PROPOSED CURRENT SENSING CIRCUITRY
The readout circuit that senses the output current of the crossbar should satisfy the following specifications: 1) the sensing terminal has a fixed bias voltage, V B , and 2) the range of sensed current needs to be identified. One way to satisfy these requirements is to use the current conveyor concept, where the voltage applied to port 1 is mirrored to port 2, while the input current to port 2 is mirrored to port 3, as shown in Fig. 5a [12] (see supplementary material S1 for more details about the current conveyor working principle). Port 1 can be biased to V B , which is mirrored to port 2, and the input current to port 2 is around (V DD − V B )/R m,i,j , where R m,i,j is the (i, j) cell resistance, which is either LRS or HRS. At this point, it is easy to distinguish between the LRS and HRS by designing a suitable readout circuit.
The continuous behavior of RRAMs enables storing multi-level data. The proposed reading technique can also be used to read multi-level resistive memories since it reads only the selected cell resistance; this requires readout circuitry that differentiates between the states. In this work, we focus only on reading binary resistive memories. The proposed circuit is divided into two parts: 1) a current sensing circuit, which meets the aforementioned specifications and works as a transimpedance amplifier, and 2) a latched comparator, which distinguishes between the two states and latches the data.
A. The Proposed Current Sensing Circuit
The proposed circuit is based on the current conveyor concept [12], [13], as shown in Fig. 5b. The voltage V B is the biasing voltage and can take any value as long as M 1 and M 2 are kept in the saturation region. M 2 and M 4 are designed so that the current passing through M 2 is mirrored to M 4 . Consequently, the current passing through M 1 is equal to the current passing through M 3 . Assuming that all transistors are in the saturation region, it can be proven that V in = V B (supplementary material S3.1). Thus, the first requirement of the readout circuit is satisfied.
By applying KCL at the input node, the current passing through M 5 is I 5 = I 11 + I in − I 3 . This current is mirrored through M 6 to the output node and driven into the load resistance, creating the output voltage V o = V DD − (I 11 + I in − I 3 )R L , where I 11 is mirrored from I 10 and equals αI ref , with α the ratio between the transistor aspect ratios. I 3 is a constant current and equals I 1 due to the current mirror effect. The analysis for the value of I 1 can be found in the supplementary material. Consequently, the output voltage is
$$V_o = V_{DD} - \left(\alpha I_{ref} + I_{in} - I_1\right) R_L .$$
Thus, we have two outputs, V oh and V ol corresponding to HRS and LRS, respectively. It is necessary to widen the difference between V ol and V oh to easily distinguish between the states, which is possible by controlling the values of R L , α and I 1 . This circuit can be followed by either a buffer or a latch circuit so that the output swings between V DD and V ss . To avoid having a large loading resistor occupying a large area, with limited output voltage range, a latched comparator is necessary.
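As a worked consequence of the expression V o = V DD − (I 11 + I in − I 3 )R L , the separation between the two output levels can be sketched as follows. The sketch assumes that I 11 = αI ref and I 3 = I 1 do not depend on the stored state and that the input current is I in ≈ (V DD − V B )/R m with wire resistance ignored; these idealizations are ours, not a result taken from the circuit analysis.

```latex
\begin{aligned}
V_{oh} &= V_{DD} - \left(\alpha I_{ref} + \frac{V_{DD}-V_B}{HRS} - I_1\right) R_L,\\
V_{ol} &= V_{DD} - \left(\alpha I_{ref} + \frac{V_{DD}-V_B}{LRS} - I_1\right) R_L,\\
V_{oh} - V_{ol} &= R_L \,(V_{DD}-V_B)\left(\frac{1}{LRS} - \frac{1}{HRS}\right).
\end{aligned}
```

In this idealized form, R L scales the separation between the two levels, while αI ref and I 1 shift both levels together, which matters for keeping V oh and V ol inside the available output range.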
B. Latched Readout Circuit
Instead of reading the output current from the loading resistor, this current can be connected directly to a latched comparator, as shown in Fig. 6. The gate voltage of M 5 changes depending on the current passing through it, which is mirrored in M 6 . The mirrored current is compared with the constant current generated by M 7 due to the constant voltage V c . The M k transistors reduce kickback noise, in which the latch signal couples back to the input signal and may alter the data. This latched comparator is introduced and analyzed in [14]. In the reset case, both outputs are equal to V DD and the output of the XOR gate is zero.
On the other hand, in the set case, V Latch = V DD . Fig. 7 illustrates an example transient simulation of reading random data from a certain column using the proposed circuit. The readout circuit is designed to satisfy the aforementioned conditions (see supplementary material S2.3 for circuit values). The circuit is implemented in TSMC 65 nm technology. The area of the entire readout circuitry is about 55 µm² with 1.92 µW total power consumption at a 1.2 V supply.
IV. DISCUSSION AND COMPARISON

A. Power Consumption Estimation
Power consumption is critical in resistive memories since they consume considerable power during reading and writing due to their inherent resistive nature. Thus, it is essential to estimate the power consumption inside the crossbar array as well as the power density. The device nonlinearity strongly affects the power consumption: the higher the nonlinearity, the higher the resistance, which implies less power.
In case of linear devices, where the resistive device states are constant and not a function of the applied voltage, the power consumption for reading one wordline (for both with and without banking) can be approximated by multiplying the voltage drop times the input current, ignoring wire resistance,
. Thus, the maximum and minimum power consumption are around M N R (V DD − V B ) 2 /LRS and M N R (V DD − V B ) 2 /HRS, respectively. On the other hand, in the case of nonlinear devices, the voltage across the switching device is still around V DD − V B . Thus, the power consumption per wordline is
) by ignoring wire resistance. Figure 8 shows the reading power consumption with and without the line resistance for different biasing voltages and different array sizes. The figure is plotted for nonlinear switching devices, with k on , k off and a equal to 1e−8, 1e−11 and 3, respectively [7]. Clearly, increasing the biasing voltage, V B , lowers the power consumption. However, this reduces the sensing current margin, which strongly affects the sensing circuit. Therefore, it is important to study the effect of changing V B . Figure 9a shows the effect of V B on the voltage swing and the delay. As previously discussed, the bias voltage should be greater than twice the transistor threshold voltage, which is around 0.63 V. The output voltage swing peaks at around V B = 0.75 V. The delay for both reading scenarios (reading LRS then HRS and vice versa) is also shown. In our design, we set a practical target of 1 ns for the delay. We choose a bias voltage of 0.7 V to maximize the voltage swing. Another aspect that should be studied is the value of the input voltage. Figure 9b shows the effect of changing V in with fixed V B = 0.7 V. The higher the input voltage, the larger the voltage swing, the shorter the delay, and the higher the power consumption, as shown in Fig. 8.
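The per-wordline power approximation described in this subsection (voltage drop times input current, ignoring wire resistance) can be sketched numerically. This is a minimal sketch under the assumption that every device on the selected wordline sees V DD − V B ; the function names, the 512-column row, the 1.2 V wordline voltage, and the interpretation of k in amperes are assumptions for illustration, while k on , k off and a are the values quoted above.

```python
import math

K_ON, K_OFF, A = 1e-8, 1e-11, 3.0  # fitting parameters quoted in the text
N_COLS = 512                        # assumed row length
V_DD = 1.2                          # assumed selected-wordline voltage (volts)


def wordline_power_linear(row_resistances, v_dd, v_b):
    """Sum of V^2 / R over the selected row, with V = v_dd - v_b."""
    v = v_dd - v_b
    return sum(v * v / r for r in row_resistances)


def wordline_power_nonlinear(row_k, v_dd, v_b, a=A):
    """Sum of V * I over the selected row, with I = k * sinh(a * V)."""
    v = v_dd - v_b
    return sum(v * k * math.sinh(a * v) for k in row_k)


# Extreme data patterns: every cell in LRS (k_on) or every cell in HRS (k_off).
p_all_lrs = wordline_power_nonlinear([K_ON] * N_COLS, V_DD, v_b=0.7)
p_all_hrs = wordline_power_nonlinear([K_OFF] * N_COLS, V_DD, v_b=0.7)

# Raising the bias voltage shrinks V_DD - V_B and therefore the read power.
p_higher_bias = wordline_power_nonlinear([K_ON] * N_COLS, V_DD, v_b=0.8)

print(f"all-LRS row: {p_all_lrs:.3e} W, all-HRS row: {p_all_hrs:.3e} W")
print(f"all-LRS row at V_B = 0.8 V: {p_higher_bias:.3e} W")
```

The two extreme patterns bound the data-dependent read power of a row in this idealized model, and the last call reproduces the qualitative trend that a larger V B lowers the power.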
Transistors M 1 to M 4 work as a gain boosting circuit. Hence, the input impedance of the sensing circuit is high (≈ 1 MΩ). Due to the abrupt current changes while reading, V in is disturbed (+20 mV and −35 mV around V B ). The negative feedback of the gain boosting circuit works to recover V in to V B . In our design, the loop recovers in 1 ns. Thus, the delay of the designed circuit is 1 ns. Practically, two phases are needed due to the latched comparator: (a) a reset phase, where the output is set to V DD and the data is set up, and (b) a latch phase, where the data is latched and stored. Thus, another 1 ns is needed to latch the data for a 50% duty cycle clock. With these parameters, the energy consumption of the readout circuit per bit is 7.6 fJ at a 500 MHz clock frequency.
B. Figure of Merit
In [11] , a figure of merit (FOM) is defined for comparing different reading techniques taking into account important metrics such as throughput, and array usage. This FOM is defined as:
where the numerator reflects the array metrics: Throughput, which is the number of bits read per cycle (the bank size), and Array Usage, which is the number of usable data bits divided by the total number of bits. The denominator reflects the per-bit physical parameters: power per bit and cell area. It is reported that the array density of memristor-based selector-less crossbar arrays is approximately 640 Gbit/cm² where the feature size is 6.25 nm [8]. Table I shows a quantitative analysis of different reading techniques for a complete N × N array. We chose to compare with these four techniques, which can accommodate high density crossbar arrays. This comparison illustrates the differences between prior work and the proposed approach in terms of power, throughput, array usage, and FOM.
The estimated area of the 512 × 512 crossbar is around 40.96 µm², and the estimated area of the read decoder is 4.26 µm² based on the technique published in [16], [17] with two pre-decoding stages. To estimate the sensing circuitry area, a 128-bit word is considered, where the array is divided into 4 banks. The estimated area of the 128 sensing circuits including the MUX is around 41.65 µm². The area of the CMOS circuitry is estimated with respect to 5 nm technology, which is expected to be used with crossbar arrays. According to these estimated numbers, the entire CMOS circuitry can be placed under the crossbar array. The static power dissipation is dominated by the sensing circuits and is around 285.7 µW based on 1.92 µW per bit.
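The crossbar area estimate follows directly from the reported array density, and the arithmetic can be checked with a short sketch. The function names are hypothetical, and the 4F² cell footprint used in the cross-check is an assumption that happens to be consistent with the reported 640 Gbit/cm² at a 6.25 nm feature size.

```python
DENSITY_BITS_PER_CM2 = 640e9   # reported array density
UM2_PER_CM2 = 1e8              # 1 cm^2 = 1e8 um^2
FEATURE_SIZE_CM = 6.25e-7      # 6.25 nm expressed in cm


def crossbar_area_um2(rows, cols, density=DENSITY_BITS_PER_CM2):
    """Area of a rows x cols crossbar at the given bit density."""
    return rows * cols / density * UM2_PER_CM2


def density_from_4f2(feature_size_cm):
    """Bit density if each cell occupies 4F^2 (assumed cell footprint)."""
    return 1.0 / (4.0 * feature_size_cm ** 2)


area = crossbar_area_um2(512, 512)
print(f"{area:.2f} um^2")                              # 40.96 um^2
print(f"{density_from_4f2(FEATURE_SIZE_CM):.3e} bits/cm^2")
```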
In this comparison, the exponential RRAM model is used with the aforementioned parameter values and feature size. The main advantage of the proposed technique is the ability to read all the bits of a row in one clock cycle, which is vital in memories, in contrast to the other techniques, which require at least N clock cycles to read the entire row. However, the proposed technique consumes more power due to the nature of the reading technique. It is worth noting that other published works do not account for the readout circuitry, which should be accounted for in addition to any building blocks such as ADCs and comparators [6], [7]. The circuits proposed in this work can be used in other approaches as well, such as those with predefined dummy bits and grounded rows and columns [8], [9], which require a virtual ground and a comparator.
V. BIAS MISMATCH EFFECT
In resistive memories, wordlines and bitlines must be biased to specific voltages depending on the technique used. For example, in the proposed reading scheme, the unselected wordlines are connected to V B , and all the bitlines are connected to V B through the sensing circuit. Thus, PVT variation, wire parasitics, and switch resistance create a mismatch among the wordline bias voltages themselves and between them and the bitline bias voltages, which produces an undesirable current. This undesirable current is unavoidable and must be taken into consideration since it limits the array size.
TABLE I: Performance metric of reading a complete N × N array. These results are adopted from [8], [15].
The voltage mismatch is column independent and is not correlated with other columns. Hence, the number of resistive devices affected by the mismatch is N − 1. This mismatch is referred to as ∆V . In the case of linear switching devices, the current passing through each unwanted device is either I LRS = ∆V /LRS or I HRS = ∆V /HRS. For equally probable states, the total unwanted current is I unW = ∆V (0.5N/LRS + (0.5N − 1)/HRS). Usually the ratio between HRS and LRS is 10^3 or more, so the total unwanted current is approximately I unW ≈ 0.5N ∆V /LRS. However, the desired current for the high/low resistance state is
$$I_W = \frac{V_{DD} - V_B}{HRS} \quad \text{or} \quad I_W = \frac{V_{DD} - V_B}{LRS},$$
respectively. Consequently, the extreme input current to the sensing circuit is I t = I W ± I unW for the high resistance and low resistance states, respectively. This sensed current directly affects the input voltage of the comparator, V x , which should not exceed the noise margin of the comparator. For a 10 mV comparator noise margin, the maximum/minimum input sensed currents, corresponding to LRS and HRS, are I max = 0.22 µA and I min = 0.195 µA, respectively. The maximum column width, N , is the minimum of 2 * I max * LRS/∆V and 2 * I min * LRS/∆V , which is 195 for a 2 mV bias mismatch.
On the other hand, in the case of nonlinear switching devices, the current passing through each unwanted device is I LRS/HRS = k on/off sinh(a∆V ) for the low/high resistance state. Since ∆V is very small, in the range of millivolts, the current can be approximated as I LRS ≈ k on a∆V or I HRS ≈ k off a∆V . The total unwanted current is I unW = a∆V (0.5N k on + (0.5N − 1)k off ). Usually the ratio between k on and k off is 10^3 or more, so the total unwanted current is approximately I unW ≈ 0.5N a∆V k on . However, the wanted current for the high/low resistance state is I W = k off/on sinh(a(V DD − V B )). Consequently, the extreme input current to the sensing circuit is I t = I W ± I unW for the high resistance and low resistance states, respectively. The maximum column width, N , is the minimum of 2 * I max /(∆V * a * k on ) and 2 * I min /(∆V * a * k on ), which is around 6500 for a 2 mV bias mismatch. With nonlinear devices, the column width of the crossbar increases substantially due to the high resistance facing the mismatch voltage.
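The two column-width bounds in this section can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and LRS = 1 MΩ is an assumption taken from the value used for Fig. 1b, which is consistent with the reported width of 195; I max , I min , ∆V , k on and a are the values quoted in the text.

```python
I_MAX, I_MIN = 0.22e-6, 0.195e-6  # amperes, tolerated sensed currents
DELTA_V = 2e-3                    # volts, bias mismatch
LRS = 1e6                         # ohms (assumed, from the Fig. 1b setup)
K_ON, A = 1e-8, 3.0               # nonlinear fitting parameters


def max_columns_linear(i_limit, lrs, delta_v):
    """Solve 0.5 * N * delta_v / lrs = i_limit for N."""
    return 2.0 * i_limit * lrs / delta_v


def max_columns_nonlinear(i_limit, k_on, a, delta_v):
    """Solve 0.5 * N * a * delta_v * k_on = i_limit for N."""
    return 2.0 * i_limit / (delta_v * a * k_on)


n_linear = min(max_columns_linear(i, LRS, DELTA_V) for i in (I_MAX, I_MIN))
n_nonlinear = min(
    max_columns_nonlinear(i, K_ON, A, DELTA_V) for i in (I_MAX, I_MIN)
)
print(round(n_linear), round(n_nonlinear))  # 195 6500
```

In both cases the smaller current limit, I min , sets the bound, and the small-signal conductance a × k on of the nonlinear device replaces 1/LRS in the linear expression.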
VI. CONCLUSION AND FUTURE PERSPECTIVE
The proposed reading technique with its readout circuitry can read an entire row without using reference bits, giving the maximum utilization of the RRAM and the highest throughput for high speed applications. These advantages come at the cost of higher power consumption (4.7X more than [8]). According to the defined FOM, which takes into consideration power consumption, array usage, effective array size, and throughput, the proposed technique is more than 100X better than [8] without banking. Moreover, according to the studies of sneak path immunity, power consumption, and bias mismatch discussed above, nonlinear devices are the most recommended for high-density resistive memories. In addition, the proposed technique is compatible with the published writing techniques [5], where switches are placed around the array to enable either reading or writing, since reading and writing cannot be performed simultaneously in the same array.
One of the features of resistive memories is their ability to store multi-level/multi-state data, such as ternary and quaternary data, enabling higher radix processing units. The proposed technique can be applied to multilevel memories as well due to its ability to read the device resistance, especially with nonlinear switching devices. However, the readout circuitry needs to be modified to accommodate the multiple states and to be able to differentiate between them. This topic will be investigated in future research.
Stackability of resistive memories is another feature enabling ultra-dense memory arrays [18]. The discussed readout technique, together with its circuitry, can be used to read each crossbar layer by connecting the corresponding crossbar outputs together and then to the readout circuits. A level decoder is needed to select the readout level. Using this configuration, only one row in a certain level is selected at a time, while the other outputs result in zero output current. The stacked layers share the same reading circuitry, which decreases the overhead of the readout circuits. For instance, Xpoint memory consists of two resistive memory layers that share the same bitlines and have separate wordlines. The proposed readout circuitry can be connected to the bitlines, and the wordlines are used to access the memory cells. The power density is one of the important aspects of any electronic circuit. With the aforementioned technique, the power density is approximately constant because only one row per level is read at a time. However, stackability might cause other reliability issues that limit the number of stacked layers, which is currently a subject of intensive research by the community.
