Abstract-The use of memristors and resistive random access memory (ReRAM) technology to perform logic computations has drawn considerable attention from researchers in recent years. However, the topological aspects of the underlying ReRAM architecture and its organization have received less attention, as the focus has mainly been on device-specific properties for functionally complete logic gates through conditional switching in ReRAM circuits. A careful investigation and optimization of the target geometry is thus highly desirable for the implementation of logic-in-memory architectures. In this paper, we propose a crossbar-based in-memory parallel processing system in which, through the heterogeneity of the resistive cross-point devices, we achieve local information processing in a state-of-the-art ReRAM crossbar architecture with vertical group-accessed transistors as cross-point selector devices. We primarily focus on the array organization, information storage, and processing flow, while proposing a novel geometry for the cross-point selection lines to mitigate current sneak-paths during an arbitrary number of possible parallel logic computations. We demonstrate the proper functioning and potential capabilities of the proposed architecture through SPICE-level circuit simulations of half-adder and sum-of-products logic functions. We compare certain features of the proposed logic-in-memory approach with another work from the literature, and present an analysis of circuit resources, integration density, and logic computation parallelism.
G. Papandroulidakis and G. Ch. Sirakoulis are with the Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, 67100, Greece (e-mail: gpapandr@ee.duth.gr; gsirak@ee.duth.gr).
I. Vourkas and A. Abusleme are with the Centro de Investigación en Nanotecnologia y Materiales Avanzados, Department of Electrical Engineering, Pontificia Universidad Católica de Chile, Santiago, 7820436, Chile (e-mail: iovourkas@uc.cl; angel@uc.cl).
A. Rubio is with the Department of Electronic Engineering, Polytechnic University of Catalonia, Barcelona 08034, Spain (e-mail: antonio.rubio@upc.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNANO.2017.2691713
I. INTRODUCTION

For some time now, advances in semiconductor technology have continued to boost both the memory capacity and computing speed of modern computers. In this regard, among today's various emerging memory technologies [1], resistive random access memory (ReRAM) based on resistive switching nanodevices (memristors or memristive devices) [2] stands out as one of the best-studied and most promising candidates for next-generation nonvolatile memory (NVM) applications [3]-[5]. ReRAM has several attractive properties, such as fast operation, low power consumption, multilevel single-cell storage, and very high integration density (4F² footprint, where F is the minimum feature size of the process technology), owing to the simple, dense, high-connectivity structure of the nanocrossbar geometry [6], [7]. Additionally, a paradigm shift is needed in computing systems beyond the classical and so far dominant von Neumann architecture, which separates storage from computation in distinct units [8]. Data processing in von Neumann systems is normally carried out sequentially, thus requiring a lot of information exchange and communication between the central processing unit(s) (CPU(s)) and data storage module(s). Moreover, today's computers are able to process data at CPU speeds much faster than their memory-access speed, making the latter a true bottleneck [1]. In an attempt to overcome such limitations and further improve data-processing efficiency, research has therefore recently begun to focus on brain-inspired (neuromorphic) and, more generally, in-memory computing approaches [9]-[15].
In this context, the memristor provides an unconventional computing framework, ideally combining resistance-based information storage and processing in a single device [16] . Several recently published logic circuit design approaches [17] - [24] use memristors as binary elements in a digital platform, offering functionally complete logic gates and promising to maximize the benefits of digital computing in future system architectures in which memory and processing co-exist. Nevertheless, mostly owing to the fact that memristor device technology is still at an early stage, such papers have primarily focused on the digital logic realization procedure, logic gate implementation and/or the device-level requirements, and almost all of them have omitted (or left for future research) the study of the underlying ReRAM organization, the target circuit architecture, and the impact of the driving circuitry. Further research is necessary at the circuit and/or architecture level to make logic computations as parallel as possible and thus enable practical application [25] - [28] . It would be particularly interesting to see whether and how such logic design approaches could practically fit in compact memristive storage circuit architectures, exploiting the nonvolatility of memristors in normal power-off (and thus more energy efficient) logic-in-memory circuits.
In this paper, we aim to address precisely this gap. More specifically, we propose an early approach to a crossbar-based in-memory processing system in which the binary information stored in memristors is locally processed in the same unit: i) without the stored data being affected while it is used as logic input; and ii) keeping the logic result in the state of the memristors, where the computation takes place. The basic concept is to take advantage of the possible heterogeneity of cross-point devices (and/or combinations of them) within dense crossbar arrays, while also using a novel group-accessing scheme for the selection lines of the target ReRAM cross-points. Our study particularly focuses on the ReRAM organization, taking into consideration: i) device-level memristor properties, by incorporating a threshold-type SPICE-compatible switching model of bipolar voltage-controlled memristors [29] , [30] ; ii) circuit-level properties, by adopting a functionally complete memristor-based digital logic design methodology that allows for single-step multi-input parallel logic computations [23] ; and iii) architecture-level details assuming state-of-the-art high-density and CMOS-compatible memristor-transistor crossbar geometry with a group-accessed vertical (which could be thought of as a nano-pillar vertical gate-all-around or VGAA) transistor as the cross-point selector underneath each memristive device [31] - [33] . In this way we achieve: i) memory and logic operations through gate-controlled resistance switching with no current sneak-paths [34] ; ii) a much smaller cell footprint compared to that of traditionally planar transistors; and iii) lower operating power compared to that of a typical passive cross-point array. We justify our choice of the memristor logic implementation methodology and comment on the overall system performance, the integration density, and the achieved parallelism of the logic computations. 
We provide SPICE circuit simulation results to confirm the proper operation of the proposed architecture for examples of sum-of-products and half-adder (HA) logic functions, highlighting the advantages of the proposed logic-in-memory approach compared to the MAGIC-in-crossbar logic proposed in [28].
II. TOPOLOGY DESCRIPTION AND ANALYSIS
A. Main Topological Features
The general floor plan, showing the basic modules included in the proposed architecture, is given in Fig. 1 . Overall, it consists of two banks of crossbar arrays and their supporting circuit modules. In fact, there are two separate nano-pillar (vertical) transistor-memristor (1T1R) crossbar arrays, i.e., the memory crossbar, at the bottom of the figure, and the computing crossbar, which is larger, located at the top. Black dots generally denote cross-points, whereas the horizontal and diagonal rectangles denote the topology of the transistor selection lines (SLs), i.e., the groups of the select transistors whose gate terminals are simultaneously driven.
As shown in Fig. 1, the cross-point stacking structure of the memory crossbar array comprises horizontal word lines (WLs) at the top and vertical bit lines (BLs) at the bottom. Drivers for the WLs and SLs are assumed on opposite sides of the array to better distribute the layouts of the peripheral circuitry of the CMOS devices used for the selection and application of the voltage pulses required in each access operation. The inset shows a cross-section of part of a memory word with the ReRAM cells (shown as a two-material stack without loss of generality) stacked directly on top of group-selected nano-pillar VGAA transistors. We assume that the ReRAM and the selection device layers are deposited sequentially, with the transistors placed on the bottom layer (fabricated as front-end transistors) to limit the influences of the parasitic capacitance and resistance and to minimize the area penalty. A suitable VGAA transistor in this case could be the Si nano-pillar MOSFET of [31], which demonstrates very good gate controllability and less than 0.1 nA leakage current and renders a 4F² footprint cross-point cell, much smaller than the 8-12F² of the traditional planar 1T1R cell. Moreover, group-accessing of the SLs makes it possible to reduce the number of transistor gate lines from a total of WLs × BLs (when accessing every transistor separately) to only WLs, thereby significantly simplifying the crossbar fabrication process, as specifically discussed in [3].
During the state programming and/or reading memory processes, the input signal simply flows from the WL to the BL. The latter (depending on the access operation) can be either grounded, left floating, or driven to the sense amplifiers via control switches, as shown at the bottom of Fig. 1 . More specifically, write and read signals are applied to the target WL, whose respective SL is activated. Reading is a one-step process: all the BLs are driven to the sense amplifiers (one for each BL). Writing, however, is a two-step process: since the bias direction is mutually reversed in the SET and RESET processes for bipolar ReRAM cells, they are performed in separate cycles [34] . For instance, the BLs of the ReRAM cells that will be SET are grounded first, while the rest of the BLs are left floating. In the very next cycle, the previously grounded BLs are left floating and the rest are grounded for the RESET process.
On the opposite (top) side of this array, the BLs are connected to summing amplifiers (whose role will be explained later), which separate the memory array from the computing array, found at the top of the figure. The cross-point stacking structure of the computing array is practically the same. For the orientation of its top/bottom nanowires, we assume that the vertical logic input/output lines (LILs and LOLs, respectively) are at the top of the structure, while the horizontal routing lines (RLs) are at the bottom. Whether a logic line is named LIL or LOL depends on whether it is connected to the summing or the sense amplifiers, respectively. As shown in the respective arrow diagram in the inset, during computations the input signal follows a circular flow, starting from the WL of the memory array, moving to the BL and the LIL through the summing amplifiers, then to the RL, and finally to the LOL and the sense amplifiers. A set of control switches makes it possible to drive the output of the summing amplifiers to the LIL. Depending on the activated SL, the signal always flows through two 1T1R cells, i.e., through two memristors (with the same or opposite polarities) and two transistors, all connected in series via a common horizontal RL. The basic concept is to employ complementary material stacking structures in different cross-point cells of the computing array, i.e., the regular ReRAM stack and the one with a reversed material deposition order [35]-[37], in order to have both forward- and reversely-polarized memristors for the purposes of the computations, which are explained in the following section.
Furthermore, in order to maintain the high controllability and favorable implementation properties offered by the group-accessed cross-points, while also permitting the parallel execution of several logic computations, we introduced a novel geometry for the SL of the computing array, assuming twisted transistor gate nanowires. We will next show that when this group-selection strategy for the twisted SL is used, there are no disturbing current sneak paths when multiple SLs are driven, provided the simultaneously driven SLs follow certain acceptable patterns. The simple two-terminal structure of memristors and the proposed twisted SL topology enable the integration of digital logic computations in the crossbar, where the required serial ReRAM connection is naturally achieved [23]. However, this novel SL geometry entails a small area overhead. The 1T1R cross-points with the nano-pillar transistor will require a row- and column-pitch of 3F in order to accommodate the transistor channel width (1F), the gate surrounding the channel (1F), and the spacer (1F) [36]. Therefore, the computing cross-point cell area becomes 9F², i.e., 2.25× larger than that of the 4F² footprint memory crossbar.
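The area and gate-line figures quoted for the two crossbar banks follow from simple arithmetic, which the short Python sketch below reproduces. The function names are illustrative only, and the 2F pitch of the memory cell is an assumption inferred from its 4F² footprint (a square cell is assumed).

```python
def cell_area_f2(pitch_in_f):
    """Area of a square cross-point cell in units of F^2, given its pitch in F."""
    return pitch_in_f ** 2


def gate_line_count(num_wls, num_bls, group_accessed):
    """Transistor gate lines needed by the memory crossbar.

    Group-accessing shares one gate line per word line; otherwise every
    select transistor needs its own line.
    """
    return num_wls if group_accessed else num_wls * num_bls


memory_area = cell_area_f2(2)     # assumed 2F pitch, giving the 4F^2 memory cell
computing_area = cell_area_f2(3)  # 1F channel + 1F gate + 1F spacer = 3F pitch
overhead = computing_area / memory_area

print("memory cell area (F^2):", memory_area)
print("computing cell area (F^2):", computing_area)
print("area overhead:", overhead)
print("gate lines, 5 x 6 array:", gate_line_count(5, 6, False), "vs", gate_line_count(5, 6, True))
```

The 5 × 6 array used in the last line is only an example size; any WL and BL counts can be substituted.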
B. Basics of ReRAM-based Logic Circuits
The memristor-based logic computations in our work rely on the memristive logic family proposed in [23]. It is a parallel-processing, functionally complete logic design scheme, unlike some published sequential processing approaches, such as the CMOS-like [17], MAGIC [20], and IMPLY [22] logics. It enables the parallel execution of single-step digital logic operations, exploiting the threshold-dependent switching behavior of memristors and of their simple series connection. Compared to other parallel processing logic design concepts, such as the MRL proposed in [21], it enables the execution of a wider variety of logic operations based on devices with sharp transitions (filamentary or threshold-type) instead of linear (or homogeneous) switching devices, which generally respond more slowly to the applied input signals [38]. A comprehensive overview and comparison of the aforementioned logic approaches was presented in [24].
This memristive logic family [23] uses the total conductance (memductance) of the devices for the parallel computation of AND, OR, NAND, NOR, XOR, and NOT logic operations. Fig. 2 summarizes the general circuit concept for the implementation of the aforementioned logic gates and sum-of-products logic functions. Understanding the overall circuit behavior, sometimes based on collective dynamics of two properly polarized memristors, requires comprehending the switching dynamics of individual memristors first. To this end, it is worth noting that bipolar memristors with opposite polarities will tend to switch their states in a reciprocal manner [39]. Hereinafter we will refer to a memristor being forward/reversely polarized (FPM/RPM) when the voltage is applied to the top/bottom terminal with the bottom/top terminal being grounded; the bottom terminal is always denoted by the thick black line in the circuit schematics. Moreover, we will assume that the resistance will decrease when the memristor is forward biased and increase when it is reversely biased. In fact, in threshold-type switching memristors the resistance change-rate is very fast above (and negligibly slow below) the voltage threshold V_SET or V_RESET, which determines the SET (R_OFF → R_ON) or RESET (R_ON → R_OFF) transition, respectively.
In this logic design scheme [23] the input voltages consist of: i) a very low voltage for logic '0'; and ii) a voltage higher than the threshold V_SET (and/or |V_RESET|) for logic '1'. However, the key idea is that any binary input logic combination is encoded into a corresponding positive aggregate input voltage (the sum of the separate input voltages), which is then applied to the input terminal of the memristive gate, the latter being any of the six options shown in Fig. 2(a). We will now briefly describe how the overall conductance (memductance) can be used for such logic computations. For example, a single FPM will switch from a low conductance (L) to a high conductance (H) if either (or both) of the applied inputs is logic '1', i.e., if the aggregate input voltage exceeds V_SET. Likewise, when two FPMs are in series, the composite memductance will rise from a low value (L') to a high value (H') only if both inputs are logic '1', meaning that the aggregate input voltage will exceed 2 × V_SET. Evidently, the memductance in these two cases defines the OR and AND logic operations, respectively, as functions of the aggregate applied input voltage. The need for this aggregate input voltage explains the use of summing amplifiers in the proposed system, although different circuit techniques could also be used for the same purpose. The operation of the rest of the logic gates is explained in a similar manner in [23], [24]. For instance, a single RPM implements the NOR gate since it will switch from a high conductance (H) to a low conductance (L) if either (or both) of the applied inputs is logic '1', i.e., if the aggregate input voltage exceeds V_RESET. Likewise, when only one logic input is considered, the RPM is equivalent to a NOT gate as well.
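The threshold behavior described for the OR, AND, and NOR gates can be captured in a purely behavioral Python sketch. This is not a device model: it only compares the aggregate input voltage against the thresholds named in the text. The 0.3 V threshold is the value used in the simulations of Section III, whereas the logic '0' and logic '1' input levels (V_LOW, V_HIGH) and all function names are assumptions made for illustration.

```python
V_SET = 0.3        # SET threshold used in the paper's simulations (V)
V_RESET_ABS = 0.3  # |V_RESET|, equal to V_SET in the paper's simulations (V)
V_LOW = 0.0        # assumed input level for logic '0'
V_HIGH = 0.35      # assumed input level for logic '1' (slightly above V_SET)


def aggregate_voltage(bits):
    """Sum of the individual input voltages, as produced by a summing stage."""
    return sum(V_HIGH if b else V_LOW for b in bits)


def or_gate(bits):
    """Single FPM: conductance goes L -> H once the aggregate exceeds V_SET."""
    return int(aggregate_voltage(bits) > V_SET)


def and_gate(bits):
    """Two FPMs in series: composite conductance rises only above 2 x V_SET."""
    return int(aggregate_voltage(bits) > 2 * V_SET)


def nor_gate(bits):
    """Single RPM: conductance goes H -> L once the aggregate exceeds |V_RESET|."""
    return int(not aggregate_voltage(bits) > V_RESET_ABS)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "OR:", or_gate((a, b)), "AND:", and_gate((a, b)), "NOR:", nor_gate((a, b)))
```

Calling nor_gate with a single input, e.g. nor_gate((1,)), gives the NOT behavior mentioned in the text. The XOR and NAND gates are left out because their switching sequence is detailed in [23], [24] rather than here.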
Because it is based on memristors connected only in series, this logic design fits well with the structural specifications of the proposed system. Furthermore, since memristors with opposite polarity are required for some of the logic operations, we assume that both FPMs and RPMs are readily available inside the heterogeneous (in terms of the cross-point devices) computing array. This concept of heterogeneity was first proposed in [40] . Logic operations are conducted via conditional switching of nonvolatile ReRAM cells, featuring no gain. Any logic input combination will have an irreversible effect on the conductance of the memristors. For instance, if we first apply either "10" or "01" and immediately after that "00," then due to the nonvolatility of the memristors the final result will not be correct. Consequently, it is necessary to initialize every memristive gate via a reset pulse in between each input logic combination, a requirement common to several memristive logic design approaches [20] , [22] . To this end, initialization drivers access the RL and LIL/LOL and are assumed to be distributed around the computing-array. This notwithstanding, there is no need for a reset step between "01" and "10" inputs. The same is true if the next input, after a "10" or "01," is "11," meaning that algorithmically the efficiency of such logic circuits could be improved by evaluating the number of '1s' in the input logic combination; however, that falls beyond the scope of this paper.
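The remark about skipping the reset step for certain input sequences can be illustrated with a small scheduling sketch. It rests on an assumption that generalizes the three cases given in the text and is not a rule stated by the authors: a reset is taken to be necessary only when the number of '1's in the next input combination is lower than in the previous one, and the gate is assumed to start properly initialized. The function names are hypothetical.

```python
def needs_reset(prev_bits, next_bits):
    """Assumed rule: reset only if the count of '1's decreases between inputs."""
    return next_bits.count("1") < prev_bits.count("1")


def count_resets(sequence):
    """Number of reset pulses needed to apply a sequence of input combinations."""
    return sum(needs_reset(p, n) for p, n in zip(sequence, sequence[1:]))


print(needs_reset("10", "00"))  # the failing case described in the text
print(needs_reset("01", "10"))  # no reset needed according to the text
print(needs_reset("10", "11"))  # no reset needed according to the text
print(count_resets(["01", "10", "11", "00"]))
```

Under this assumption, ordering the input combinations by increasing number of '1's would minimize the number of reset pulses.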
C. Operational Features and Performance
1) Computing Flow characteristics:
Having explained the basic topological characteristics of the proposed system, as well as the basics of the logic design scheme using memristors, we will now provide a specific example to highlight the most important system- and circuit-level operational properties. The system floor plan shown in Fig. 3 includes a 5 × 6 1T1R memory array at the bottom and a 5 × 10 1T1R computing crossbar at the top. These dimensions are merely indicative for the purposes of this example and do not demonstrate any particular requirement for the dimensional ratio of the two crossbar banks. However, it is worth noting that, in order to perform two-input logic computations, two BLs from the memory array should correspond to one LIL from the computing array, i.e., there should be 2× as many BLs as LILs. For logic operations with more than two logic input variables, more BLs should correspond to each LIL and the computing ReRAM cells should be changed accordingly, as described in [23].
Fig. 3. System snapshot highlighting operational details for parallel logic computations. The inset shows the equivalent circuit schematic between word lines (WLs) and logic input lines (LILs). A cross-section of two parts of the computing array indicates the 1T1R cell structure, the memristor polarity, and the signal flow between LIL and LOL.
In this context, Fig. 3 provides a snapshot of a system supporting only two-input logic operations. Among the selection lines, those being activated are highlighted in light blue. The aim is to involve the data stored in the memory array (without affecting them) in multiple single-step parallel logic operations. In this example, we assume that we want to perform two parallel logic operations with the two red-dot pairs of cells of the activated memory word. The red dots generally denote the cross-point cells currently being used. Thus, a read voltage pulse (with an amplitude lower than the switching threshold of the memristors) is applied to the target WL and the corresponding horizontal SL is driven. The BLs are left floating by the control switches at the bottom and are all connected to the summing amplifiers at the top. The inset shows the equivalent circuit schematic; the WL input signal passes through two 1T1R cells before reaching the R_S input resistors of the summing amplifiers. In this way, depending on the binary state of these memristors and, thus, on the voltage drop on them, the corresponding BL will have either a high (logic '1') or low (logic '0') voltage. In other words, through the summing amplifiers, we compute a weighted sum of the commonly applied WL read voltage, which identifies the stored binary state of the memristors. For example, the case "R_OFF, R_OFF" = "00" will give a very small voltage sum, whereas "R_OFF, R_ON" = "01" (or equivalently, "10") will give a high voltage sum, and "R_ON, R_ON" = "11" will result in the highest input voltage sum, depending on the R_S and R_F resistor values. This sum is in fact the input voltage for our logic computation and is driven to the LIL through the corresponding control switches. In this example we show only two of them in closed position, one for each parallel logic operation.
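The way the summing amplifier turns a stored bit pair into an aggregate logic input voltage can be sketched with the ideal inverting-summer relation, in which each memory cell in series with its R_S resistor feeds a current into a virtual-ground node. This is a simplified sketch and not the simulated circuit: the select-transistor and wire resistances are ignored, and the R_ON and R_OFF values below are hypothetical, chosen only for illustration. The read voltage (0.12 V) and the resistor values R_S = 1 kΩ and R_F = 24 kΩ are the ones reported in Section III.

```python
R_S = 1e3    # input resistor of the summing amplifier (ohm), from Section III
R_F = 24e3   # feedback resistor of the summing amplifier (ohm), from Section III
V_READ = 0.12  # WL read voltage (V), from Section III

R_ON = 5e3     # hypothetical low-resistance state (ohm)
R_OFF = 500e3  # hypothetical high-resistance state (ohm)


def summing_amp_output(bits, v_read=V_READ):
    """Ideal inverting summer: V_out = -R_F * sum(v_read / (R_mem + R_S))."""
    resistances = [R_ON if b else R_OFF for b in bits]
    return -R_F * sum(v_read / (r + R_S) for r in resistances)


for pair in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(pair, round(summing_amp_output(pair), 4), "V")
```

With these assumed resistances, "00" yields a magnitude well below the 0.3 V threshold, "01"/"10" exceed one threshold, and "11" exceeds twice the threshold, which is the separation the logic scheme relies on. The negative sign of the output matches the inverting configuration noted in the simulation section.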
Next, the LIL signal passes through the series-connected ReRAM cells (via the bottom RL), whose twisted SLs are driven, finally reaching the LOL, where another set of control switches (two of them are again shown closed) drive the logic output to the sense amplifiers.
The logic gates always involve ReRAM cells that are connected to a common RL. The inset in Fig. 3 also shows a cross-section of two particular parts of the same RL of the computing array, with a view to clarifying the cross-point stack structure, the polarity of the memristors, and the current flow, which passes through the two series 1T1R cells. The symbol ⊗ denotes a signal entering the plane of the paper vertically, whereas ⊙ denotes a signal coming out of the plane of the paper at that point. As can be seen, the input logic signal crosses two memristors with opposite polarities, implementing an XOR gate (see Fig. 2). The other simultaneous two-bit logic operation shown in Fig. 3 (without specifying the cross-point cell type), involving the leftmost pair of red-dot memory cross-points, occurs in the same fashion.
2) Exploitation of Cross-Point Heterogeneity: As mentioned previously, the proposed computing crossbar bank consists of an array that is heterogeneous in terms of its cross-point devices. Specifically, we assume that different existing ReRAM cells might have a reversed material deposition order, thus enabling both forward and reversely polarized memristors that are physically connected in series via the RL to form the desired ReRAM-based logic gates. Furthermore, single-cell repetitive material-stacking structures could enable the implementation of two series ReRAM devices (with the same or opposite polarity) in a single cross-point [35] - [37] , i.e., 1T2R cells. Additionally, we assume the existence of typical 1T routing junctions (i.e., vertical transistor cross-points without ReRAM device) at certain RL-LOL intersections to directly drive a logic signal from the RL to the LOL. The latter are required, e.g., for single-memristor logic gates (see Fig. 2 ), where computation is completed in the LIL-RL cross-points.
This heterogeneity concept is summarized in Fig. 4, which shows the computing crossbar bank configuration for an example of three parallel logic operations. This example consists of a sum-of-products computation, a NAND, and an XOR gate, which take place in RL 5, 3, and 1, respectively. Ideally the array could be divided thematically into several islands (sub-arrays) depending on the cross-point type. Specifically, the LIL-RL cross-points may include either 1T1R or 1T2R cells with all possible polarity combinations, whereas 1T1R or 1T cells are assumed in the LOL-RL cross-points. The 1T1R LIL-RL cross-points (Fig. 4(d)) should be combined with 1T1R (Fig. 4(c) and (e)) or 1T (Fig. 4(g)) LOL-RL cross-points to either physically connect two ReRAM devices (Fig. 4(d) and (e)) or route a single ReRAM computation directly to the LOL, respectively. Similarly, 1T2R (Fig. 4(b) and (f)) LIL-RL cross-points might be combined with either 1T1R or 1T LOL-RL cross-points to either cascade logic gates in sum-of-products computations (Fig. 4(b) and (c)) or route a two-ReRAM computation directly to the LOL (Fig. 4(f) and (g)), respectively.
3) Twisted SL Driving Patterns: The proposed twisted gate nanowires (SLs) for the nano-pillar transistors of the computing bank simplify the array organization and the select/deselect schemes of the ReRAM cells used in logic circuits. Given the need to perform as many parallel computations as possible, a different SL geometry would not work owing to current leakage/sneak-paths [39], [41], which cause incorrect computations and/or increase power consumption. For example, neither row nor column SLs would work with our approach. With column SLs, i.e., in-line with the LIL/LOL, the output current of a summing amplifier in a particular LIL would flow through all the cross-points sharing the same column SL. Likewise, if row SLs like those of the memory bank were used, only one logic operation could be performed at a time.
A relevant example of this row SL (horizontal) geometry is shown in Fig. 5(a), which assumes two parallel logic computations taking place in RL 3 and 5, respectively. In the first, the current should flow only through LIL 4, then RL 3, and finally LOL 5. Similarly, in the second, the current should flow only through LIL 2, then RL 5, and finally LOL 3. Unfortunately, current from LIL 2 and LIL 4 also flows through the cross-points indicated with dashed squares, which share the same gated SL, thereby increasing the likelihood of incorrect logic computations. Such current sneak-paths could corrupt the results of the logic computations and, even when they do not, they increase power consumption.
On the other hand, when the twisted SLs are properly gated, they successfully address the sneak-path effect and, most importantly, enable the maximum number of simultaneous logic computations, which is equal to the number of LOLs. Fig. 5(b) shows a snapshot of such an array, where all LIL and LOL control switches are closed and the maximum number of logic operations takes place in parallel using the ReRAM cells of the anti-diagonal (it could be the main diagonal if the symmetric orientation were selected for the twisted SLs) of the two sub-arrays defined by the LOL and LIL columns. However, current sneak-paths could still appear, as illustrated in Fig. 5(c), which shows the SLs and the control line configuration for three parallel logic computations in RL 1, 2, and 6. As in Fig. 5(a), here currents from LIL 2 and 4 also contribute to incorrect computations.
Therefore, in order to perform more than one parallel logic computation safely, the twisted SLs of both the RL-LOL and the RL-LIL sub-arrays of the computing crossbar bank should be properly configured to prevent the creation of current-leakage paths. One simple rule for achieving this is as follows: any pair of an LIL and an LOL simultaneously in use should never have (i) two unused gated cross-points, or (ii) one unused and one used gated cross-point connected via an RL (i.e., in the same row). In keeping with the aforementioned rule, in Fig. 6 we show all six SL gating combinations that enable maximum, safe, simultaneous utilization of the LIL/LOL columns in an indicative 6 × 6 array example. According to these patterns, every LIL/LOL column of the array has only one gated cross-point in use (red dot), such that every RL connects only two gated cross-points. In fact, assuming an array with m rows and n columns (m × n), there are m different SL gating patterns that enable the simultaneous use of all n LIL/LOL columns, utilizing in parallel up to n of the m × n total cross-points. Finally, it is worth mentioning that using the same LIL signal in two simultaneous logic operations is also possible if the SL gating pattern used complies with the aforementioned rule.
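The gating rule stated above lends itself to a simple automated check. The Python sketch below flags every (RL, LIL, LOL) triple in which an active LIL and an active LOL are joined through an RL by two gated cross-points that are not both in use. The data layout (sets of (RL, column) tuples), the function name, and the column numbering of the example are assumptions made for illustration; the example is only loosely modeled on the two computations of Fig. 5(a) and does not reproduce the figure's exact layout.

```python
from itertools import product


def sneak_paths(gated, used, active_lils, active_lols):
    """Return the (rl, lil, lol) triples that violate the gating rule.

    gated: set of (rl, column) cross-points whose select transistor is on.
    used:  subset of gated cross-points taking part in an intended computation.
    """
    violations = []
    rls = sorted({rl for rl, _ in gated})
    for rl, lil, lol in product(rls, sorted(active_lils), sorted(active_lols)):
        a, b = (rl, lil), (rl, lol)
        # Two gated cross-points on one RL form a conducting LIL -> LOL path;
        # it is legitimate only if both cross-points are intentionally in use.
        if a in gated and b in gated and not (a in used and b in used):
            violations.append((rl, lil, lol))
    return violations


# Intended computations: LIL 4 -> RL 3 -> LOL 5 and LIL 2 -> RL 5 -> LOL 3.
used = {(3, 4), (3, 5), (5, 2), (5, 3)}
lils, lols = {2, 4}, {3, 5}

# Row SLs gate every cross-point of RL 3 and RL 5 (columns 2..5 assumed here).
row_gated = {(rl, col) for rl in (3, 5) for col in (2, 3, 4, 5)}

print("row SLs:", sneak_paths(row_gated, used, lils, lols))
print("selective gating:", sneak_paths(used, used, lils, lols))
```

With row-wise gating, several unintended paths are reported, whereas gating only the cross-points in use (which is what a compliant twisted-SL pattern achieves) yields none.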
III. SPICE SIMULATION RESULTS
As a proof of concept, in this section we present a simulation-based validation of the normally-off ReRAM logic circuits implemented in the proposed crossbar-based topology, using Cadence PSPICE. The logic computations were performed through conditional resistance switching in ReRAM cells, serially connected according to the applied transistor gating signals, thereby implementing the logic gates. The logic circuit examples consist of a half adder (HA) with XOR and AND logic gates for the Sum and Carry bits, and a sum-of-products (SoP) (see Fig. 2). In the simulations we employed a SPICE-compatible threshold-type model of a voltage-controlled bipolar memristor, which attributes the resistance-switching effect to the modulation of an effective tunneling distance [29]. [...] should ideally be minimized as much as possible to ensure high reliability [41], [42]. In the simulations, the threshold values were V_SET = |V_RESET| = 0.3 V and the read/programming pulse amplitudes were properly chosen. Based on the abovementioned parameter values, the memristor switched its state in the µs regime as soon as the voltage drop on it exceeded either of its thresholds.
In keeping with the proposed system floor plan shown in Fig. 1, Fig. 7(a) shows the general configuration of the equivalent circuit we simulated in SPICE. It shows an m × n memory crossbar and a k × n computing crossbar, with all interconnections modeled using an RC delay model with wire resistance and capacitance (R_w = 12.78 Ω and C_w = 0.046 fF, these being the absolute values between adjacent cross-points under the 130 nm technology node, according to [43]-[45]). In particular, for the HA and SoP circuit examples we simulated a circuit consisting of a 4 × 4 1T1R memory array with horizontal WLs and SLs, and a 5 × 4 1T1R heterogeneous computing array with two LILs, two LOLs, and twisted SLs (TSLs). The latter were called input TSLs (iTSLs) or output TSLs (oTSLs) depending on whether they were in the RL-LIL or the RL-LOL sub-array of the computing bank. The size of the simulated circuit was found adequate to demonstrate proper circuit operation; the effect of parasitics on large 1T1R crossbar arrays of size up to 16 Kbit was presented in detail in [45]. In the memory bank below, the data stored in BL_1-2 (BL_3-4) defined the two-bit logic input voltage in LIL_1 (LIL_2). Thus, there was no need to program the logic inputs in the four memory words since the memristor pairs in BL_1-2 and BL_3-4 held all possible two-bit input combinations, which were then sequentially driven to the logic gates as we applied a 0.12 V read voltage pulse to one of the WLs (WL_1-4).
Fig. 8. Simulation results for the circuit in Fig. 7(c) considering (a) an AND gate and (b) an XOR gate for logic input "11". The graphs show the input voltage pulse (opamp output), the voltage drop on the two memristors, and the memristance evolution with time. WL read voltage was 0.12 V and 0.25 V, respectively.
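To give a feel for the order of magnitude implied by the per-segment wire parasitics quoted above, the following Python sketch applies the textbook Elmore approximation to a uniform RC ladder. This is a swapped-in back-of-the-envelope estimate and not the SPICE interconnect model used in the simulations: it ignores the loading of the memristors and select transistors, and the assumption that a 16 Kbit array is laid out as 128 × 128 is ours.

```python
R_W = 12.78      # wire resistance between adjacent cross-points (ohm), from the text
C_W = 0.046e-15  # wire capacitance between adjacent cross-points (F), from the text


def elmore_delay(num_segments, r_seg=R_W, c_seg=C_W):
    """Elmore delay of a uniform RC ladder: R * C * n * (n + 1) / 2, in seconds."""
    return r_seg * c_seg * num_segments * (num_segments + 1) / 2


for n in (4, 128):  # 4: simulated array width; 128: assumed side of a 16 Kbit array
    print(n, "segments:", elmore_delay(n), "s")
```

Even for the longer line, the wire-only estimate stays in the picosecond range, far below the µs switching regime of the memristor model, which is consistent with the observation that the wire capacitance alone is not the limiting factor.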
The required cross-point heterogeneity is example-specific. Consequently, we introduced variations in the simulated circuit: since AND and XOR gates require memristors with the same and opposite polarity, respectively, we assumed both FPM and RPM cells for the HA example, as shown in Fig. 7(a). For the SoP example, on the other hand, we used 1T2R cross-point cells with two FPMs stacked in series (AND) in the same cross-point in the RL-LIL sub-array (see Fig. 4). Moreover, all the CMOS control switches and the cross-point nano-pillar transistors in SPICE were modeled as n-FETs under standard 130 nm CMOS technology. The applied transistor gate voltages were either 0 V or 3 V. For the summing amplifiers, we used opamps with a 100 GHz gain-bandwidth product and resistors R S = 1 kΩ and R F = 24 kΩ. Their output consisted of a voltage lower than the memristor thresholds for logic "00," a voltage of −0.54 V for logic "01" or "10," and a voltage of −1.1 V for logic "11." This voltage was driven to the LILs through the input control lines (CILs), whereas the logic output was obtained at the intermediate node of the LOLs and the 2 kΩ pull-down resistor (R PD ) network via the output control lines (COLs). Due to the negative sign of the summing amplifier output, the devices in the computing bank had the opposite polarity of that shown in Fig. 2, without loss of generality. Moreover, for simplicity and to avoid resetting the memristive gates between the application of different inputs, we alternated between different memristive gates of the same type for different input combinations, so the memristors in the logic gates were always properly initialized.
Reducing the interconnect effect is critical to achieving reliable ReRAM operation [45]. Therefore, as a preliminary step we simulated smaller parts of the system that are active during logic operations. For instance, Fig. 7(b) and (c) show an example for a two-input logic AND gate. Such a circuit contains all the components involved in the computation, and we also included the wire parasitic elements. Thus, we investigated the impact of the capacitance between the two memristors, especially that contributed by the select transistors. We simulated circuits with both ideal and nonideal interconnections when the logic memristors have the same (AND) or opposite (XOR) polarity. We concluded that the circuits would work properly even if the wire capacitance between the memristors rose to 1000 × C w . However, the impact of the transistor capacitance was more significant. Fig. 8 shows the simulation results for the AND and XOR gates when the logic "11" input is considered. This was found to be the most critical input combination, since we expect both memristors to be completely SET in one case, whereas a SET followed by a RESET is expected in the other. As seen in Fig. 8, both logic operations are successful, despite the delayed response of the second memristor in series in Fig. 8(a). However, the XOR computation in Fig. 8(b) would take longer to complete. This is in line with the analysis in [45]. In fact, the RC delay is more significant during the RESET process, since the larger initial current flow leads to higher latency. This is exactly the case with the RPM of the XOR gate. In this logic family, the logic operation that takes longest to complete determines the pulse duration and thus the logic delay. For presentation reasons, in Fig. 8(b) and in the remaining simulations we used a higher read voltage for the XOR operations only, which produced a higher input amplitude and thus a similar switching time for all gates.
It is also worth mentioning that, with the memristors "tightly" connected, as shown alternatively in Fig. 7(c), we achieved a better gate response. In fact, this configuration removes the impact of the transistor capacitance between the memristors. Therefore, with the RC delay minimized, the logic delay was also optimized; the switching time of the memristors could be as small as a few ns (by increasing the parameter a of the model to 150e8) without problem. In the proposed system, this configuration was achieved by assuming the opposite orientation for the top/bottom nanowires of the computing array, i.e., having the LILs/LOLs at the bottom of the cross-point structure and the RLs at the top.
Fig. 9 shows the simulation results for the HA circuit. In the result plots, voltages were inverted for clarity, and the arrows denote the duration of the input pulses, along with the BL memory content used in each logic operation. As we sequentially read WL 1−4 , the logic input voltage corresponding to the data in BL 1−2 and BL 3−4 was produced at the output of the summing amplifiers in the form of 2.5 µs-wide pulses. When driving WL 1 , we simultaneously drove iTSL 2 and oTSL 2 (see Fig. 7(a)). Likewise, when driving the rest of the WLs (WL 2−4 ), we simultaneously drove iTSL 3−5 and oTSL 3−5 , whereas iTSL/oTSL 1 and 6 were never used. The cross-point cells in LIL 1 and LOL 1 correspond to the AND gate, and the Carry bit voltage output is shown in the top graph of Fig. 9. Likewise, the cross-point cells in LIL 2 and LOL 2 correspond to the XOR gate, and the Sum bit voltage output is shown in the bottom graph. The readout voltages for the logic '1' output reached approximately 170 mV for the XOR gate and 370 mV for the AND gate. The readout voltages confirm the expected operation of the memristive HA.
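The expected HA behavior can be summarized with a small behavioral sketch. It is not a circuit simulation: it only maps each two-bit input to the ideal Sum (XOR) and Carry (AND) bits and attaches the approximate logic '1' readout levels reported above. The logic '0' level of 0 V, the half-amplitude decision threshold, and all function names are assumptions made for illustration.

```python
# Behavioral sketch of the memristive half adder (HA), not a SPICE model.
# The logic '1' readout levels are the approximate values reported in the
# text; the logic '0' level and the decision threshold are assumptions.

SUM_HIGH_V = 0.17    # approx. readout for logic '1' at the XOR (Sum) output
CARRY_HIGH_V = 0.37  # approx. readout for logic '1' at the AND (Carry) output
LOW_V = 0.0          # assumed readout for logic '0'


def half_adder(a, b):
    """Return the ideal (sum, carry) bits for one-bit inputs a and b."""
    return a ^ b, a & b


def readout_voltages(a, b):
    """Return the idealized (sum, carry) readout voltages in volts."""
    s, c = half_adder(a, b)
    return (SUM_HIGH_V if s else LOW_V, CARRY_HIGH_V if c else LOW_V)


def decode(voltage, high_level):
    """Decode a readout voltage using an assumed half-amplitude threshold."""
    return int(voltage > high_level / 2)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            v_sum, v_carry = readout_voltages(a, b)
            print(f"input {a}{b}: Sum = {v_sum:.2f} V, Carry = {v_carry:.2f} V")
```

Because the two outputs have different logic '1' amplitudes, a practical readout stage would need a per-gate threshold, which is why `decode` takes the expected high level as an argument.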
The aforementioned difference in the output voltage amplitude can be attributed to the final composite series memristance in each gate, which is higher for the XOR gate; thus, a smaller voltage drops across the R PD2 resistor. The memristive HA occupies four cross-points of the computing bank, each with a cell area of 9F 2 (due to the space constraints between the TSLs), yielding a total circuit area of 36F 2 . Fig. 10 presents the simulation results for the SoP logic circuit. Unlike in the previous example, the RL-LIL sub-array cross-points are now of the 1T2R type. The rest of the circuit remained as described in the HA example (see Fig. 7(a)). For the data of each WL we simultaneously drove two AND gates and one OR gate connected via the same RL. Thus, when driving WL 1 , we simultaneously drove iTSL 1 , iTSL 2 , and oTSL 1 . Likewise, when driving WL 2 , we simultaneously drove iTSL 2 , iTSL 3 , and oTSL 2 , etc. Each LIL corresponded to one logic product (logic AND), and the logic sum of the two products was computed in the cross-point cell of LOL 1 . The cross-points of RL 5 were never used. In Fig. 10, the top graph shows the readout voltage on the resistor R PD1 . This voltage reached approximately 130 mV for the logic '1' output only when we applied an input combination of "11", i.e., in the first and last case, thereby confirming the expected response. The bottom graph shows the change in the memristance of the two LOL 1 cross-point ReRAM cells implementing the OR gate in RL 1 and RL 4 , where we expect to see changes in accordance with the inputs applied. Regarding RL 1 , the memristors of the AND gate in LIL 1 , which receives the "11" input, first changed their composite memristance, thereby causing the subsequent change in the memristor in LOL 1 , which corresponds to the OR gate.
Similarly, in RL 4 , only the memristors of the AND gate in LIL 2 changed their composite memristance, and the corresponding OR gate memristor in LOL 1 then switched its state, as expected. The minimum required pulse duration here was 5 µs: as more memristors were connected in series, the voltage drop across each of them decreased and their switching time increased compared to the HA example, thus affecting the logic delay. As for circuit area, the SoP implementation occupies three cross-point cells, yielding a total circuit area of 27F 2 . Overall, the simulation results confirm the proper operation of the logic circuits, and all current sneak-paths were successfully mitigated.
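The area figures and the SoP function discussed above can be reproduced with a short sketch. The cell area of 9F² and the cell counts come from the text; the helper names are hypothetical, and the four-input form (a AND b) OR (c AND d) is an assumed reading of a two-product SoP rather than the exact wiring of the simulated circuit.

```python
# Sketch: circuit area in units of F^2 and an ideal two-product SoP function.
# CELL_AREA_F2 and the cell counts are taken from the text; the function
# names and the four-input SoP form are illustrative assumptions.

CELL_AREA_F2 = 9  # area of one heterogeneous cross-point cell, in F^2


def circuit_area_f2(n_cells):
    """Total area (in F^2) of a logic circuit occupying n_cells cross-points."""
    return n_cells * CELL_AREA_F2


def sum_of_products(a, b, c, d):
    """Ideal two-product SoP: (a AND b) OR (c AND d) for one-bit inputs."""
    return (a & b) | (c & d)


if __name__ == "__main__":
    print("HA area :", circuit_area_f2(4), "F^2")   # four cross-point cells
    print("SoP area:", circuit_area_f2(3), "F^2")   # three cross-point cells
    print("SoP(1,1,0,0) =", sum_of_products(1, 1, 0, 0))
```

The SoP circuit needs fewer cells than the HA because the two memristors of each AND gate are stacked in a single 1T2R cross-point.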
IV. DISCUSSION
A comprehensive review of emerging logic design styles with memristors was recently published in [24]. So far, however, the focus has mainly been on functionally complete logic gates with memristors, whereas fitting such logic design approaches into memristive arrays, although important for in-memory computing, has received little attention.
Considering the crossbar geometry as the target architecture [46], a few IMPLY-based approaches have been published [27], whereas recently the MAGIC-in-crossbar approach was investigated [28]. The latter proved more delay- and energy-efficient than IMPLY logic, while also being nondestructive for the input logic data. At the gate level, MAGIC [20] was compared with our logic approach in [24]. Logic computation is a two-step process in both cases, i.e., the logic memristors need to be properly initialized before computation takes place. As far as area is concerned, the number of memristors required for equivalent computations was used as the unit of comparison. A similar number of cross-point devices was needed for corresponding n-input logic operations; thus, a similar crossbar area is generally required, although it is difficult to compute exactly. Furthermore, we carried out SPICE simulations to draw conclusions about delay and energy consumption in corresponding logic operations. The same memristor model parameter values were used in both schemes. The logic delay was computed to decide on the duration of the applied input pulses. Our approach entailed a smaller voltage drop across the memristors, so they took longer to switch. Thus, the input pulse width was defined based on the longer-lasting logic operation and was then set equal for both schemes to compute the energy. MAGIC required approximately 68% less energy on average, considering only the energy dissipated on the memristors. This was attributed to the higher static energy dissipation on the input memristors in our approach. However, our approach supports more crossbar-compatible Boolean operations and is less dependent on device variability. MAGIC input voltages change for different logic operations and strongly depend not only on the voltage thresholds but also on R ON,OFF variability. Therefore, here we focus our comparison on architecture metrics between the proposed system and the transpose crossbar array for MAGIC in [28].
The results are summarized in Table I. More specifically, logic computation in both cases is nondestructive, i.e., the input logic data are preserved. However, realizing several logic computations in parallel is not possible in [28] unless more cross-point devices are used as memory devices to hold copies of the input data, which impacts storage density and logic circuit area. Moreover, in the transpose crossbar of [28] the sneak-path problem is not resolved. Assertion voltages, similar to the half-select scheme in passive crossbar arrays [39], are used to prevent undesirable logic operations on unselected cross-points. All sneak currents, however, flow through the output memristor of the logic gates; thus, the current consumption of the logic gates increases with the crossbar size. The proposed system, on the other hand, relies on VGAA transistors as selector devices and addresses the sneak currents successfully through the TSLs and the proposed gating strategy. In this context, MAGIC logic could benefit from being implemented in the system proposed here rather than in the transpose crossbar of [28]. Finally, another important feature concerns device endurance. The endurance limit of memristors is estimated at approximately 10^12 write operations per cell, and is likely to increase further [1]. However, executing logic operations within memory would further stress the memory cells and thus affect their lifetime. Therefore, endurance must be considered when executing in-memory logic operations in [28]. In the proposed system, by contrast, the memory crossbar bank is separate from the logic bank. Therefore, the logic memristors undergo different stress than the memory memristors, meaning that such a system could benefit from the different endurance features demonstrated by different memristor device technologies. According to [47], a hybrid system could be designed with each bank implemented by a different memory technology.
For instance, spin-transfer-torque memory (STT-RAM), which demonstrates virtually unlimited endurance, could be used in the logic bank, whereas phase-change memory (PCRAM), which demonstrates better scaling performance but more limited endurance, could be used in the memory bank.
V. CONCLUSION
This work presented a novel crossbar-based memristor-transistor ReRAM architecture that enables the integration of logic-in-memory using a logic scheme for parallel, single-step logic computations. This greatly reduces computation time and simplifies logic operation with normally-off (and thus more energy-efficient) dense ReRAM-based logic gates. The architecture is CMOS compatible, with transistors in the bottom layer only. The twisted transistor gate nanowires and the proposed gating patterns address the current sneak-paths and permit the safe execution of an arbitrary number of logic operations in parallel on 9F 2 heterogeneous cross-point cells, using the information stored in memory as input logic data. The heterogeneity of the available devices could be achieved either within the cross-points of the same array or in different arrays that share one set of metal nanowires. The comparison showed the proposed system to be superior in terms of important architecture-level metrics. Future work includes implementing ALU circuits based on larger benchmark circuits. Other emerging logic styles with memristors, e.g., MAGIC, could benefit from being implemented in the proposed system architecture.
