The Internet of Things (IoT) has built a network of billions of connected devices that generate massive volumes of data. Processing large data sets on existing systems incurs significant costs for data movement between processors and memory because of limited cache capacity and memory bandwidth. Processing-In-Memory (PIM) is a promising solution to this issue. Prior techniques that enable computation in non-volatile memory (NVM) are designed around a bipolar switching mode, which suffers from a high sneak current in a crossbar array (CBA) structure. In this paper, we propose a unipolar-switching logic for high-density PIM applications, called UPIM. Our design exploits the unipolar-switching mode of memristor devices, which can be operated in a 1D1R structure, and hence suppresses the sneak current that exists in prior PIM technologies. Moreover, UPIM takes advantage of a 3D vertical CBA structure to increase memory utilization per unit area for high-density applications. Our evaluation on a wide range of applications shows that UPIM achieves up to 31.3× energy saving and 113.8× energy-delay product (EDP) improvement compared to a recent GPGPU architecture. Compared to the state-of-the-art PIM design based on the bipolar switching mode, our design achieves 3.1× lower energy consumption.
DESIGN OVERVIEW

1.1 Memristor Switching Modes
There are two classes of ReRAM switching mode, depending on the polarity of the applied bias. One is 'unipolar', where the switching between the high resistance state (HRS) and the low resistance state (LRS) does not depend on the polarity of the operating voltage, as shown in Fig. 1(a). The other is 'bipolar', where the reset switching (LRS → HRS) and the set switching (HRS → LRS) take place at opposite bias polarities, as shown in Fig. 1(b).

1.2 Unipolar-based logic within NVM

Fig. 3(a) shows the basic structure of the proposed UPIM. To simplify the explanation, we show a logic that supports a two-input NOR operation, but it can be extended to multi-input logic in a straightforward way. Each unipolar device consists of a memristor and a diode. The input values are stored in two memristors, $R_{IN1}$ and $R_{IN2}$, while a third memristor, $R_{OUT}$, stores the computation result. Logical values are stored as the resistance states of the input and output memristors: HRS in $R_{IN1}$, $R_{IN2}$, or $R_{OUT}$ indicates the logical value 0, while LRS represents 1. In our experiment, we exploit the model shown in
To perform the NOR operation, our design first initializes $R_{OUT}$ to $R_{HRS}$. We then apply the $V_{IN1}$ and $V_{IN2}$ voltages to the input memristors and $V_{OUT}$ to the output memristor. Fig. 3(c) shows how the proposed logic performs the NOR operation. In the two-input case, the values stored in the input memristors have four combinations: 00, 01, 10, and 11. When both inputs have high resistance, i.e., '00', the voltage on the bit line (BL), $V_{BL}$, is pulled almost to ground, so the voltage across $R_{OUT}$ ($V_{OUT} - V_{BL}$) is close to $V_{OUT}$. Since $V_{OUT} - V_{BL}$ is larger than $V_{SET}$, it triggers the SET switching of $R_{OUT}$ to LRS. Note that the voltage across the diode is negligible compared to the voltage across $R_{OUT}$, since $R_{OUT}$ was initialized to HRS. In all other cases (i.e., 01, 10, and 11), at least one input memristor is in LRS, so $V_{BL}$ rises and $V_{OUT} - V_{BL}$ stays below $V_{SET}$, leaving $R_{OUT}$ in HRS.

Fig. 3(d) shows the resistance behavior of the UPIM NOR gate. $R_{OUT}$ and $R_{OUT}'$ indicate the resistance states of the output memristor before the operation and after applying $V_{IN}$, respectively. Except for the '00' case, in which SET switching occurs in $R_{OUT}$, all other cases keep $R_{OUT}$ in the high resistance state, which realizes the NOR operation.
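The switching condition described above can be checked with a small behavioral model. The following Python sketch is not the authors' simulation setup: it assumes that the two input memristors, the output memristor, and a resistor R_G to ground all meet at the bit line, it ignores the diode voltage drops, and it uses hypothetical values for R_LRS, R_HRS, V_IN, V_OUT, and V_SET. Only the 300 kΩ value for R_G is borrowed from the process-variation analysis later in the paper.

```python
# Behavioral sketch of the two-input UPIM NOR gate (not the authors' circuit model).
# All numeric values are illustrative assumptions, except R_G, which reuses the
# 300 kOhm value reported in the process-variation analysis.
R_LRS = 100e3   # assumed low resistance state (logic 1), ohms
R_HRS = 10e6    # assumed high resistance state (logic 0), ohms
R_G = 300e3     # resistor from the bit line to ground, ohms
V_IN = 2.0      # assumed voltage applied to both input memristors, volts
V_OUT = 2.0     # assumed voltage applied to the output memristor, volts
V_SET = 1.5     # assumed SET threshold of the output memristor, volts


def bitline_voltage(r_in1, r_in2, r_out):
    """Solve the bit-line node with Kirchhoff's current law, ignoring diode drops."""
    injected = V_IN / r_in1 + V_IN / r_in2 + V_OUT / r_out
    conductance = 1 / r_in1 + 1 / r_in2 + 1 / r_out + 1 / R_G
    return injected / conductance


def upim_nor(a, b):
    """Return the logic value held by R_OUT after one NOR step."""
    r_in1 = R_LRS if a else R_HRS
    r_in2 = R_LRS if b else R_HRS
    r_out = R_HRS                      # step 1: initialize R_OUT to HRS (logic 0)
    v_bl = bitline_voltage(r_in1, r_in2, r_out)
    if V_OUT - v_bl > V_SET:           # step 2: SET only if the drop exceeds V_SET
        r_out = R_LRS
    return 1 if r_out == R_LRS else 0


truth_table = {(a, b): upim_nor(a, b) for a in (0, 1) for b in (0, 1)}
```

Under these assumed values, only the '00' input leaves more than V_SET across R_OUT, so `truth_table` matches a NOR gate. Different device parameters would need a different R_G or bias to keep that separation.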
Integration to 3D CBA structure
The proposed design executes arithmetic functions using NOR operations. Existing NOR-based approaches require additional cells to store intermediate results, and the resulting area overhead is not suitable for high-density applications. In this work, we utilize a 3D structure to minimize the area cost. Fig. 4(a) shows the conventional 2D logic implemented in a memory array. In this structure, the intermediate results are stored in the same plane and consume extra cell area. In contrast, as shown in Fig. 4(b), the 3D structure can store the intermediate results in a different layer. The intermediate cells are therefore hidden under or over the memory cells, increasing chip density compared to the 2D case.

Fig. 5 compares the 2D and 3D cases. We denote the area of the memory cells, which are used to store data, by $A_{memory}$. $A_{logic}$ and $A_{shift}$ are the areas of the intermediate cells for storing logic results and of the interconnects, respectively. We define cell efficiency as the ratio of the memory area to the total area. In the 2D design, since the intermediate cells take chip area, the cell efficiency is $\frac{A_{memory}}{A_{memory} + A_{logic} + A_{shift}}$. If the number of layers is $n$, the cell efficiency of the 3D design is $\frac{n \times A_{memory}}{A_{memory} + A_{shift}}$. This means that 3D logic stacking can achieve high area efficiency.

Fig. 6 shows our integration design of 3D logic-in-memory. $V_{IN}$ and $V_{OUT}$ are applied to the wordlines connected to the memory cells and the intermediate cells, respectively. For example, if $V_{IN}$ is applied to cells 'A' and 'B', the result of the NOR operation is stored in the cell to which $V_{OUT}$ is applied. As the figure shows, the proposed 3D structure improves chip density over the 2D structure by storing the intermediate results in a different layer.
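To illustrate why NOR-based arithmetic produces intermediate results that need storage, the sketch below builds a one-bit full adder from two-input NOR gates only. It uses a standard nine-gate construction chosen for illustration. The gate sequence actually used in UPIM is not given here, and the function name and return format are hypothetical.

```python
# NOR-only full adder: a standard nine-gate construction used for illustration,
# not necessarily the gate sequence used in UPIM.
def nor(x, y):
    """Two-input NOR on 0/1 values."""
    return int(not (x or y))


def full_adder_nor(a, b, c_in):
    """Return (sum, carry_out, intermediates) using only two-input NOR gates."""
    n1 = nor(a, b)
    n2 = nor(a, n1)
    n3 = nor(b, n1)
    n4 = nor(n2, n3)        # equals XNOR(a, b)
    n5 = nor(n4, c_in)
    n6 = nor(n4, n5)
    n7 = nor(c_in, n5)
    total = nor(n6, n7)     # equals a XOR b XOR c_in
    carry = nor(n1, n5)     # equals majority(a, b, c_in)
    # n1..n7 are intermediate results; each would occupy a cell until consumed.
    return total, carry, [n1, n2, n3, n4, n5, n6, n7]
```

In this construction, a single one-bit addition produces seven intermediate values in addition to the two outputs. In a 2D array these values take cells in the memory plane, which is the overhead the 3D structure moves to a separate layer.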
In addition, a memory layer and a computation layer are paired, and such pairs can be stacked in multiple layers. Our design therefore enables parallel operation with a single input signal. In the case of Fig. 6, the UPIM NOR operations on A and B and on D and E can be executed in parallel with a single PIM operation. Table 1 summarizes the comparison of the proposed UPIM with existing technologies. Since UPIM performs logic operations in the 1D1R cell structure, it achieves higher power efficiency than other PIM technologies based on the bipolar switching mode. Moreover, when UPIM is implemented in the 3D CBA structure, it also overcomes the area overhead of the 2D PIM approaches.
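The cell-efficiency definitions given earlier in this subsection can be evaluated numerically. The sketch below reads the 2D efficiency as A_memory / (A_memory + A_logic + A_shift) and the 3D efficiency as n × A_memory / (A_memory + A_shift). The function names and the relative area values are hypothetical and are not taken from the paper's evaluation.

```python
def cell_efficiency_2d(a_memory, a_logic, a_shift):
    """Memory area over total area when intermediate cells share the plane."""
    return a_memory / (a_memory + a_logic + a_shift)


def cell_efficiency_3d(n_layers, a_memory, a_shift):
    """Memory area of n stacked layers over the footprint of one layer."""
    return n_layers * a_memory / (a_memory + a_shift)


# Hypothetical relative areas, chosen only to make the comparison concrete.
A_MEMORY, A_LOGIC, A_SHIFT = 1.0, 0.5, 0.25

eff_2d = cell_efficiency_2d(A_MEMORY, A_LOGIC, A_SHIFT)   # 1.0 / 1.75
eff_3d = cell_efficiency_3d(4, A_MEMORY, A_SHIFT)         # 4.0 / 1.25
```

Because the 3D expression counts the memory area of all n layers against a single-layer footprint, it can exceed 1, whereas the 2D efficiency is always below 1 when the intermediate cells and interconnects take area.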
EXPERIMENTAL RESULTS

Experimental Setup
Performance and energy consumption were obtained with the Cadence Virtuoso and Spectre simulators using a 45nm CMOS process technology. We use VTEAM memristor model 
Energy and Performance
As discussed in Sections 1.1 and 1.2, our unipolar-based logic operates in the 1D1R structure, which lowers static power consumption by reducing sneak-current dissipation. Fig. 7 shows the energy and energy-delay product (EDP) improvements of running applications on the proposed UPIM and state-of-the-art PIM designs
Process Variation
The UPIM design uses a configurable resistor, $R_G$. To make our design robust, we determine the resistor value with consideration of process variation, from which most of today's technologies suffer. In our experiment, two major factors induce process variation: memristor dimension and near-far cell difference. The dimension variation comes from diameter deviation during the lithography and etching processes in the formation of pillar memristors. This results in resistance variation in UPIM, since the resistance of a cylindrical memristor depends inversely on its diameter.

Fig. 8(b) shows the $V_{BL}$ characteristic as a function of $R_G$ when the input values are 00 and 01, considering the process-variation factors. All $V_{BL}$ transfer curves are presented with a dimension variation of 10%, denoted as (H). As $R_G$ increases, the electrical potential on the BL increases because a larger share of the voltage drops across $R_G$. $V_{OUT} - V_{BL}$ has to be higher than $V_{SET}$ for the 00 case and lower than $V_{SET}$ for the other cases, i.e., 10, 01, and 11. Thus, the gap between $V_{OUT} - V_{BL}$@10 and $V_{OUT} - V_{BL}$@00 needs to be wide enough for stable operation. This voltage gap, denoted as the $V_{BL}$ margin, is tunable by adjusting the $R_G$ value. Fig. 8(c) shows the simulation results of the $V_{BL}$ margin for different $R_G$ values. From this graph of the $V_{BL}$ margin as a function of $R_G$, we extract the optimal point and choose $R_{G,OPT}$ = 300 kΩ to guarantee computation accuracy despite process instability.
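The selection of R_G described above can be sketched as a one-dimensional sweep. The Python example below is a simplified stand-in for the circuit simulation, not a reproduction of it: it assumes the same bit-line topology with R_G to ground, ignores diode drops and the near-far cell difference, models the 10% dimension variation as a ±10% spread on each memristor resistance, and uses hypothetical device values. Its optimum should therefore not be expected to coincide with the 300 kΩ reported from the paper's simulation.

```python
from itertools import product

# Illustrative assumptions only; none of these values come from the paper.
R_LRS, R_HRS = 100e3, 10e6      # nominal memristor states, ohms
V_IN = V_OUT = 2.0              # applied voltages, volts
FACTORS = (0.9, 1.0, 1.1)       # assumed +/-10% resistance spread per memristor


def bitline_voltage(r_in1, r_in2, r_out, r_g):
    """Bit-line voltage from Kirchhoff's current law, ignoring diode drops."""
    injected = V_IN / r_in1 + V_IN / r_in2 + V_OUT / r_out
    return injected / (1 / r_in1 + 1 / r_in2 + 1 / r_out + 1 / r_g)


def worst_case_margin(r_g):
    """Pessimistic V_BL gap between inputs 01 and 00 over all variation corners."""
    v00, v01 = [], []
    for f1, f2, fo in product(FACTORS, repeat=3):
        r_out = R_HRS * fo      # R_OUT is initialized to HRS before the operation
        v00.append(bitline_voltage(R_HRS * f1, R_HRS * f2, r_out, r_g))
        v01.append(bitline_voltage(R_HRS * f1, R_LRS * f2, r_out, r_g))
    return min(v01) - max(v00)


# Sweep R_G logarithmically from 10 kOhm to 10 MOhm (20 points per decade).
sweep = [10e3 * 10 ** (i / 20) for i in range(61)]
margins = [worst_case_margin(r_g) for r_g in sweep]
best_index = max(range(len(sweep)), key=margins.__getitem__)
best_r_g, best_margin = sweep[best_index], margins[best_index]
```

The sweep shows the trade-off described in the text: a small R_G pulls the bit line toward ground for every input, a large R_G lets the bit line rise even for the 00 input, and the margin peaks in between.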
Evaluation for Area Efficiency
We evaluated area efficiency of our design as compared to the MAGIC
CONCLUSION
We present an energy-efficient and high-density PIM architecture that enables logic-in-memory based on unipolar-switching memristors. The proposed design resolves the static power issue caused by the sneak current by implementing the logic in the 1D1R cell structure. Our design also addresses the low cell density of other PIM technologies, which consume extra area to store computation results, by implementing them in a 3D CBA. The experimental results show that our design achieves 3.1× and 31.3× improvements in energy consumption compared to the state-of-the-art PIM designs and the GPU architecture, respectively.
