Abstract-In recent times, Resistive RAMs (ReRAMs) have gained significant prominence due to their unique feature of supporting both non-volatile storage and logic capabilities. ReRAM is also reported to provide extremely low power consumption compared to standard CMOS storage devices. As a result, researchers have explored the mapping and design of diverse applications, ranging from arithmetic to neuromorphic computing structures, to ReRAM-based platforms. ReVAMP, a general-purpose ReRAM computing platform, has been proposed recently to leverage the parallelism exhibited in a crossbar structure. However, technology mapping on ReVAMP remains an open challenge. Though technology mapping with device/area constraints has been proposed, crossbar constraints have not been considered so far. In this work, we address this problem. Two technology mapping flows are proposed, considering different runtime-efficiency trade-offs. Both mapping flows take crossbar constraints into account and generate feasible mappings for a variety of crossbar dimensions. Our proposed algorithms are highly scalable and reveal important design hints for ReRAM-based implementations.
I. INTRODUCTION
Traditional computing platforms require transfer of data along energy-hungry buses between the compute cores and the memory hierarchy. This has resulted in performance degradation (memory wall) leading to challenges while processing big-data [1] , [2] . Data transfer between cores and memory is often costlier than computing itself [3] . Such challenges can be mitigated by logic-in-memory (LiM) enabled devices, which can perform simple Boolean operations within the memory or very close to the memory itself. Efficient algorithms for LiM can lead to considerable improvements in performance of applications that require large memory bandwidth to process inputs [4] , [5] , [6] , [7] . Therefore, we need to build mapping tools to leverage the benefits of LiM architectures.
One of the most promising emerging non-volatile memories with computation capabilities is Resistive RAM (ReRAM). ReRAMs offer fast read/write speeds [8], high endurance [9], and long retention times [10], along with the scope of 3D fabrication [11]. Large passive crossbar arrays can be enabled by preventing parasitic currents by means of devices such as a select device in series with a switch (1S1R) or a Complementary Resistive Switch (CRS) [12]. Unlike CRS devices, 1S1R devices offer non-destructive read-outs, making them suitable for logic-in-memory operations. ReRAMs are fast gaining popularity for use as computation devices. Recently, multiple schemes for realizing arithmetic blocks using ReRAMs have been proposed [13], [14]. In addition, efficient implementations of encryption, data compression and linear algebra algorithms have also been mapped to ReRAMs [15], [16], [17]. ReRAMs have also been used for neuromorphic computation [18]. Analog non-volatile ReRAM-based synapses have been used for gray-scale face classification for energy savings [19]. Even emulation of metaplasticity has been demonstrated using analog ReRAMs [20]. Analog memristor crossbar arrays have also been used for sparse encoding of input data, which can be extended to image processing applications [21]. Further, to enable uniform analog switching and fast speed, along with excellent retention properties, a thermal enhanced layer has been proposed to confine heat in the switching layer [22]. Multi-state memristors have also been used for ternary arithmetic as well as native multi-valued logic implementation [23], [24]. It has been experimentally demonstrated that 3D fabrication is feasible for resistive RAM arrays [25].

This work is an extension of the following publication. Bhattacharjee
From the perspective of computing arbitrary Boolean functions, a preliminary method for computing using memristors realizing material implication was presented by Lehtonen et al. [26]. Further, it was shown that any arbitrary Boolean expression can be computed using two working memristors that realize material implication [27]. Logic synthesis flows have been proposed using Imply Sequence Diagram and Or-Inverter Graph for memristors realizing material implication [28], [29]. Optimal technology mapping has been investigated for ReRAM devices that realize three-input Boolean majority with a single input inverted [30]. In addition, area-constrained technology mapping for individual ReRAM devices using Integer Linear Programming, along with scalable heuristics, has been proposed [31]. A general-purpose bit-serial Programmable Logic in Memory (PLiM) architecture was proposed [32] that uses a ReRAM crossbar for data storage as well as computation. A compiler for the same was developed by Soeken et al. [33]. However, these works either consider independent devices or use serial operations on ReRAM crossbar arrays. A transpose resistive memory with additional controller circuitry was proposed by Nishil et al. [34], for which a technology mapping was proposed recently [35]. Inherently, ReRAM arrays support operations on multiple devices that are on the same wordline, allowing parallel operations. The ReVAMP architecture allows harnessing this parallelism by means of VLIW instructions [36].
A ReRAM crossbar array consists of multiple ReRAM devices that share wordlines and bitlines. In this paper, we address the problem of technology mapping for computation using ReRAM crossbar array, by using ReVAMP as the target logic-in-memory architecture. The main challenge is to efficiently harness the bit-level parallelism offered by the crossbar arrays. The key contributions of the paper are as follows.
• Any arbitrary AIG/MIG with k-levels can be mapped with 2(k + 1) devices, arranged as a crossbar with at least two bitlines.
• Any Boolean expression, expressed as an Exclusive-Sum-Of-Products (ESOP), can be computed on a crossbar with three wordlines and at least two bitlines.
• We present two technology mapping approaches for the ReVAMP in-memory computing platform.
• The area-constrained technology mapping approach uses And-Inverter Graph for logic representation and then uses a hierarchical method for generating ReVAMP instructions, aware of the crossbar dimensions. The method supports mapping to a wide variety of crossbar dimensions.
• The delay-constrained mapping approach relies on harnessing bit-level parallelism of the ReRAM crossbar array by maximizing parallel operations across multiple devices that share the same wordline. This method achieves significantly lower delay compared to the existing ReRAM-based serial logic-in-memory architecture.

The rest of the paper is organized as follows. In Section II, we present an introduction to ReVAMP, along with a brief introduction to Boolean logic networks. Section III formally presents the technology mapping problem, followed by an outline of the solution approaches. Section IV describes the solution for the area-constrained technology mapping. Section V presents a technology mapping solution for fast mapping by exploiting inherent crossbar parallelism. Benchmarking results are presented in Section VI. Section VII concludes the paper.
II. PRELIMINARIES
In this section, we present the details of logic operations using ReRAM crossbar arrays, followed by ReVAMP -a ReRAM based general purpose computing architecture. We also summarily present the details of Boolean logic networks which will be used for technology mapping.
A. Logic in memory operations using ReRAM crossbar arrays
The ReRAM device model proposed in [37] was fitted to a Pt/(11 nm)TaO$_x$/Ta cell. The selector device used is the Pt/TaO$_x$/TiO$_2$/TaO$_x$/Pt crested barrier device proposed in [38], [39]. Both devices were implemented in Verilog-A and simulated using Cadence Spectre. The ReRAM model considers a filamentary region in which the switching takes place by a redistribution of ionic defects, i.e., oxygen vacancies. The filament is modeled by three lumped circuit elements: a Schottky-type diode representing the current flow through the Pt/TaO$_x$ interface, a disc resistance describing the region close to the Schottky-type interface, and a resistance that comprises the plug resistance describing the remaining part of the filament and the resistance of the electrodes. The state variable of the resistive switching model is the oxygen vacancy concentration N close to the active electrode interface, which modulates the disc resistance and the electron transport through the Schottky-type diode.
For logic operations, each ReRAM device can be interpreted as a finite-state machine (FSM), as shown in Fig. 1. Each device has two input terminals: the wordline wl and the bitline bl. The internal resistive state Z of the ReRAM acts as a third input and as the stored bit. If the state Z is in the High Resistive State (HRS), it is interpreted as logic 0, while the Low Resistive State (LRS) is interpreted as logic 1. As shown in the following equation, the next state of the device $Z_n$ is expressed as a 3-input majority function, with the bitline input inverted.

$$Z_n = M_3(Z, wl, \overline{bl}) = Z \cdot wl + Z \cdot \overline{bl} + wl \cdot \overline{bl}$$
This forms the fundamental logic operation that can be realized using ReRAM devices. The inversion operation is equivalent to using the intrinsic function Z n with one input (wordline or state) as 0, the second input (state or wordline) at 1 and the bitline input as the variable to be inverted.
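The intrinsic operation and the inversion described above can be checked with a minimal Python sketch. The function names `m3`, `next_state` and `invert` are illustrative and not part of the ReVAMP tooling; the argument order (state, wordline, bitline) is an assumption made here for readability.

```python
def m3(a: int, b: int, c: int) -> int:
    """3-input Boolean majority of single-bit values."""
    return (a & b) | (a & c) | (b & c)


def next_state(z: int, wl: int, bl: int) -> int:
    """Next device state: majority of state, wordline and INVERTED bitline."""
    return m3(z, wl, 1 - bl)


def invert(v: int) -> int:
    """Inversion: state 0, wordline 1, variable applied on the bitline."""
    return next_state(0, 1, v)


if __name__ == "__main__":
    for v in (0, 1):
        # Swapping the roles of state and wordline gives the same result.
        print(v, invert(v), next_state(1, 0, v))
```

With one of the state or wordline inputs fixed at 0 and the other at 1, the majority reduces to the inverted bitline value, which is the inversion stated in the text.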
Since majority and inversion operations form a functionally complete set, any Boolean function can be realized using only $Z_n$ operations.

A ReRAM crossbar memory consists of multiple 1S1R ReRAM devices arranged in the form of a crossbar array [40]. Multiple devices share wordlines and bitlines. Fig. 2 shows a ReRAM crossbar array with 6 devices arranged in a 2 × 3 configuration, i.e., 2 wordlines and 3 bitlines. The internal state of device $D_{ij}$ at wordline i and bitline j is referred to as $S_{ij}$. The devices $D_{00}$, $D_{01}$ and $D_{02}$ share wordline 0, whereas the devices $D_{10}$, $D_{11}$ and $D_{12}$ share wordline 1. Similarly, the devices $D_{00}$ and $D_{10}$ share bitline 0, and so on. Like conventional RAM arrays, ReRAM memories are accessed as words. It should be noted that all the devices in a word share a common wordline. For example, word 0 has devices $D_{00}$, $D_{01}$ and $D_{02}$.

The ReRAM array is programmed using a V/2 scheme, with V = 4.8 V. Logic 1 and 0 are realized by voltage pulses of 2.4 V and −2.4 V respectively. Unselected lines are kept grounded. In a readout phase, the presence of a current greater than 4 µA implies logic 1, while its absence implies logic 0. Fig. 3 shows the Cadence simulation for a single device. In cycle t1, 0 and 1 are applied to the wordline and bitline respectively, to set the logic state to 0 (HRS). In cycle t2, the device state is read out. The read-out current is less than 4 µA, confirming that the device is in logic state 0. In the next cycle t3, 1 and 0 are applied to the wordline and bitline respectively, to set the logic state to 1 (LRS). In t4, the device is read out and the read-out current is greater than 4 µA, indicating logic state 1.
B. ReVAMP architecture
We briefly present the ReRAM based VLIW Architecture for in-Memory comPuting (ReVAMP), depicted in Fig. 4. The architecture uses two ReRAM crossbar memories: the Instruction Memory (IM) and the Data Storage and Computation Memory (DCM). The IM is a regular instruction memory accessed using the program counter (PC). The DCM hosts data and in-memory computation. All the devices in a single word of the DCM can be operated in parallel, with each operation being the intrinsic $Z_n$ function. Since multiple $Z_n$ operations operate in parallel, the proposed architecture is VLIW in nature. Splitting the instruction and data memory allows a reduction in overall execution time, by pipelining instruction fetch and computation. The ReVAMP architecture is parameterized as shown in Table I, and can be configured as necessary.
The ReVAMP architecture has a three-stage pipeline with instruction fetch (IF), instruction decode (ID) and execute (EX) stages. In the IF stage, the instruction at the address held by the program counter (PC) is fetched from the IM and loaded into the instruction register (IR) before the PC is updated. In the ID stage, the instruction is read from IR to determine the control inputs for the source select multiplexer, the crossbar interconnect and the write circuit.
The data memory register (DMR) stores the data read out from the DCM. The primary input register (PIR) buffers the primary input data. Both DMR and PIR are $w_D$ bits wide. Depending on the control input $M_c$, the source select multiplexer selects either the DMR or the PIR as the data source. Thereafter, the crossbar interconnect is used to generate the wordline input and $w_D$ bitline inputs by appropriate permutation of the input data, as per the control signals stored in $C_c$. The crossbar interconnect is basically a set of multiplexers, one per output, each of which selects one of the $w_D$ input bits. The write circuit reads the value of the target wordline from the register $W_c$ and the output of the crossbar interconnect to determine and apply the inputs to the row and column decoders of the DCM.

ReVAMP Instruction Set: The ReVAMP architecture supports two instructions, Read and Apply, in the formats shown in Fig. 5. The Read instruction reads the word at the address wl from the DCM and stores it in the DMR. Once available in the DMR, this word can be used as input by the following instructions.
The Apply instruction is used for computation in the DCM. The address w specifies the word in the DCM that will be computed upon. A bit flag s chooses whether the inputs will be from the primary input register (PIR) or the DMR. A two-bit flag ws specifies the wordline input: 00 selects logic 0, 01 selects logic 1, 11 selects the input specified by the wb field, and 10 is not a valid input. The wb bit-vector is used to specify the bit within the chosen data source for use as wordline input. (v, val) pairs are used to specify bitline inputs. The bit flag v indicates if the input is NOP or a valid input. Similar to wb, the bit-vector val specifies the bit within the chosen data source for use as bitline input. In each instruction, one bit is required to specify the opcode, and $\log_2(S_D)$ bits are required to select the word. One bit is required for the s flag and two bits are required for the wordline source select flag ws. Each (v, val) pair requires one bit for the v flag and $\log_2(w_D)$ bits for specifying the bit in the selected input source. The field wb also requires $\log_2(w_D)$ bits. Thus, with one (v, val) pair per bitline, the instruction lengths are

$$IL_{Read} = 1 + \log_2(S_D)$$

$$IL_{Apply} = 1 + \log_2(S_D) + 1 + 2 + \log_2(w_D) + w_D \cdot (1 + \log_2(w_D))$$
The word length w I of the IM should be greater than or equal to max(IL Read , IL Apply ).
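The field widths listed above can be tallied with a short Python sketch. This is an illustration under two stated assumptions that are not spelled out in the text: there is one (v, val) pair per bitline, and address widths are rounded up when $S_D$ or $w_D$ is not a power of two. The function names are hypothetical.

```python
from math import ceil, log2


def read_length(s_d: int) -> int:
    """Bits in a Read instruction: opcode + word address."""
    return 1 + ceil(log2(s_d))


def apply_length(s_d: int, w_d: int) -> int:
    """Bits in an Apply instruction, field by field."""
    addr = ceil(log2(s_d))   # word select
    sel = ceil(log2(w_d))    # bit select within PIR/DMR
    opcode, s_flag, ws_flag = 1, 1, 2
    wb = sel                 # wordline bit select
    pairs = w_d * (1 + sel)  # ASSUMPTION: one (v, val) pair per bitline
    return opcode + addr + s_flag + ws_flag + wb + pairs


def min_im_word_length(s_d: int, w_d: int) -> int:
    """Smallest IM word length that fits both instruction formats."""
    return max(read_length(s_d), apply_length(s_d, w_d))


if __name__ == "__main__":
    print(min_im_word_length(16, 8))
```

For a DCM with 16 words of 8 bits, this sketch gives a 5-bit Read and a 43-bit Apply, so the Apply format dictates the IM word length.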
We demonstrate the working of the ReVAMP architecture. Let us consider a 3×2 crossbar as the DCM for realizing a two-bit XOR function for operands $p_1 p_0$ and $q_1 q_0$. To compute the XOR, we use the following equation:

$$p_i \oplus q_i = p_i \cdot \overline{q_i} + \overline{p_i + \overline{q_i}}$$

Fig. 6a shows the sequence of operations performed to realize a 2-bit XOR function, and the steps are described below.
• Step 1: Inputs $p_0$ and $p_1$ are loaded to wordline 0 in inverted form via the PIR, since $M_3(0, 1, \overline{p_i}) = \overline{p_i}$.
• Step 2: Wordline 0 is read out using a Read instruction. The read-out values $\overline{p_0}$ and $\overline{p_1}$ are stored in the DMR.
• Step 3-4: The read-out value is loaded to wordlines 1 and 2 using two Apply instructions via the bitlines, as $M_3(0, 1, \overline{\overline{p_i}}) = p_i$.
• Step 5: Inputs $q_0$ and $q_1$ are ANDed with the values in wordline 2 in inverted form by using 0 as wordline input, giving $M_3(p_i, 0, \overline{q_i}) = p_i \cdot \overline{q_i}$.
• Step 6: Inputs $q_0$ and $q_1$ are ORed with the values in wordline 1 in inverted form by using 1 as wordline input, giving $M_3(p_i, 1, \overline{q_i}) = p_i + \overline{q_i}$.
• Step 7: The ORed values available in wordline 1 are read out using a Read instruction.
• Step 8: The values in the DMR are ORed, in inverted form, with the contents of wordline 2 to complete the XOR operation, as $p_i \cdot \overline{q_i} + \overline{p_i + \overline{q_i}}$.
The set of instructions corresponding to the steps is shown in Fig. 6b. This concludes the description of the ReVAMP architecture. In the following subsection, we briefly describe the structural representation of Boolean functions.
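The eight steps can be replayed on a small software model of the DCM. The following Python sketch is a behavioural illustration only, not the ReVAMP simulator: `apply` and `read` are hypothetical helpers mimicking the Apply and Read instructions, all devices are assumed to start in state 0, and the wordline input for Steps 3-4 is assumed to be logic 1.

```python
def m3(a, b, c):
    """3-input Boolean majority."""
    return (a & b) | (a & c) | (b & c)


def apply(dcm, word, wl, bitlines):
    """Apply: every device in `word` becomes M3(state, wl, NOT bitline).
    A bitline value of None models a NOP for that device."""
    for j, bl in enumerate(bitlines):
        if bl is not None:
            dcm[word][j] = m3(dcm[word][j], wl, 1 - bl)


def read(dcm, word):
    """Read: copy a word into the DMR (non-destructive)."""
    return list(dcm[word])


def xor2(p, q):
    """2-bit XOR of p = [p0, p1] and q = [q0, q1] on a 3x2 crossbar."""
    dcm = [[0, 0] for _ in range(3)]  # ASSUMPTION: all devices start at 0
    apply(dcm, 0, 1, p)               # Step 1: wordline 0 holds NOT p
    dmr = read(dcm, 0)                # Step 2
    apply(dcm, 1, 1, dmr)             # Step 3: wordline 1 holds p
    apply(dcm, 2, 1, dmr)             # Step 4: wordline 2 holds p
    apply(dcm, 2, 0, q)               # Step 5: p AND NOT q
    apply(dcm, 1, 1, q)               # Step 6: p OR NOT q
    dmr = read(dcm, 1)                # Step 7
    apply(dcm, 2, 1, dmr)             # Step 8: OR with NOT(p OR NOT q)
    return read(dcm, 2)


if __name__ == "__main__":
    print(xor2([0, 1], [1, 1]))
```

The sequence uses six Apply and two Read instructions, and both bit positions are processed in parallel because they share a wordline in every step.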
C. Logic representation
For representation of Boolean functions, we use two structural representations, namely And Inverter Graph (AIG) [41] and Majority Inverter Graph (MIG) [42]. An AIG (MIG) is a directed acyclic graph where each node is 2-input (3-input), representing Boolean AND (Boolean Majority). A directed edge i → j exists if the output of the (parent) node i is an input to the (child) node j. Each edge is marked as either regular or inverted. A Primary Input (PI) node is either a logic constant 0/1 or a Boolean variable. If a node is not a PI, then it is an internal node. A Primary Output (PO) node represents the output of the function. An AIG (MIG) can have one or more PO nodes. We define the level of a node n as follows.
Definition 1. The level of a node n, written as level(n), is defined as the length of the longest path from any PI node to the node n. The level of the PI nodes is zero.
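Definition 1 translates directly into a memoized traversal. The sketch below is a minimal Python illustration; the dictionary-based graph encoding and the node names are hypothetical and are not taken from Fig. 7.

```python
def node_levels(fanins):
    """Return {node: level}; `fanins` maps each node to its parent nodes.
    Primary inputs have no parents and therefore level 0."""
    levels = {}

    def level(n):
        if n not in levels:
            parents = fanins.get(n, [])
            # Longest path from any PI: one more than the deepest parent.
            levels[n] = 0 if not parents else 1 + max(level(p) for p in parents)
        return levels[n]

    for n in fanins:
        level(n)
    return levels


if __name__ == "__main__":
    # Hypothetical AIG: n1 = a AND b, n2 = n1 AND c, n3 = n2 AND a
    aig = {"a": [], "b": [], "c": [], "n1": ["a", "b"],
           "n2": ["n1", "c"], "n3": ["n2", "a"]}
    print(node_levels(aig))
```

Note that n3 has level 3 even though one of its parents is a primary input, because the level follows the longest path.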
Example 1. Fig. 7a and Fig. 7b show an AIG and an MIG respectively. In both graphs, the primary inputs ($a_0$, $a_1$, a, b, ...) are shown in square boxes and the internal nodes (n1, n2, $S_1$, $S_2$, ...) are shown in circles. The output nodes (n5, $S_4$) are shown in double-lined circles. The inverted edges (n2 → n4, $S_2$ → $S_3$, ...) are marked using dots.
III. PROBLEM DEFINITION AND SOLUTION
In this section, we present the technology mapping problem for the ReVAMP architecture along with overview of the proposed solutions.
A. Problem definition
Area-constrained technology mapping: Given a Boolean function represented as a Boolean logic network G and crossbar dimension $S_D \times w_D$, determine a sequence of instructions $I_1, I_2, \ldots, I_T$, with $I_t \in$ {Read, Apply} and $1 \le t \le T$, and PIR inputs for the ReVAMP architecture that computes the output nodes of the network G.

Delay-focused technology mapping: Given a Boolean function represented as a Boolean logic network G and crossbar width $w_D$, determine a sequence of instructions $I_1, I_2, \ldots, I_T$, with $I_t \in$ {Read, Apply} and $1 \le t \le T$, PIR inputs and the number of words $S_D$ for the ReVAMP architecture that computes the output nodes of the network G.
The quality of the solution is measured in terms of the delay and the total number of devices required for the mapping. The delay of a solution is equal to the number of instructions (T ).
The total number of devices is equal to $S_D \times w_D$.

In this paper, we propose two different approaches to the problem. Fig. 8 shows the overall flowchart for the technology mapping problem. In the first approach, we consider the area-constrained version of the problem, where we represent the Boolean function as an AIG. We begin by partitioning the AIG into k-input Look-up Tables (LUTs). A k-input LUT is basically a function with at most k inputs and a single output. Once the graph has been partitioned, the LUTs are scheduled for computation in topological order, i.e., the LUTs close to the primary inputs are scheduled first and so on, until the output LUTs are computed. To compute a LUT, we express its functionality as an Exclusive Sum-Of-Products (ESOP) [43]. Any arbitrary ESOP can be computed on the DCM with at least 3 wordlines and 2 bitlines (explained in detail in Theorem IV.2): the variables that have to be used in inverted form are negated first (to be applied via bitlines), followed by computing the product terms and finally XORing them. To reduce the delay, the AND computation for realizing the ESOP needs to minimize the number of reads performed and maximize the number of AND operations that can be done in parallel. Thereafter, we perform the XOR of the computed AND terms by means of an XOR reduction tree of logarithmic depth in the number of AND terms.
In the second approach, we focus on minimizing the delay of the mapping, without any constraints on the number of words. We use MIGs for logic representation in this approach. We propose an algorithm with four phases -assignment of nodes as host or input for computation, grouping nodes to blocks, packing blocks to words followed by generation and scheduling of instructions. We explain both the technology mapping solutions in detail in the following sections.
IV. AREA-CONSTRAINED TECHNOLOGY MAPPING
In this section, we establish a bound on the number of devices required to map any arbitrary AIG (MIG). Thereafter, we present a scalable technique for area-constrained technology mapping.
Theorem IV.1. Any AIG or MIG with k levels can be mapped using 2(k + 1) devices, arranged as a crossbar with at least two bitlines.
Proof: Since any AIG can be expressed as MIG, we prove the theorem for MIG by means of an inductive proof. Before explaining the proof, we describe a transformation to the input MIG and prove the theorem on the transformed MIG. We transform the MIG such that
• Each internal node has a single child. Nodes with multiple fanout can be replicated bottom-up i.e., from the output to the primary inputs.
• Each node has two non-inverted inputs and a single inverted input. This can be realized by propagating the inverters across nodes or by creating an inverted copy of the node as required, using the following axiom for Boolean majority: $\overline{M_3(x, y, z)} = M_3(\overline{x}, \overline{y}, \overline{z})$.
Now, we present the inductive proof for the transformed MIG. The device at wordline 0, bitline 0 is used for inverting any input v as needed, by applying the input via the bitline with 1 as wordline input and 0 as internal state, i.e., $M_3(0, 1, \overline{v}) = \overline{v}$. The inverted value $\overline{v}$ can be read out in the next cycle and used in the following cycles using Apply instructions. Any device is reset, i.e., its internal state Z is set to logic 0, by applying 0 and 1 as wordline and bitline input respectively.
Base Case: An MIG with 1 level simply implies that the inputs act as outputs and hence does not require any devices for computation. Therefore, we consider the MIG in Fig. 9a with 2 levels as the base case. One of the non-inverted inputs W and the inverted input B can be loaded to wordline 1. The second non-inverted input H is loaded to wordline 0. Wordline 1 can be read out and, in the next cycle, W and B are applied as wordline and bitline inputs of the device holding H to compute S.

Inductive Case: Let us assume that the theorem holds true for an MIG with k levels. Now, consider an MIG with k + 1 levels, as shown in Fig. 9b. The subtrees $MIG_{wk}$, $MIG_{bk}$ and $MIG_{hk}$ have k levels. Therefore, these MIGs can be computed using 2(k + 1) devices. Let the subtree $MIG_{wk}$ be computed on wordlines 1 to (k + 1) and the result $W_k$ be stored at wordline 1, bitline 1. All the devices, except the device holding $W_k$, are reset. Similarly, subtree $MIG_{bk}$ is computed on wordlines 1 to k + 1 and the result $B_k$ is stored at wordline 1, bitline 0, followed by a reset of all the devices except those on wordline 1. The last subtree $MIG_{hk}$ is computed using wordlines 0 and 2 to k + 1, with the result $H_k$ stored at wordline 0 and bitline 1. Therefore, to compute the final output $T_{k+1}$, wordline 1 is read out and then $W_k$ and $B_k$ are applied to the wordline and bitline of the device holding $H_k$ to compute $T_{k+1}$. This completes the proof.

We represent the Boolean function as an AIG. We partition the graph into k-input LUTs using ABC [44]. From here on, we refer to the partitioned graph as the LUT graph, and each node in the partitioned graph represents a LUT.
Example 2. For k = 4, the AIG in Fig. 7a can be partitioned into two LUTs, as shown by dotted lines.
Bound on number of devices required : To determine the bound on number of devices required for the storage of intermediate results, we define transient node.
Definition 2. In a LUT graph, a node n is termed a transient node in level l if level(n) < l and there exists an edge n → n' such that level(n') > l.
Example 3. In Fig. 10 , LUT L2 in level l − 1 has an edge to LUT N 1 in level l + 1, therefore it is a transient node for level l.
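Definition 2 can be expressed as a one-line filter over the edges of the LUT graph. The Python sketch below is illustrative; the edge-list encoding, the level dictionary and the node names are hypothetical and only loosely mirror the situation described in Example 3.

```python
def transient_nodes(levels, edges, l):
    """Nodes below level l that feed at least one node above level l."""
    return {n for (n, m) in edges if levels[n] < l and levels[m] > l}


if __name__ == "__main__":
    # Hypothetical LUT graph with l = 2: L2 sits in level 1 (l - 1),
    # M1 in level 2 (l) and N1 in level 3 (l + 1).
    levels = {"L1": 1, "L2": 1, "M1": 2, "N1": 3}
    edges = [("L1", "M1"), ("L2", "M1"), ("L2", "N1"), ("M1", "N1")]
    print(transient_nodes(levels, edges, 2))
```

Here only L2 is transient for level 2: its output must be kept in the crossbar while level 2 is computed, because a node in level 3 still depends on it.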
Let the number of nodes, including transient nodes in a level l be N l . We can schedule the nodes of the LUT Graph in topological ordering, i.e all nodes at level l−1 are scheduled before any node in level l is scheduled. A node in level l is dependent only on the nodes (including transient nodes) that are present in level l − 1. Therefore, once all the nodes in level l have been scheduled, the nodes in level l − 1 can be 
The memory layout of the crossbar, with $S_D$ (= t + 3) wordlines and $w_D$ bitlines, is shown in Fig. 11. The top t wordlines are used for storing the output of each LUT. The bottom three wordlines $e_0$, $e_1$ and $e_2$ are reserved for the computation of each LUT. For the scheduling to be feasible, MinDev should be less than or equal to ($t \times w_D$). If the scheduling condition is not satisfied, a different value of k is used to partition the graph and feasibility is checked again. Once the scheduling condition is satisfied for a given crossbar size, nodes are scheduled in topological order. The device where the output of a LUT (a node in the LUT graph) is stored is determined according to the best-fit method. The wordline in the crossbar with the minimum number of free devices is chosen if the number of nodes to schedule is less than or equal to the number of free devices in that wordline. If no such wordline exists, a wordline with the maximum number of free devices is chosen iteratively, until all the nodes have been allocated a device. A device storing a node n is marked dirty if all the successors of n have already been allocated. If none of the devices are free, then the wordline with the maximum number of dirty bits is reset and allocation starts. This process is repeated until all the nodes have been scheduled, along with target device allocation. The overall technique is shown in Algorithm 1.
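The best-fit choice of wordline described above can be sketched in a few lines of Python. This is a simplified illustration, not Algorithm 1: it tracks only the count of free devices per wordline, breaks ties by the lowest wordline index (an assumption), and raises an error where the full method would reset the wordline with the most dirty devices. The function name `best_fit` is hypothetical.

```python
def best_fit(free, count):
    """Allocate `count` nodes; `free[w]` is the number of free devices on
    wordline w and is updated in place. Returns [(wordline, allocated)]."""
    placement = []
    remaining = count
    while remaining > 0:
        fitting = [w for w, f in enumerate(free) if f >= remaining]
        if fitting:
            # Best fit: the tightest wordline that still holds all nodes.
            w = min(fitting, key=lambda i: free[i])
        else:
            # Otherwise fill the emptiest wordline and continue.
            w = max(range(len(free)), key=lambda i: free[i])
            if free[w] == 0:
                raise RuntimeError("crossbar full: a dirty wordline must be reset")
        take = min(free[w], remaining)
        free[w] -= take
        remaining -= take
        placement.append((w, take))
    return placement


if __name__ == "__main__":
    free = [4, 4]                 # two storage wordlines, four devices each
    for level_size in (2, 1, 3, 1):
        print(best_fit(free, level_size), free)
```

Preferring the tightest wordline keeps the nodes of one level together, which matters because only one word can be read out per Read instruction.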
Example 4. We explain the device allocation and scheduling technique presented in Algorithm 1 using a representative LUT graph with LUT nodes ($n_1, \ldots, n_7$), shown in Fig. 12. The output of LUT $n_7$ is the output of the LUT graph. The nodes are scheduled in topological order. Nodes in level 1, $n_1$ and $n_2$, are allocated to wordline 5, as shown in Fig. 13. Node $n_3$, in level 2, is assigned another device in wordline 5, using the best-fit allocation strategy. Since the only successor of $n_1$ has been allocated, the device allocated to $n_1$ is now marked dirty. In level 3, there are 3 nodes ($n_4$, $n_5$ and $n_6$). Since there is only a single device free in wordline 5, it is not possible to allocate these nodes together there. Therefore, these nodes are allocated to wordline 4. All the successors of node $n_2$ have been allocated, hence the corresponding device is marked as dirty. Finally, the node $n_7$ in level 4 is allocated to the free device in wordline 5. This completes the allocation and scheduling of the LUT nodes.
B. ESOP computation
Each function realized by the LUT can be expressed as an Exclusive Sum-Of-Product (ESOP). For many Boolean functions, minimal ESOPs have lesser number of cubes compared to Sum-Of-Products [45] . In addition, there are multiple ESOP minimizers available which can be used to reduce the ESOP size [46] , [47] , [48] , [49] . Before presenting the ESOP computation algorithm on ReVAMP, we present a brief description of the related terms.
Definition 3.
A literal is a Boolean variable either in inverted or non-inverted form.
Definition 4.
A cube is a product term composed of literals using Boolean AND.
Example 5. The ESOP $\overline{a}bc \oplus abc$ has two cubes, $\overline{a}bc$ and $abc$. The cube $\overline{a}bc$ has literal a in inverted form and b, c in non-inverted form.

If the ESOP has more cubes, the next cube $c_3$ would be computed at wordline $e_2$ and bitline $b_0$ and the XOR would be computed for $c_3$ and $c_2 \oplus c_1$, followed by reset. This process is repeated until the entire ESOP has been computed.

Theorem IV.2. Any Boolean function, expressed as an ESOP, can be computed using three wordlines and at least two bitlines.
Proof: We present a constructive proof for the theorem. Let us consider three wordlines, $e_0$, $e_1$ and $e_2$, with bitlines $b_0$ and $b_1$. We consider two cases.

Case 1: The ESOP has a single cube, say $l_1 \cdot l_2 \cdots l_n$. If a literal $l_i$ is inverted, it is applied via bitline $b_0$ with '0' as input to wordline $e_2$. Else, the literal is applied via bitline $b_0$ and '1' as input to wordline $e_0$ to store in non-inverted form. Then, wordline $e_0$ is read out and $l_i$ is applied via the bitline with '0' as wordline input to wordline $e_2$. The wordline $e_0$ is reset. The process is repeated until all the literals have been ANDed and the computed cube is available at wordline $e_2$ and bitline $b_0$.

Case 2: The ESOP has more than one cube, say $c_1, c_2, \ldots, c_m$. The cube $c_1$ can be computed as stated in Case 1. Similarly, $c_2$ can be computed at wordline $e_2$ and bitline $b_1$ by applying the bitline inputs via bitline $b_1$. The cubes $c_1$ and $c_2$ can be XORed as shown in Fig. 14, with the result stored at wordline $e_2$ and bitline $b_1$. The rest of the devices are reset to 0 by using 0 as wordline input and 1 as bitline input. Now, the third cube $c_3$ can be computed using steps identical to Case 1, and the XOR can be performed with the result $c_1 \oplus c_2$. This process can be repeated until the entire ESOP has been computed.
Theorem IV.2 guarantees that any ESOP can be computed in a crossbar with three wordlines and two bitlines. If the number of bitlines is greater, it is possible to reduce the delay by parallelizing operations. Boolean AND of two literals a and b can be expressed as $M_3(a, 0, b)$. Logic 0 can therefore be used as a common wordline input, which makes the computation of cubes in parallel feasible. Fig. 15 shows the computation of the cubes of an ESOP. Due to the crossbar constraints, all the bitline-applied literals must be available simultaneously via either the PIR or the DMR. This implies that, for parallel computation of the cubes, all the applied literals either have to be primary inputs or must reside on the same wordline. Once the cubes have been computed, they have to be XORed. Each XOR can be performed using steps similar to the example shown in Fig. 6a. Multiple XORs can be performed by means of an XOR reduction tree with logarithmic depth in the number of terms to be XORed. In Fig. 16, there are four terms $x_i$ to be XORed. The XOR of $x_1$ and $x_2$ can proceed in parallel with the XOR of $x_3$ and $x_4$. Thereafter, the results $x_{12}$ and $x_{34}$ are XORed. It might happen that the number of cubes in an ESOP is greater than the number of available bitlines in the crossbar. In that case, the computation of the cubes, followed by XOR reduction, has to be iterated. The technique for ESOP computation is presented in Algorithm 2. Once the ESOP has been evaluated, the result is written back to the position in the working area determined by the scheduling algorithm.

Discussion: The proposed approach provides a novel solution to the area-constrained technology mapping problem. The target Boolean function is represented as an AIG, followed by partitioning into k-input LUTs and finally scheduling and computing these LUTs on the crossbar. The approach allows a feasible mapping for a variety of crossbar sizes, with some portion of the crossbar reserved for computation of ESOPs.
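The cube computation followed by a logarithmic-depth XOR reduction can be illustrated functionally in Python. The sketch below is not Algorithm 2 and does not model wordlines or bitlines: it only evaluates an ESOP and counts the levels of the reduction tree. The cube encoding as (variable, inverted) pairs and all function names are assumptions made for illustration.

```python
def eval_cube(cube, assignment):
    """AND of literals; each literal is (variable name, inverted flag)."""
    value = 1
    for var, inverted in cube:
        value &= assignment[var] ^ int(inverted)
    return value


def xor_reduce(terms):
    """Pairwise XOR reduction; returns (result, tree depth)."""
    if not terms:
        return 0, 0
    depth = 0
    while len(terms) > 1:
        # XORs within one level are independent and could run in parallel.
        nxt = [terms[i] ^ terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])  # odd term passes through to the next level
        terms = nxt
        depth += 1
    return terms[0], depth


def eval_esop(cubes, assignment):
    """Evaluate all cubes, then combine them with the XOR reduction tree."""
    return xor_reduce([eval_cube(c, assignment) for c in cubes])[0]


if __name__ == "__main__":
    # Hypothetical ESOP: (NOT a AND b) XOR (a AND c)
    esop = [[("a", True), ("b", False)], [("a", False), ("c", False)]]
    print(eval_esop(esop, {"a": 0, "b": 1, "c": 0}))
    print(xor_reduce([1, 0, 1, 1]))
```

Four terms need two reduction levels, matching the description of Fig. 16, while five terms need three.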
Instead of using AIGs for representing the functions, it is also feasible to represent the function using Majority Inverter Graph (MIG). The native function realized by ReRAM devices is Boolean Majority three (M 3 ) with an input inverted. Therefore, MIGs have been used heavily in synthesis [50] , [51] and technology mapping [30] flows for ReRAM crossbar array. In the next section, we discuss another approach to the technology mapping problem using MIGs for logic representation and constrained by only the word length of the DCM with focus on reducing the delay of mapping. 
V. DELAY-CONSTRAINED TECHNOLOGY MAPPING
In this section, we present a method to generate instructions for the ReVAMP architecture that is focused on reducing the delay of the mapping, without constraints on the number of words required for mapping. In this method, we still consider the constraint on word length $w_D$ during mapping.
A. Assign Host and Inputs to Nodes
A ReRAM device has an internal state Z and two inputlines, the wordline and the bitline. A computation on the device updates its internal state Z, in effect making the device the host of the computation. For each internal node in an MIG, one of its parents hosts the computation and the remaining parents act as wordline and bitline inputs. The computation of multiple independent nodes can be grouped into a single Apply instruction if they have a common wordline input. Based on this, we present a few rules to assign the host and the inputs of the nodes of an MIG.
• If a node has multiple children in the same level, then it can be used as common wordline input for computing those nodes. For instance, in Fig. 7b , input b can be used as common wordline input to compute S 1 and S 2 .
• If an incoming edge to a node is marked inverted, then the corresponding parent can be used as the bitline input. In Fig. 7b, c and S 2 are used as bitline inputs to compute S 1 and S 3 , respectively.
• If there are no inverted incoming edges to a node, then a negated parent is used as input to that node. For node S 2 in Fig. 7b , input c is used as bitline input.
• The remaining parent is used as host for the node. The nodes a and S 1 act as host to compute S 1 and S 3 respectively in Fig. 7b .
These rules ensure that nodes with common inputs can share wordline inputs, which is exploited when scheduling the computation. We mark these assignments on the edges of the MIG, as shown in Fig. 7b.
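The four rules above can be sketched as a small role-assignment function. This is an illustrative reading of the rules, not the authors' implementation: the names `Roles` and `assign_roles`, the assumption of at most one inverted fan-in per node, and the tie-breaking (last remaining parent) are all hypothetical. The fan-ins used for S 1 and S 2 are inferred from the rule examples, and `common` stands for the parents already chosen as common wordline inputs of the level.

```python
from typing import List, NamedTuple, Set, Tuple


class Roles(NamedTuple):
    host: str
    wordline: str
    bitline: str
    needs_negated_copy: bool  # bitline operand must first be stored negated


def assign_roles(fanins: List[Tuple[str, bool]], common: Set[str]) -> Roles:
    """fanins: three (parent, edge_is_inverted) pairs of one MIG node.

    Assumes at most one inverted fan-in; ties are broken arbitrarily.
    """
    rest = list(fanins)
    # Inverted-edge rule: the parent on an inverted edge is the bitline input.
    bitline = next((f for f in rest if f[1]), None)
    if bitline is not None:
        rest.remove(bitline)
    # Common-input rule: a parent shared within the level is the wordline input.
    # Fallback (assumption): take the last remaining parent.
    wordline = next((f for f in rest if f[0] in common), rest[-1])
    rest.remove(wordline)
    # No inverted edge: one remaining parent is applied in negated form.
    needs_copy = bitline is None
    if bitline is None:
        bitline = rest.pop()
    # The remaining parent hosts the computation.
    host = rest[0]
    return Roles(host[0], wordline[0], bitline[0], needs_copy)


s1 = assign_roles([("a", False), ("b", False), ("c", True)], {"b"})
s2 = assign_roles([("a", False), ("b", False), ("c", False)], {"b"})
```

For these assumed fan-ins, both nodes get host a, wordline input b and bitline input c, with S 2 additionally requiring a negated copy of c.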
B. Group Nodes to Blocks
To compute an internal node in an MIG, we need to read out the wordline and bitline inputs of the node and then apply these inputs to the host. Given that only a single word can be read out in a clock cycle, the wordline and bitline inputs of the node must reside on the same wordline to allow efficient computation of the node. This creates a constraint that, for each node in an MIG, the wordline and the bitline inputs should be placed in the same word. We call this grouping a block. Further, as read-outs are non-destructive, blocks can be merged if they have common inputlines. This reduces the number of devices required, with the merged block having only one copy of the common inputline. Note that blocks can be merged only if the number of inputs in the resultant block does not exceed the word length.
Also, a pair of blocks in the same level whose hosts share a wordline input should be merged. This host-based merge, along with the merge of the corresponding blocks holding the inputlines of these hosts, permits computation of the nodes in the same level with a shared wordline in a single cycle, thereby reducing delay. The block formation procedure is shown in Algorithm 3. Lines 2-5 create the blocks, considering the placement constraint on the inputlines of the output nodes. The addInversionBlock method adds the positive nodes as blocks to the blockList if the added blocks have inverted values. Only a single positive node is added to blockList, even if multiple copies of the negated node are required. The mergeBlock method merges blocks based on the inputline-based and host-based merge constraints. The replace method replaces a node in a block with its host node.
Example 6. For a word length (w D ) of 3, Table II shows the working of the block formation algorithm on the MIG of Fig. 7b . Starting at the output node, blockList has a single block. At level 3, node S 4 is replaced with its host and inputlines. Since these two blocks do not have any common inputlines or hosts, they cannot be merged. At level 2, node S 3 gets replaced and the inputlines are added to a new block. At level 1, nodes S 1 and S 2 are replaced by their hosts a, and the inputlines are inserted in two new blocks. Blocks 4 and 5 have a common inputline b and are hence merged. Blocks 2 and 4 have common inputs, but cannot be merged as the length (four) of the resultant block will exceed the given word length. Thereafter, since the two a host nodes have the same wordline, blocks 1 and 3 get merged, but both copies of the host are retained, using the host-merge constraint.
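The inputline-based merge condition can be sketched as follows. This covers only the rule that two blocks with a common inputline are merged when the result fits in one word; the host-based merge and the level-by-level traversal of Algorithm 3 are omitted. The function name `merge_blocks`, the greedy pair order, and the block contents are hypothetical.

```python
from typing import Iterable, List, Set


def merge_blocks(blocks: Iterable[Set[str]], word_length: int) -> List[Set[str]]:
    """Greedily merge blocks that share an inputline and still fit in a word."""
    merged = [set(b) for b in blocks]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                shares_input = bool(merged[i] & merged[j])
                fits = len(merged[i] | merged[j]) <= word_length
                if shares_input and fits:
                    merged[i] |= merged[j]   # keep one copy of common inputs
                    del merged[j]
                    changed = True
                    break
            if changed:
                break                        # restart the scan after a merge
    return merged


# Hypothetical blocks for a word length of 3: the first two share "c" but
# would need four bits together, while the last two share "b" and fit.
result = merge_blocks([{"p", "q", "c"}, {"b", "c"}, {"b", "nc"}], word_length=3)
```

As in Example 6, a shared inputline alone is not sufficient for a merge: the merged block must also respect the word length.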
C. Pack Blocks in Words
At the end of block formation, we have blocks of elements, where the elements of each block have to be placed in the same wordline. The number of elements in each block is less than or equal to w D , the number of bits in a word. These blocks now have to be packed into the DCM using the minimum number of words. The problem can be formulated as a bin packing problem, as defined below. Consider each word in the DCM as a bin with capacity w D . Each block b i has a value v i , v i > 0. Each block must be assigned to a bin such that the total value of the blocks assigned to the bin is less than or equal to w D . The objective is to minimize the number of bins required to assign all the blocks without violating the capacity constraint.
We use a first-fit algorithm for packing. It provides a 2-factor approximation, i.e., the number of words required by the algorithm is at most twice the number of words required by the optimal solution.
Example 7. For the example MIG, each block determined by the block formation algorithm is placed in a separate wordline, as shown in Fig. 17(a).
D. Generation and Scheduling of Instructions
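A minimal sketch of first-fit packing of blocks into words is given below. The function name `first_fit` and the block sizes are hypothetical; each block is placed in the first word that still has enough free bits, and a new word is opened only when none fits.

```python
from typing import List, Tuple


def first_fit(block_sizes: List[int], word_length: int) -> Tuple[List[int], int]:
    """Return (word index per block, number of words used)."""
    free: List[int] = []          # remaining capacity of each opened word
    assignment: List[int] = []
    for size in block_sizes:
        if not 0 < size <= word_length:
            raise ValueError("block size must be in 1..word_length")
        for idx, capacity in enumerate(free):
            if size <= capacity:  # first word with enough room
                free[idx] -= size
                assignment.append(idx)
                break
        else:                     # no opened word fits: open a new one
            free.append(word_length - size)
            assignment.append(len(free) - 1)
    return assignment, len(free)


# Hypothetical blocks of sizes 3, 3 and 2 with a word length of 3:
# no two blocks fit together, so each occupies its own word.
placement, words_used = first_fit([3, 3, 2], word_length=3)
```

Sorting the blocks by decreasing size before packing (first-fit decreasing) is a common refinement, but it is not something the text above states was used.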
The primary inputs have to be loaded into the DCM before computation of the internal nodes of the MIG can begin. In each clock cycle, w D primary inputs can be read. The primary inputs are loaded via the bitline, and hence their inverted values are stored in a single clock cycle. To store non-inverted primary inputs, the primary inputs are first written to a wordline, which stores them in inverted form. Then, the inverted values are read out and applied via the bitline to store the non-inverted values in the required wordline. A single extra wordline is used for storing the intermediate inverted primary inputs, and this wordline is reset after each use.
All the nodes in level i are scheduled for computation before any node at level i + 1 is scheduled. The nodes in the same level can be scheduled in any order, as they do not have any data dependencies. Nodes in a level whose hosts are in the same block, and whose corresponding inputlines are also placed together in one block, are scheduled for computation together. Once all the nodes in a level have been computed, we determine whether any inverted copies of the nodes are required for the computation of nodes at a higher level. If inverted copies are needed, the node is read out and stored in inverted form in the required block by writing through the bitline. Each computation is expressed as an Apply instruction and each read operation is expressed as a Read instruction.
Example 8. Table III shows the sequence of instructions used to compute the example MIG, and Fig. 17 shows the changes in the DCM state on application of the Apply instructions. Note that the additional instructions needed to initialize the DCM are not shown. The inputs to compute nodes S 1 and S 2 are in word 0 and are read out by I 1 . The hosts of nodes S 1 and S 2 are in word 2, and therefore I 2 computes these nodes in word 2. The inputs to compute S 3 are in word 2, and are read out by I 3 . I 4 computes S 3 in host S 1 . Finally, to compute S 4 , I 5 reads out word 1 and I 6 applies the required inputs to S 3 .
Discussion: Even though the two approaches have been discussed with AIG and MIG as the input data structures, the data structures can be used interchangeably. To use an MIG in the area-constrained mapping approach, the MIG can be directly partitioned into LUTs and the rest of the mapping flow can be used. Similarly, an AIG can be converted to an MIG by introducing constant '0' as the third input to each node, and the rest of the delay-constrained mapping flow can be used. Due to the inherently sequential nature of the computation and the crossbar constraints, employing traditional synthesis optimization techniques, such as depth reduction, does not directly translate into lower delay after technology mapping. However, it is possible to make the synthesis optimization technique technology-aware to aid the technology mapping flow, as demonstrated recently by Bhattacharjee et al. [52] .
VI. EXPERIMENTAL RESULTS
We have implemented the proposed compilation flow for the ReVAMP architecture in Python. The algorithms were evaluated using the EPFL benchmarks 1 . For area-constrained mapping, we used ABC for generating the initial AIG and also for ESOP expansion [44] . Each run is limited to 2 hours, after which the program is terminated. Most of the mapping time is spent in ESOP expansion.
For all the EPFL benchmarks, Table IV presents the results of the area-constrained mapping for a varying number of LUT inputs k and a fixed crossbar dimension of 64×64. As k increases, the number of LUTs (#N LUT ) in the LUT graph reduces, along with the number of levels (#L). For the given crossbar dimensions, 61 words are available for storing the intermediate results and 3 are reserved for computing the ESOPs. Some of the benchmarks could not be mapped (marked by ××) due to violation of the feasibility criterion presented in Equation (7), i.e., Min Dev > 3904.
To analyze in detail the impact of increasing the number of LUT inputs (k) on delay and Min Dev , we consider four large benchmarks from the EPFL benchmark suite for a crossbar dimension of 64×64. The results are shown in Fig. 18. The effect of k on Min Dev depends on the benchmark itself. For example, as k increases, Min Dev decreases for the benchmark mul32 but increases for mem_ctrl at larger values of k, as evident from Fig. 18. The delay of the mapping (in terms of number of cycles #C) closely follows the trend of Min Dev , i.e., as Min Dev increases, the overall number of cycles required for mapping increases. This is because, as Min Dev increases, fewer crossbar devices can be reset at any given time, which reduces the parallelization of operations during the ESOP computation. For the benchmark mul32, notice the sharp rise in delay of mapping on changing k from 13 to 16. The number of cubes in the ESOP expressions for this benchmark increased consistently, which increased the computation time of the ESOP expressions and hence the overall delay of mapping. Also, for large values of k (k ≥ 28), the time required for each ESOP expansion increases considerably (> 2 hours), which leads to long execution times for mapping an entire benchmark.
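The threshold of 3904 devices follows directly from the word counts given above. The sketch below reproduces this arithmetic; the function names, the parameter `reserved_words`, and the reading of the feasibility criterion as "Min Dev must not exceed the devices available for intermediate results" are assumptions, since Equation (7) itself is not restated here.

```python
def storage_devices(num_words: int, word_length: int, reserved_words: int = 3) -> int:
    """Devices left for intermediate results after reserving words for ESOPs."""
    if num_words <= reserved_words:
        raise ValueError("crossbar has no words left for intermediate results")
    return (num_words - reserved_words) * word_length


def is_feasible(min_dev: int, num_words: int, word_length: int) -> bool:
    """Assumed form of the check: the mapping fits if min_dev <= capacity."""
    return min_dev <= storage_devices(num_words, word_length)


capacity_64x64 = storage_devices(64, 64)   # 61 words of 64 bits each
```

For a 64×64 crossbar this gives 61 × 64 = 3904 devices, matching the bound quoted for the unmapped benchmarks.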
We also analyze the impact of the crossbar dimension on the delay of mapping. Keeping the overall number of devices fixed at 4096, we vary the number of bitlines from 4 to 1024. The results are shown in Fig. 19. With an increase in the number of bitlines, the crossbar permits a greater number of parallel operations to be carried out in a word. This parallelism is harnessed by the ESOP computation technique, which leads to a reduction in the delay of mapping for the entire benchmark.
1 http://lsi.epfl.ch/benchmark
The results of delay-constrained mapping are presented in Table V for a word length (w D ) of 16 bits. For most of the benchmarks, the compilation time to generate the instructions was a few seconds, while for the larger benchmarks, the compilation process finished in under 20 minutes. The numbers of Apply and Read instructions are shown in columns I A and I R , respectively, while the total number of instructions is I Total . The number of blocks created by the mapping is #B. The total delay (#C) of the mapping solution is the number of cycles needed by the ReVAMP architecture to complete computation of the benchmark.
Gaillardon et al. [32] proposed the PLiM computer, which has a single instruction: RM 3 A, B, Z. Assuming 16-bit words, each instruction results in the following micro-operations on the memory array: Read @A (32 bits), Read @B (32 bits), Read @Z (32 bits), Read A (1 bit), Read B (1 bit), Write @Z (1 bit). This corresponds to 9 R/W cycles on the considered machine. Therefore, the minimum number of cycles D P * required by PLiM [32] to compute any MIG is 9#N, where #N is the number of nodes in the MIG. This delay does not include the additional delay required for computing the negated values of the nodes. For each benchmark, #C is significantly lower than the D P * achieved by the PLiM computer. This is a fair comparison, since the PLiM computer also uses a word length of 16 bits. The ReVAMP architecture outperforms the PLiM computer for the same 16-bit word by a factor of 4.38× on average and 9.5× at the maximum. For the ReVAMP architecture, on average, almost 30% of the computation time is spent in computing negated values of the nodes. Thus, the ReVAMP architecture would outperform the PLiM computer by an even larger margin if the actual number of cycles required by the PLiM computer for computation with negations were considered.
Synthesis techniques can be used to reduce the number of nodes in the MIG and thereby reduce the delay of executing a benchmark on the PLiM, as suggested in [33] , but similar techniques can also be used for optimizing the input data structures of the proposed technology mapping flow [52] .
In Fig. 20a, the speedup achieved by the ReVAMP architecture over the PLiM computer is presented for various word lengths. Even for a small word length of 4, the ReVAMP architecture gains in performance over the PLiM architecture by a factor of 2.9× on average. This shows that harnessing the inherent parallelism of ReRAM crossbar arrays for computation provides considerable performance gains. This justifies the VLIW nature of the ReVAMP architecture and demonstrates the effectiveness of the delay-constrained mapping.
The number of words (S D ) used determines the area of the mapping solution. To determine the effectiveness of the packing algorithm in packing blocks into words, we use the word utilization (W Util ) metric. W Util is the percentage of the total number of bits in S D words that are used by the mapping solution. For the example MIG, out of 9 bits (3 words, each with 3 bits), 8 bits are used and therefore W Util is 88.8%. The proposed packing algorithm achieves more than 97% utilization for all the benchmarks when w D = 16, including 100% utilization for the revx benchmark. Fig. 20b shows that an increase in word length from 4 to 8 leads to a considerable improvement in W Util . However, W Util is comparable for the word lengths 8, 16 and 32, and approaches 100%. This shows the effectiveness of delay-constrained mapping in packing the blocks into words.
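The lower bound D P * = 9#N used in the comparison above can be reproduced with a short calculation. The sketch assumes that each memory access transfers at most one 16-bit word per cycle, so that each 32-bit address read takes two cycles; this is one reading of how the 9 R/W cycles arise, and the function names are hypothetical.

```python
import math


def plim_cycles_per_instruction(word_bits: int = 16, address_bits: int = 32) -> int:
    """R/W cycles for one RM3 instruction under the stated assumption."""
    # Read @A, Read @B, Read @Z: three address reads, each split into words.
    address_cycles = 3 * math.ceil(address_bits / word_bits)
    # Read A, Read B, Write @Z: three single-bit operand accesses.
    operand_cycles = 3
    return address_cycles + operand_cycles


def plim_min_delay(num_nodes: int, word_bits: int = 16) -> int:
    """Lower bound on PLiM cycles for an MIG with num_nodes nodes,
    excluding the cost of computing negated values."""
    return plim_cycles_per_instruction(word_bits) * num_nodes


cycles = plim_cycles_per_instruction()   # 3 * 2 + 3
```

The speedup of a mapping solution with delay #C is then `plim_min_delay(num_nodes) / C`, where both values have to be taken from the reported results.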
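The word utilization metric is a simple ratio, sketched below for the example MIG. The function name `word_utilization` is hypothetical.

```python
def word_utilization(bits_used: int, num_words: int, word_length: int) -> float:
    """Percentage of the bits in num_words words that hold mapped values."""
    total_bits = num_words * word_length
    if total_bits <= 0 or not 0 <= bits_used <= total_bits:
        raise ValueError("inconsistent bit counts")
    return 100.0 * bits_used / total_bits


# Example MIG: 8 of the 9 bits in 3 words of 3 bits each are used.
example_util = word_utilization(8, 3, 3)
```

For the example, the ratio is 8/9, i.e., just under 89%.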
VII. CONCLUSION
In this work, we presented two approaches to the technology mapping problem for logic-in-memory computation using ReVAMP. The area-constrained method offers the flexibility to map to a variety of crossbar dimensions while harnessing the available parallelism of the ReRAM crossbar array. The delay-constrained method reduces the overall delay of mapping by using a multi-step approach that takes into account crossbar constraints while placing the operands. The proposed approach outperforms the state-of-the-art serial logic-in-memory approach using ReRAMs.
The synthesis approaches used in the technology mapping flow, such as the partitioning algorithm used for LUT mapping, are not aware of the crossbar constraints. The LUT partitioning algorithm should ideally try to partition the graph so that the ESOP expressions corresponding to the LUTs have roughly the same number of cubes, instead of solely minimizing the number of LUTs covering the graph. Also, the initial representation of the Boolean functions as AIGs/MIGs is not explicitly optimized with respect to the quality of the resulting mapping. We believe that optimizing the synthesis algorithms with respect to the crossbar constraints would allow a further reduction in the delay of mapping when combined with the proposed technology mapping approaches.
