Neural networks are used in a wide range of applications, such as speech recognition and image processing. There is strong motivation to improve the performance of these applications due to their industrial and commercial significance. Recently, binary neural networks have shown impressive efficiency and accuracy on image recognition data sets. The nature of the operations performed in these algorithms lends itself well to specialized hardware and processing-in-memory (PIM) approaches. In this paper, we introduce NV-Net, a spintronic, reconfigurable in-memory accelerator for binary neural networks. NV-Net can serve as a standard STT-MRAM array and a computational substrate simultaneously, and allows for massively parallel and energy-efficient computation. We evaluate NV-Net using multiple image classifiers and a genomics kernel for similarity matching. Our simulation results show that NV-Net is more energy efficient than alternative CPU, GPU, and FPGA based implementations and is also capable of higher throughput.
INTRODUCTION
Neural networks have gained widespread attention for solving a diverse set of problems, and a wide range of applications rely partly or completely on them. Thus, there is great incentive to improve the efficiency of neural network computation. Many energy-efficient, high-performance hardware implementations have been introduced, which usually adopt one of the following approaches:
• A dedicated accelerator capable of handling the main, floating-point-heavy computations of neural networks [5, 31]. This can be applied to the entire network, or advanced training techniques can be used to find the minimum bitwidth for each layer [12, 14].
• Weight sharing, where a single value is used for multiple weights in the network. This reduces the amount of data that has to be stored on chip [3].
• Pruning, which makes the weight matrices sparse. Sparse matrix algebra reduces the number of operations required and improves performance [13, 15].

While these methods can significantly improve efficiency, neural networks remain power- and memory-intensive applications due to large data sets and the large number of operations required to process each input. Typical network sizes are too large to store on chip, and thus require a supporting memory structure, typically DRAM, which has a high latency and energy cost.
Binary Neural Networks (BNN) are a recent addition to the accelerator design space that trades accuracy for efficiency by representing each weight with only a single bit: 0 represents the value -1 and 1 represents the value +1. This significantly reduces the memory space required for neurons and weights; in comparison, fixed-point representations of weights typically use 8 to 32 bits.
Effective methods for BNN training exist and render near state-of-the-art accuracy on the MNIST (hand-written digit recognition), CIFAR-10 (image classification), and SVHN (street view digit recognition) datasets [7]. Other BNN proposals, such as XNOR-Net [21] and DoReFa-Net [33], are based on AlexNet and use the much larger ImageNet dataset for image classification. BNNs tend to suffer a significant loss in accuracy on this larger problem unless some form of compensation takes place.
Perhaps the most significant advantage of binarization is the simplification of the underlying hardware, which improves computational efficiency. During the forward pass (which is at the core of inference but is also required for training), most arithmetic operations can be replaced with bit-wise operations [7]. For example, a multiplication can be simplified to an XNOR gate, which has much lower latency and energy cost. This is significant, considering the most common operation in NN computation: multiply-and-accumulate (MAC). Such hardware optimizations directly translate into better performance and energy efficiency, as exploited recently by FPGA-based BNN accelerators [18, 27] to render faster and more energy-efficient execution than CPU- and GPU-based counterparts.
When it comes to hardware acceleration for BNNs, many of the emerging Processing In-Memory (PIM) solutions that can perform bit-wise operations inside the memory array [17, 23, 11] are generally suitable. BNN computation consists largely of bit-wise binary operations; however, support is also necessary for non-binary operations such as pop-count and thresholding. These can be implemented with relative simplicity using sequences of multiple gates, subject to the limitations of the underlying PIM technology.
Using PIM for BNN acceleration circumvents expensive data transfers to/from memory. The BNN configuration (filters, weights, thresholds) does not change during inference, but non-PIM accelerators typically cannot fully exploit this: the on-chip memory is usually too small to keep all configuration parameters, so repeated data transfers to/from off-chip memory become inevitable. Under PIM, on the other hand, a sufficiently large memory array renders such data transfers unnecessary by construction.
In this paper, we introduce NV-Net, a PIM-based, reconfigurable BNN accelerator for forward propagation. Without loss of generality, we adopt a recently proposed non-volatile (spintronic) PIM technology [6], which provides a better energy efficiency and storage density trade-off when compared to volatile alternatives. A key contribution of our work over [6] is the introduction of novel architectures for the memory cell (augmented with compute capability), which in turn result in novel array architectures. The basic design from [6] cannot otherwise support BNN operations.
NV-Net can serve as a standard STT-MRAM array and a computational substrate simultaneously, and is capable of massively parallel and energy-efficient computation. We implement MNIST and CIFAR-10 classifiers, along with a genomics kernel for similarity matching, in NV-Net arrays, and use representative FPGA-based BNN accelerators as baselines for comparison; these FPGA designs achieve significantly higher throughput and energy efficiency than competing CPU- and GPU-based implementations.
In the following, Section 2 covers the background; Section 3, the proposed NV-Net design; Sections 4 and 5, the evaluation; Section 6, the related work; and Section 7, a summary of our findings.
BACKGROUND
2.1 Binary Neural Networks (BNN)
Neural Networks (NN): Both fully-connected and convolutional NNs can be used as classifiers. Fully-connected NNs consist entirely of fully-connected layers. Each fully-connected layer is one-dimensional, i.e., the neurons are arranged in a single line. Every neuron in each layer is connected by a weight to every neuron of the preceding layer; thus, the input to each neuron is the entire preceding layer. Say w_{i,j} is the weight from neuron i in layer l−1 to neuron j in layer l. If there are N_{l−1} neurons in layer l−1, a weighted sum s for neuron j is computed by

s = Σ_{i=1}^{N_{l−1}} w_{i,j} · x_i,

where x_i is the value of neuron i in layer l−1.
The final value of neuron j then becomes a non-linear function f of s under the bias B_l: j = f(B_l + s). Common choices for f are the sigmoid, tanh, htanh, ReLU, and sign functions. To compute the entire layer l, this computation is repeated N_l times. Once the entire layer is computed, it is used as input for the following layer.
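As a minimal sketch, the forward pass of one fully-connected layer can be written as follows; the NumPy shapes and the per-neuron bias vector are illustrative choices (the text above uses a single bias B_l per layer), not the paper's implementation.

```python
import numpy as np

def fully_connected_layer(prev_layer, weights, biases, f=np.tanh):
    """Forward pass of one fully-connected layer.

    prev_layer : (N_prev,) values of the preceding layer
    weights    : (N_prev, N_cur) array; weights[i, j] connects neuron i in
                 layer l-1 to neuron j in layer l
    biases     : (N_cur,) bias terms
    f          : non-linearity (sigmoid, tanh, htanh, ReLU, sign, ...)
    """
    s = prev_layer @ weights      # weighted sum, one entry per output neuron
    return f(biases + s)          # bias plus non-linearity yields the new layer
```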
Convolutional networks contain convolutional and pooling layers in addition to fully-connected layers. The input and output of convolutional and pooling layers are three-dimensional collections of neurons called feature maps (fmaps). Each neuron in an fmap has 3 coordinates, (x, y, z), corresponding to (width, height, depth), respectively, rather than a simple index as in fully-connected layers. It is this spatial arrangement of neurons that enables convolutional networks to detect patterns, which is the key to effective image recognition.
Convolutional layers use filters, which are simply three-dimensional collections of weights. The height and width of filters can vary (typically 3×3 to 11×11); however, filters typically contain weights for every layer of the input fmap. For example, if the input fmap has a depth of 3 (such as an RGB image) and the filter being applied is 5×5, the filter will be a 5×5×3 collection of weights, containing 75 weights in total. Applying the filter at one position produces one neuron of the output fmap. The application of the filter is identical to the application of the weights in a fully-connected layer, except that there are weights to only a subset of the input neurons.
The filter is positioned on top of the input fmap, overlapping a subset of the neurons, and each weight multiplies the neuron it overlaps. The sum of these products, s, is then used as input to some non-linear function, in the same way as in fully-connected networks. An example is shown in Fig.1. Moving the filter over the input fmap and repeating this procedure generates an output neuron at each position, and thereby produces one layer of the output fmap. Using multiple filters in a similar fashion produces multiple layers of the output feature map.
The stride, i.e., the distance the filter moves between positions, and whether the filter is allowed to slide over the edges of the input fmap, determine the height and width of the output fmap. If the stride is equal to 1 and the filter is allowed to slide over the edges of the input fmap, the output fmap has the same height and width as the input fmap. To allow the filter to go over the edges, the input is typically 0-padded, i.e., any weights that lie over the edge are multiplied by 0.
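The convolution just described can be summarized in a few lines; this is a plain sketch of a single-filter, stride-and-padding convolution, not the in-memory mapping (the function name, padding convention, and the use of sign as the non-linearity are illustrative assumptions).

```python
import numpy as np

def conv2d_single_filter(fmap, filt, stride=1, pad=0, f=np.sign):
    """Apply one (k x k x D) filter to an (H x W x D) fmap, producing one output layer."""
    H, W, D = fmap.shape
    k = filt.shape[0]
    x = np.pad(fmap, ((pad, pad), (pad, pad), (0, 0)))   # 0-padding over the edges
    out_h = (H + 2 * pad - k) // stride + 1
    out_w = (W + 2 * pad - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = f(np.sum(window * filt))         # weighted sum + non-linearity
    return out
```

With stride 1 and pad = k // 2, the output layer has the same height and width as the input fmap, matching the description above.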
Pooling layers down-sample fmaps. In a pooling layer, the fmap is divided by width and height into multiple sections; a typical pool size is 2×2. The largest value in each section (at each depth) is kept and the rest are discarded, so each section reduces to a single neuron. A pool size of K×K reduces the height and width by a factor of K. Since pooling applies to all depths of the fmap, the depth remains unchanged.

Binarization: Reducing bit precision is a common technique to improve NN efficiency. Binarization takes this to the extreme, reducing the representation of every neuron and weight to a single bit [7], which greatly simplifies the type of computation required. In standard NNs, most operations are high-latency, power-intensive multiply-and-accumulates (MAC), where neuron values are multiplied with weight values, summed, and transformed non-linearly (typically via a sigmoid function). In binary NNs, bit-wise XNOR operations replace the multiplications; the resulting bits are then summed and compared to a threshold value (a sketch of this bit-level computation follows below). Due to their simpler nature, these operations can be performed much more quickly and energy efficiently. Note that this applies only to forward propagation. Inference consists entirely of forward propagation and thus can fully exploit these benefits; during training, additional non-binary parameters must be maintained and updated. We focus on inference in this work and assume that training is performed offline in software. That said, training can still benefit from more efficient forward propagation as enabled by NV-Net.
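The following minimal sketch captures the bit-level equivalent of a MAC plus sign activation in a BNN: with -1 encoded as 0 and +1 as 1, multiplication becomes XNOR, accumulation becomes a popcount, and the non-linearity becomes a comparison against a threshold (assumed here to already absorb the rescaling from the ±1 domain).

```python
def binary_neuron(inputs, weights, threshold):
    """One binarized neuron: XNOR, popcount, then threshold.

    inputs, weights : sequences of bits (0 encodes -1, 1 encodes +1)
    threshold       : integer the popcount is compared against
    """
    xnor = [1 - (a ^ w) for a, w in zip(inputs, weights)]  # 1 where the bits agree
    popcount = sum(xnor)                                   # number of agreements
    return 1 if popcount >= threshold else 0               # thresholding replaces sign
```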
2.2 Spintronic Processing In Memory (PIM)

Without loss of generality, we adopt Computational RAM (CRAM) as the base spintronic PIM substrate in this study [6]. The structure consists of an array of magnetic tunnel junctions (MTJs) and can be used as a standard STT-MRAM memory array. Due to the desirable properties of MTJs, the memory is fast, low power, high density, and non-volatile; owing to non-volatility, standby power is near zero. At the same time, MTJs are inherently compatible with logic operations. This enables computation to take place entirely inside the memory array, without the use of external logic or sense amplifiers, providing true PIM capability. Further, the structure of the array allows for massively parallel computation.
As we will detail in the following, however, the basic array structure from [6] cannot support BNNs. Building upon this limited basic design [6], a key contribution of our work is a design space exploration for the memory cell architecture (augmented with compute capability), which resulted in novel memory cell and array architectures enabling efficient BNNs.

Cell & Array Architecture (2T1M) [6]: We consider three different configurations of the memory cell, ranging from 1T1M (one transistor per magnet) to 3T1M (three transistors per magnet). Magnet in this context refers to the magnetic tunnel junction (MTJ), the key building block of the memory cells. MTJs are resistive memory devices. Each MTJ has two magnetic layers, a fixed layer and a free layer; the polarity of the free layer can change, but that of the fixed layer cannot. When the two layers are aligned, the MTJ is in the parallel (P) state, which is treated as logic 0. When the layers are not aligned, the MTJ is in the anti-parallel (AP) state, which is treated as logic 1. The MTJ has a much higher resistance in the AP state than in the P state, and it is these two resistance levels that represent logic 1 and logic 0. The state can be changed by passing current through the device, where the current direction determines the final state. Fig.2a shows the default 2T1M cell architecture from [6], which we include here as a baseline for comparison. The array is identical to a standard 1T1M STT-MRAM array, except for the additional access transistor per cell; the control signal Logic Line (LL), which runs along the rows; and the control signal Bit Line for Logic (BLL), which runs along the columns.

Memory & Logic Semantics (2T1M) [6]: If no computation is taking place, the array from Fig.2a can serve exactly like an STT-MRAM array. In this case, the WLM signal activates the access transistor t_M (transistor for memory), which connects MTJs to the bitline bar (BLB) on a per-row basis. Cells can then be read or written via BL and BLB. The extra hardware enables computation within the array. The second access transistor, t_L (transistor for logic), connects the MTJs to LL; BLL controls t_L. When BLL is activated for multiple columns, the MTJs in those columns that are also in the same row are connected together via the shared LL. MTJs connected over LL in a row can act either as inputs or as outputs of a logic gate; the result of the logic gate directly changes the state of the MTJ designated as the output. Voltages applied to the BLs can enforce a specific switching activity for the output MTJ, which evolves as a function of the resistances (i.e., logic states) of the input MTJs. In other words, the voltages applied to the BLs control the switching of the output magnet according to a specific truth table, and a specific voltage range corresponds to each logic gate. This structure can support logic gates with various numbers of inputs.
This design is limited, however, as any logic gate activated in a row also gets activated in all rows in the array. This is because BLL (which activates cells to serve as logic gate inputs or outputs) runs through all columns in the array. If BLL is set for a column, MTJs in that column, in all rows, get connected to LL. While such massive row level parallelism may be desirable, it impairs direct adaptation for BNN processing.
Cell & Array Architecture (3T1M) for NV-Net: To satisfy true PIM semantics, we need to perform computation in only a subset of the rows while leaving all other rows unperturbed, which is not possible with [6]. To achieve this, we add a third access transistor, as shown in Fig.3a and 3b. Otherwise, this design preserves the very same memory interface as a standard STT-MRAM.
For logic operations, t_L from Fig.2a now becomes t_LC (transistor for logic column); BLL controls t_LC. The new signal Wordline for Logic (WLL) controls the third transistor, t_LR (transistor for logic row). t_LR is in series with t_LC, thus both t_LC and t_LR must be activated to connect the MTJ to LL. This enables a row-wise and column-wise specification of any MTJ to serve as an input or output of a logic gate; BLL and WLL determine this specification. As all MTJs in a row share the same LL, still only one logic operation can be performed in each row at a time. However, each operation can be performed in any number of rows simultaneously, so the array retains row-level parallelism. We next take a closer look at the memory and logic semantics.

Memory & Logic Semantics (3T1M) for NV-Net: Fig.3a shows a 3T1M NV-Net cell activated for memory. WLM is set for only one row. This activates t_M and connects the MTJs to the BLBs. MTJs can then be read or written via voltages applied to BL and BLB. In this configuration, the array acts exactly as a standard STT-MRAM array. For data retention, i.e., if no read or write access takes place, keeping WLM, WLL, and BLL at logic 0 suffices. Fig.3b shows a 3T1M NV-Net cell activated for logic. WLL is set for all rows in which computation should occur; we follow the same method as [17] and activate rows sequentially. This activates t_LR for all cells in the row. BLL is set for all columns that contain the inputs and the output, which activates t_LC for all cells in the corresponding columns. Since t_LR and t_LC are in series, only MTJs in cells with both activated get connected to LL. Voltages are then applied to the BLs to specify the type of logic operation.

Cell & Array Architecture (1T1M) for NV-Net: While the 3T1M design enables selective computing in the NV-Net array (which is not the case for [6]), it incurs the area overhead of an extra transistor. We next present an area-efficient alternative. By taking the transpose, i.e., rotating the logic functionality relative to the direction of memory operations, we can reduce the number of transistors from 3 to 1, as shown in Fig.2b. The two MTJs are in adjacent cells in the same column. BL and BLB are replaced with Bitline Even (BLE) and Bitline Odd (BLO); adjacent MTJs are connected to different bitlines, BLE and BLO, respectively (and the connection alternates throughout the column). The Logic Line (LL) serves both as the connection between MTJs for logic operations and as the input for read and write operations. The Wordline (WL) controls the single access transistor in each cell and is used to connect the MTJ to LL.

Memory & Logic Semantics (1T1M) for NV-Net: When no computation takes place, performing memory operations entails first setting WL. Specifically, WL is activated for only one row. This activates the access transistors and connects the MTJs (in the respective row) to LL. Voltages can then be applied to LL and to either BLE or BLO (depending on the parity of the respective row) for read and write operations. Compared to the baseline 2T1M design from Fig.2a, for memory access, LL and either of BLE/BLO can be thought of as equivalent to BLB and BL, respectively.
For logic, on the other hand, WL is set in multiple rows, which connects multiple MTJs in a column to LL. This creates a connection between all activated MTJs that are in the same column. Voltages are applied to both BLE and BLO in the columns where the logic gate inputs and outputs reside. These voltages determine the type of logic gate performed, similar in nature to [6].
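To make the addressing semantics concrete, the following toy sketch models which MTJs join a logic operation in each variant; the function names and set-based interface are illustrative only.

```python
def logic_members_3t1m(wll_rows, bll_cols):
    """3T1M: an MTJ joins the logic line only if both its row (WLL, via t_LR) and its
    column (BLL, via t_LC) are activated, since the two transistors are in series.
    Each activated row forms one gate along its own shared LL."""
    return {(r, c) for r in wll_rows for c in bll_cols}

def logic_members_1t1m(wl_rows, gate_cols):
    """Transposed 1T1M: WL activates entire rows, and gates form along columns;
    the activated MTJs of each selected column are connected via that column's LL."""
    return {(r, c) for r in wl_rows for c in gate_cols}

# Example: activating WLL on rows {0, 1} and BLL on columns {2, 5, 7} of a 3T1M array
# forms one 3-magnet gate in row 0 and an identical one in row 1, leaving all
# other rows of the array unperturbed.
print(sorted(logic_members_3t1m({0, 1}, {2, 5, 7})))
```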
This design has the extra restriction that inputs must be in rows that are all connected to the same type of bitline (i.e., either to BLE or BLO) with the output connected to the other type: If all inputs are connected to BLE (BLO), the output must be connected to BLO (BLE). This is because all input MTJs must be in parallel with each other and in series with the output MTJ. This may impact data layout, depending on the algorithm being mapped to the array. However, due to the reduction to one transistor, the resulting array has nearly the same density as a standard STT-MRAM array.
NV-Net
We will next look into basic computational BNN building blocks and data layout in the NV-Net array along with design optimizations. Without loss of generality, in the following, we will use the 3T1M NV-Net as a running example. Transposed 1T1M NV-Net simply rotates the sense of logic operations to occur in columns rather than in rows, and is logically equivalent to 3T1M.
BNN Building Blocks
By construction, NV-Net can perform any logic gate that CRAM can perform [6], as the principle of gate formation and logic operation is the same despite the differences in cell and array architectures. This translates into a universal set of gates including NOT, NAND, NOR, AND, OR, and MAJ(ority). The number of inputs is arbitrary but limited by voltage variation (Sect.4.1).

The operating principle is as follows, irrespective of the cell type: all magnets that serve as either input or output of the logic gate being formed are connected to the Logic Line LL. The logic states correspond to the resistance levels of the participating magnets. This always renders the same topology of a resistive network: all input magnets in parallel, connected in series to the output magnet. A voltage V applied on the respective bitlines, to set the gate type G, forms a voltage differential across this resistive network. The MTJ corresponding to the gate output is preset to a known value. V forces the gate output to switch or preserve its value as a function of the state of the inputs, according to the truth table of G. More specifically, V (along with the states, i.e., resistance levels, of the input magnets) determines the current through the output magnet, which switches if this current exceeds the switching threshold.

XNOR is a critical component of binarized NN implementations. The output of XNOR is 1 when the inputs are the same and 0 otherwise. NV-Net cannot support the XNOR operation in a single step (following the resistive-divider based principle). However, NV-Net can perform XNOR using a sequence of NOR gates, following

XNOR(a, b) = NOR( NOR(a, NOR(a, b)), NOR(b, NOR(a, b)) ).
This process requires four NOR gates and thus four steps to complete, in addition to three temporary values, as shown in Fig.4. Two NOT and three NAND gates can implement XNOR as well.

Addition is implemented by the ripple-carry algorithm, without loss of generality, where the output is computed one bit at a time. The first step is a half add and all remaining steps are full adds, which consist of NAND and NOT gates. Each full add requires four temporary values, including the carry bit, and takes a total of 5 steps. Thus, addition of two n-bit numbers requires 5n steps and 4n temporary values. Using only NAND gates, addition takes 9n steps.

Comparison to a threshold value and the subsequent sign operation are commonly used in BNNs, as the counterpart of the non-linear function f from Sect.2.1 for an ordinary NN. We implement this by subtracting the threshold value from the input and then taking the sign of the result. The sign is equal to the inverse of the borrow-out signal produced when subtracting the threshold from the input. Thus, we do not need to perform the actual subtraction; we only need to compute the borrow-out signal. To this end, we use the ripple-borrow algorithm, where a full subtractor is implemented for each bit of the input. However, for each bit, we do not compute the difference bit, but only the output borrow, B_out. Given x, a bit of the threshold value, and y, a bit of the input, the equation for B_out (realized in the array with NAND and NOT gates) is

B_out = (¬y · x) + (B_in · ¬(x ⊕ y)),
where B_in is the input borrow signal, which is set to 0 for the least significant bit. The generated B_out is used as the B_in of the next full subtractor, and this is repeated for all bits of the input. The final B_out bit is then inverted to produce the sign bit. Computing B_out for each bit requires one NOT gate, 4 NAND gates, and 4 temporary values. Thus, comparing two n-bit numbers takes 5n+1 steps and 5n temporary bits.

Popcount takes a list of bits as input and produces an integer equal to the number of 1s in the list. NV-Net implements popcount with a sequence of additions, scheduled hierarchically following the scheme of a full adder tree. Initially there is a list of 1-bit operands. First, all bits (operands) are paired and added together to form a sequence of 2-bit numbers. In the second stage, pairs of these 2-bit numbers are added to form 3-bit numbers. This process repeats until only one number remains, which is equal to the popcount of the original sequence of bits. If the number of operands is odd at any step, the remaining operand is 0-extended and carried over to the next step. At each consecutive stage the number of operands is halved, but the number of full adds per addition increases by 1.

Batch Normalization primarily serves to accelerate training but can also improve accuracy during inference. Batch normalization consists of a scale (multiply) and a shift (add) transformation. In the context of BNNs, it is applied to the result of the popcount before thresholding occurs. Shift-based batch normalization (which simply replaces the costly multiplication with a shift) incurs a negligible accuracy loss [7, 18] and works particularly well in NV-Net: the shift operation consists simply of writing 0s in the appropriate locations, overwriting either the most or least significant bits of the input, so no data transfer is necessary. Addition then proceeds as described above. Other networks [27], while not directly using shift-based batch normalization, implement the same effect by thresholding and a modification of the input weights. A brief software sketch of these building blocks follows.
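As a minimal software sketch of these building blocks (bit-level, not the in-array gate sequences themselves), the XNOR-by-NOR decomposition, the borrow-only thresholding, and the adder-tree popcount can be expressed as follows; bit lists are assumed LSB first and of equal length.

```python
def xnor(a, b):
    """XNOR of two bits via the four-NOR sequence used in NV-Net (three temporaries)."""
    nor = lambda p, q: 1 - (p | q)
    t1 = nor(a, b)
    t2 = nor(a, t1)
    t3 = nor(b, t1)
    return nor(t2, t3)

def sign_via_borrow(value_bits, threshold_bits):
    """Sign of (value - threshold), computed from the ripple-borrow chain only.
    Returns 1 if value >= threshold, else 0."""
    b_in = 0
    for y, x in zip(value_bits, threshold_bits):       # y: input bit, x: threshold bit
        b_in = ((1 - y) & x) | (b_in & (1 - (x ^ y)))  # borrow-out of a full subtractor
    return 1 - b_in                                    # invert the final borrow

def popcount_tree(bits):
    """Popcount via a hierarchical adder tree: operands are paired and added per stage."""
    operands = list(bits)
    while len(operands) > 1:
        if len(operands) % 2:                          # odd count: 0-extend, carry over
            operands.append(0)
        operands = [operands[i] + operands[i + 1] for i in range(0, len(operands), 2)]
    return operands[0]
```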
BNN Data Layout

Figure 5: Layout for fully-connected forward propagation. N_ij: neuron j in layer i. w_ij-k: weight from neuron j in layer i to neuron k in layer (i+1). i = 0 in this example.

Fully-Connected Layers: In fully-connected forward propagation, each neuron has as inputs all neurons in the previous layer. Hence, for each neuron in the layer, there is a set of weights to every neuron in the previous layer. NV-Net uses data duplication to compute each neuron of the output layer in parallel, as follows: we group one or more rows together and dedicate them to computing one neuron in the output layer. Each group of rows must contain a copy of every neuron in the previous layer and all weights to those neurons. When computing forward propagation from a layer with n neurons to a layer with m neurons, there are m groups of rows, each of which contains n neurons and n weights. Each group also stores any bits necessary for batch normalization and thresholding, and computes in parallel.
The optimal group size g, i.e., the number of rows in each group, depends on the network layer sizes. A larger g increases parallelism but introduces extra overhead due to communication between rows: the results from each row within a group must be moved to a single row in order to compute the final result, and the number of extra data transfers required increases with group size.

Fig.5 provides an example layout where each group is a single row (g = 1). The input is layer 0; the output, layer 1. Each of the n input neurons is duplicated across the m rows. The neurons are then XNORed with the corresponding weights, and the results overwrite the input neurons (as a space optimization). These bits are then summed (popcount), modified by batch normalization, and finally thresholded as described in Sect.3.1. All neurons of the computed layer are then stored in a single column (demarcated as Outputs in Fig.5). The output neurons must then be read out; they are either the final output of the program or the inputs to the next layer of the network.

Convolutional Layers: In 3-dimensional convolution, a filter is a 3-dimensional collection of weights. Filters can have different heights (y dimension) and widths (x dimension), but all filters have weights to all layers (depth, z dimension) of the input fmap (Sect.2.1). In other words, the depth of the filter must equal the number of layers (depth) of the input fmap. One filter generates one layer of the output fmap. Thus, there are as many layers in the output fmap as there are filters applied to the input fmap.
To compute the value of an output neuron, each weight (a single bit for a BNN) of a filter is XNORed with a neuron in the input (also a single bit). The number of 1s resulting from these XNOR operations is then counted. This sum is next batch normalized and thresholded to render the bit value of the output neuron. The filter is then slid (in the x and y dimensions) over the input and an output neuron is computed at each position. Fig.6 depicts an example where the z dimension is 1 (depth = 1). To compute the output neuron labeled N, each filter bit is XNORed with the input neuron it overlaps (filter bit 1 with neuron A, filter bit 2 with neuron B, etc.). The filter would then be shifted to the right and the output neuron to the right would be computed accordingly. Fig.7 provides an example of 3-dimensional convolution with depth = 2. Neurons and weights at consecutive depths are stored in consecutive cells in the NV-Net array.

Output neurons can be computed in parallel as they are not data dependent. Computing output neurons in parallel in NV-Net requires data duplication. As is done for fully-connected layers, rows are grouped together and each group computes one output neuron. Any input neurons needed for the computation are written into the same group of rows, along with bits for the filter, batch normalization, and thresholding. Thus, each group contains all required data and can operate independently of all other groups.

Fig.8 shows an example 2-dimensional (7×7×1) input layer along with two 2-dimensional (3×3) filters. Fig.9 captures the placement of these bits in the NV-Net array for a group size of 1 (i.e., each row computes one output neuron). Since the filter is 3×3 and the input is 1 layer deep, there are 3×3×1 = 9 filter bits. All 9 bits of each filter are written into a single row (in the Filter Bits portion of the NV-Net array from Fig.9). They are then duplicated along the rows so that each output bit being computed has its own copy.
In Fig.9, the filter bits on rows 1-49 are all the same and contain the bits of the first filter (Fig.8b). The second filter (Fig.8c) is duplicated on rows 50-99. If there were additional filters, they would be written in the same manner on the following rows. The 9 filter bits can overlap with up to 9 input neurons; which neurons these are depends on the position of the filter.
The input bits that the filter overlaps at each position are written to separate rows. Thus, there is a row for each possible position of the filter, and each position of the filter is computed simultaneously. Since all filters are applied to the same input neurons, the input bit pattern must be repeated for each filter. For example, in Fig.9, row 1 is repeated on row 50 for the computation of the 2nd filter, row 2 is repeated on row 51, and so on. Note that the filter can go over the edge of the input fmap. For example, the 3×3 filter centered at input neuron 1 overlaps only input bits 1, 2, 8, and 9 (as captured by rows 1 and 50). The remaining 5 of the 9 locations allocated for the input bits are left empty and are filled with dummy 0s.

Figure 9: The spatial arrangement of input, filter, and output bits during convolution in NV-Net. Bits of neurons and filters are labelled according to their logical position in Fig.8. All filters and all locations of each filter are computed in parallel. For clarity, bits for batch normalization and the threshold are not shown.
As computation in each row is independent, the XNOR, addition, batch normalization, and thresholding operations can be performed in parallel for all output bits of every filter. At the end of these steps, all output bits are stored in a single column. The example is 2-dimensional for clarity; however, this layout easily extends to 3-dimensional convolution. If the input is z layers deep, each input and filter bit becomes an array of z bits stored in consecutive cells. For k×k filters, each group of rows must then contain k×k×z input and filter bits. A sketch of this row layout is given below.
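The row layout just described can be sketched in a few lines: one row per (filter, filter position) pair, holding the overlapped input neuron indices and dummy 0s where the window slides over the edge. The function and the 1-based neuron numbering mirror the Fig.8/Fig.9 example and are purely illustrative.

```python
def conv_row_layout(input_h, input_w, k, num_filters):
    """Per (filter, output position) pair, list the input bit indices that share a row
    with that filter's bits (group size 1, stride 1, 0-padded edges).
    Input neurons are numbered 1..H*W in row-major order, as in Fig.8."""
    rows, half = [], k // 2
    for _ in range(num_filters):                   # filters occupy consecutive row blocks
        for cy in range(input_h):                  # filter center sweeps every position
            for cx in range(input_w):
                row = []
                for dy in range(-half, half + 1):
                    for dx in range(-half, half + 1):
                        y, x = cy + dy, cx + dx
                        if 0 <= y < input_h and 0 <= x < input_w:
                            row.append(y * input_w + x + 1)   # overlapped input neuron
                        else:
                            row.append(0)                     # dummy 0 beyond the edge
                rows.append(row)
    return rows

# For the 7x7 input and two 3x3 filters of Fig.8, this yields 49 rows per filter, and the
# row for a filter centered at neuron 1 holds [0, 0, 0, 0, 1, 2, 0, 8, 9]: only input
# bits 1, 2, 8, and 9 are overlapped, matching the text.
```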
Computation vs. Communication Trade-off
The fully-connected and convolutional layer organizations we have covered so far locate the output neurons in a single column. For non-transposed NV-Net arrays this is a bottleneck, as each bit must be read out sequentially. One solution is to sequentialize the computation by computing not all, but only a subset of the output layer neurons in parallel. For fully-connected layers, this translates into a chosen subset of the output layer neurons being computed in parallel; for convolutional layers, a subset of the filters. This leads to less data duplication. At the same time, we can store the results of each sequential step in separate columns, increasing the parallelism at which they can be read out.

EVALUATION SETUP

System Configuration

Technology Parameters: While MTJs manufactured today can be used effectively in NV-Net arrays, they are expected to improve significantly in the coming years. We therefore consider both a modern-day and a projected future MTJ specification, listed in Table 1. The threshold current, I_C, is the current at which the MTJ has a 50% chance of switching within the switching time T_switch. By setting the write current to 1.5×I_C, switching occurs with a probability of error of less than 10^-5.

Table 1: MTJ Specification.

Voltage Signature per Logic Gate: The switching activity of each logic gate output depends on the current through the output MTJ. This current in turn is a function of the voltage applied on the bitlines and of the inputs, i.e., the states (resistances) of the MTJs which form the gate inputs. The applied voltage directly determines the type of the gate; in other words, the voltage acts as a signature for the gate type. Therefore, correct operation demands preventing potential voltage variation on the bitlines (due to, e.g., manufacturing imperfections) from making one gate act as another.

Table 2: Voltage signatures (ranges) for common gates in mV.

The voltage signature of a gate is not restricted to a single value; it can assume any value in a gate-specific range, as shown in Table 2 for common gates, where the values in parentheses capture the range. The parallel combination of the input MTJ resistances, in series with the output MTJ resistance, determines this voltage range. Any voltage in this range facilitates the switching activity of the output MTJ according to the truth table of the respective gate, for all input combinations. Each of these voltage ranges can be interpreted as a gate-specific voltage margin: correct operation is guaranteed as long as any fluctuation around the assumed voltage signature renders a voltage within the range corresponding to the respective gate.
NAND and NOR are 2-input gates; inverted majority IMAJ-3 and IMAJ-5 are 3- and 5-input gates, respectively. As Table 2 indicates, gates with larger fan-in generally have smaller margins. The reason is twofold: first, more input resistances in parallel yield a smaller combined resistance, so smaller changes in voltage cause larger changes in current; second, the differences in resistivity between different combinations of inputs are smaller. This effect is made worse by the fact that the combined input resistance is in series with the output resistance (which is always the case, independent of the number of inputs). Hence, there is a sharp drop in voltage margin with an increasing number of inputs.
It is noteworthy that NAND has a larger margin than NOR. Table 3 depicts the combined input resistance, for all possible input combinations, of a 2-input gate. Recall that an MTJ in the Anti-Parallel (AP) state incurs a higher resistance than in the Parallel (P) state; the resistance in the AP state (R_AP) corresponds to logic 1, and in the P state (R_P) to logic 0. Accordingly, R_00 < R_01 = R_10 < R_11, where R_00 captures the combined input resistance for the input combination 00; R_01, for 01; and so on. For a given gate (hence voltage signature), the current through the output magnet, I, assumes its maximum for the lowest value of the combined input resistance, hence I_00 > I_01 = I_10 > I_11. Let us assume a non-transposed NV-Net design, without loss of generality, where the output magnet is preset to 0 to perform NAND or NOR. For NAND, the output should switch for all combinations but 11; for NOR, only for 00. Hence, it is critical for NAND to differentiate between R_11 and R_01 (= R_10); for NOR, between R_00 and R_01 (= R_10). As Table 3 indicates, the differences between the combined input resistances relevant for NAND are much larger, which renders a larger voltage margin for correct operation.
Input State               Modern (M)   Future (F)
R_11: 2 AP, 0 P           6820         50900
R_01 = R_10: 1 AP, 1 P    5354         23590
R_00: 0 AP, 2 P           4725         19050

Table 3: Input resistance (Ω) for all 2-input combinations.
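The fan-in argument can be illustrated with a toy resistive-network calculation; the R_P/R_AP values below are placeholders, not the device parameters of Table 1, and the point is only the trend: as fan-in grows, the relative gap between adjacent input patterns shrinks, and with it the voltage margin.

```python
def combined_resistance(num_ap, num_p, r_ap, r_p):
    """Parallel combination of num_ap MTJs in the AP state and num_p MTJs in the P state."""
    return 1.0 / (num_ap / r_ap + num_p / r_p)

R_P, R_AP = 9450.0, 13640.0                          # illustrative values, in ohms

for fan_in in (2, 3, 5):
    all_zero = combined_resistance(0, fan_in, R_AP, R_P)       # all inputs at logic 0
    one_flip = combined_resistance(1, fan_in - 1, R_AP, R_P)   # one input at logic 1
    gap = (one_flip - all_zero) / all_zero
    print(f"fan-in {fan_in}: {all_zero:.0f} vs {one_flip:.0f} ohms, gap {gap:.1%}")
```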
As Table 2 indicates, for current MTJ specifications, only NAND and NOT gates are practical; the voltage margins for gates with 3 or more inputs are too small to guarantee correct operation. Predicted future MTJs have a much lower switching current but a significantly larger TMR, i.e., (R_AP − R_P)/R_P. A larger TMR makes the differences in resistivity between different input combinations more pronounced, thereby increasing the voltage margins; a smaller switching current has the opposite effect. For low fan-in gates, the TMR effect is dominant and future MTJs have a larger margin than modern MTJs, despite the lower current. For high fan-in gates, the opposite is the case. NOT and NAND gates have very large voltage margins relative to their voltage signatures; considering future MTJs, this generally applies to other gates as well. Restricting logic operations to only NAND and NOT gates significantly reduces susceptibility to process variation, and luckily this is a universal set of gates. We consider the implications of restricting operations to only NAND and NOT gates in the evaluation.

Array Configuration: We fix the capacity of NV-Net arrays at 128MB, and use a configuration suggested by NVSim [8] for an STT-MRAM array. Each array is a single bank consisting of 256 mats. Mats are organized into 32 rows and 8 columns. Each mat contains 2×2 = 4 subarrays, each of size 1024×1024.
Latency and energy estimates due to peripheral circuitry come from NVSim [8] as well. This renders a close-to-worst-case analysis, as the NVSim estimates only cover the modern technology, without any optimization specific to NV-Net. We also evaluate performance using speculative future peripheral circuitry overheads. As future MTJs are still a few years from being ready for production, the exact supporting peripheral circuitry overhead is unknown. To estimate it, we scale the NVSim estimates so that the peripheral circuitry has the same energy and latency percentage share for memory operations as it does in modern STT-MRAM arrays.
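A minimal sketch of that scaling step, assuming the peripheral share of a memory operation's total cost is held constant; both the function and the numbers in the example are illustrative, not NVSim output.

```python
def scaled_peripheral_cost(future_cell_cost, modern_cell_cost, modern_peripheral_cost):
    """Project the peripheral (latency or energy) cost for future MTJs so that its share
    of a memory operation matches the share observed in a modern STT-MRAM array."""
    share = modern_peripheral_cost / (modern_cell_cost + modern_peripheral_cost)
    return future_cell_cost * share / (1.0 - share)   # keeps peripheral/(cell+peripheral) fixed

# Example: if peripherals account for 60% of a modern array's per-access energy,
# a future cell energy of 0.1 (arbitrary units) implies ~0.15 of projected peripheral energy.
print(scaled_peripheral_cost(0.1, 1.0, 1.5))
```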
While the method by which NV-Net performs computation is different, the array structure is similar to Pinatubo [17]. The demands on our peripheral circuitry are similar to [17] as well, though we do not use sense amplifiers during computation. In the case of our 2T1M cell design, we use an extra transistor. Given that MTJs are low-power devices, the current draw remains relatively modest even during highly parallel computation. Using projections for future MTJ devices (Table 1), a 128MB NV-Net array performing computation on every row would still consume considerably less current than a DDR3 SDRAM write operation [1].
Since NV-Net requires multi-row access, we follow the approach developed in [17], where row addresses are supplied sequentially. Modifications to the local wordline driver allow these to be latched until cleared. For logic operations, where multiple rows can be activated, we assume that the rows are addressed sequentially and account for the latency and energy of both the row activations and the voltages applied to the bitlines.

System Integration: NV-Net can be attached to the host system as a standalone accelerator over PCIe, or be part of the memory hierarchy (which renders an interface similar to emerging hybrid memory systems featuring NVM). In either case, the NV-Net memory space is exposed to the host (e.g., over Direct Memory Access). We assume that all BNN configuration parameters, including weights and thresholds, are stored in the array prior to computation; these parameters do not change during inference. Additionally, NV-Net arrays are large enough to hold all of them, along with sufficient space for temporary values. Thus, there is no need to communicate with the host during inference. Different layers of the networks are processed in different NV-Net arrays. This alleviates the need for excessively large arrays and creates an opportunity for pipelining.

Inference constitutes two phases: a highly parallel computation phase and a data communication and duplication phase. Thus, each layer of the implemented networks has an associated latency and energy cost for both computation and communication. The computation phase comprises logic and memory operations which move data within the array. The communication phase comprises data transfers between arrays, consisting of reads from the source array and writes to the destination array.
Benchmarks and Baselines for Comparison
We implement two MNIST and two Cifar-10 classifiers in NV-Net, and use two representative FPGA-based BNN accelerators as the baselines for comparison [18, 27], which achieve higher throughput and better energy efficiency than GPU-based implementations. We reproduce the same network topologies in NV-Net. While the supporting hardware is different, the inputs, network sizes, and operations performed are logically identical; as a result, NV-Net produces the same output and accuracy. In other words, we perform an iso-accuracy comparison. We also implement BioNET, a BNN kernel for similarity matching in genomics, on NV-Net. To quantitatively characterize NV-Net's performance and energy efficiency, we use an event-driven in-house simulator.

Fully-Connected FP-BNN: We follow the layer configuration of [18]. The model consists of 4 fully-connected layers.
The input is a 28×28 gray-scale image with 8-bit pixels. There are 784 input neurons (each 8 bits), 10 output neurons, and three hidden layers of 2048 neurons each. The network achieves 98.24% accuracy on the MNIST dataset. One NV-Net array is dedicated to each layer.

Fully-Connected FINN: The fully-connected network in [27] is slightly smaller. The input is a 28×28 image that has been binarized. There are 3 hidden layers of 1024 neurons each. The output is a vector of 10 bits corresponding to the 10 output neurons. It achieves 98.4% accuracy on the MNIST dataset. One NV-Net array is dedicated to each layer.

Convolutional FP-BNN: Following the configuration provided in [18], this network contains 6 convolutional layers, 3 pooling layers, and 3 fully-connected layers. It is divided into 9 NV-Net arrays: 6 for convolutional and 3 for fully-connected layers. Pooling layers are computed in the same arrays as the preceding convolutional layers. Data transfer between convolutional and pooling layers is considered part of the computation phase since it is intra-array communication. The input is a 32×32 image with 3 channels. All filters are 3×3 and all pooling layers are 2×2 maxpool. There are 128 filters for the first two convolutional layers, 256 for the third and fourth, and 512 for the fifth and sixth. The output of the last convolutional layer is 8,192 neurons, which are input to the first fully-connected layer. There are two hidden layers of 1024 bits each, and the final output layer is 10 neurons. It achieves an accuracy of 86.31% on the Cifar-10 dataset.

Convolutional FINN: The convolutional network in [27] is similar in structure. It also has 6 convolutional layers, 3 pooling layers, and 3 fully-connected layers, and is likewise divided into 9 NV-Net arrays, with convolutional and subsequent pooling layers computed in the same arrays. All filters are 3×3 and all pooling layers are 2×2 maxpool. There are 64 filters for the first two convolutional layers, 128 for the third and fourth, and 256 for the fifth and sixth. There are two hidden layers of 512 neurons each. The output is 10 16-bit neurons. It achieves an accuracy of 80.1% on the Cifar-10 dataset.

BioNET: We introduce a customized binarized neural network, BioNET, which determines whether two strings of genomic information match. It is used as part of a larger genetic algorithm and achieves an accuracy of 93.4% on average. The core operations are the same as in other BNNs; however, convolution can occur in different dimensions for different layers. Additionally, the layer sizes are non-uniform; they are listed in Table 4. As this network is the first of its kind, there is no direct baseline for comparison. Since [18] is an FPGA design specifically tailored for BNNs, which significantly outperforms CPU and GPU alternatives, we use a BioNET implementation on it as a baseline. To this end, using reported layer sizes and characterization data from [18], we extract a latency and energy model for an arbitrarily sized network on the FPGA.

Layer  Type  Input     # Filters  Filter Size  Pool Size  Output
1      Conv  4x100x1   64         4x3          1x5        1x20x64
2      Conv  1x20x64   32         1x5x64       1x2        1x10x32
3      Conv  1x10x32   20         1x4x32       1x2        1x5x20
4      FC    100       -          -            -          40 (10-bit)

Table 4: BioNET layer configuration.
EVALUATION
We evaluate NV-Net for both Modern (M) and Future (F) MTJ, per Table 1 , and consider four different configurations in each case:
• I: Ideal case which has no peripheral circuitry overhead and where all gate types are allowed;
• PC: I with the addition of peripheral circuitry;
• N: I with gate types restricted to NAND and NOT only;
• PC+N: I with both peripheral circuitry and gate type restriction.

We evaluate the Future MTJ considering two additional configurations:
• FPC: I with the addition of predicted future peripheral circuitry;
• FPC+N: I with both future peripheral circuitry and gate type restriction.

We evaluate all of these configurations for both the standard (Fig.3) and transposed (T) (Fig.2b) NV-Net arrays. For the configurations accounting for the effects of peripheral circuitry, we consider both the overhead due to row activation and the overhead due to column activation. We adopt the following notation in the rest of the evaluation: M(conf) and F(conf) correspond to the configuration conf ∈ {I, PC, N, PC+N, FPC, FPC+N} implemented using modern and future MTJs, respectively. A T attached to any configuration denotes the transposed variant. For example, F(FPC)T corresponds to a transposed array using future MTJs, which features the predicted future peripheral circuitry.
Single Inference Pass
We start the evaluation with the performance and energy characterization of a single inference pass, which translates into the processing of a single image for FP-BNN and FINN. This reflects the time and energy required to write the input data, compute all layers, and read the final output values. Tables 5 and 6 report the total latency and energy, respectively, for all networks under different configurations, along with the latency and energy of the baseline FPGA implementations.

Table 5: Overall latency (s).
In this case, as the performance of NV-Net using modern MTJs is not competitive, we only include the ideal case M(I) as a best-case representative. We do not report all configurations due to space constraints, but provide a representative sample: specifically, we report ideal configurations (such as F(I)) along with configurations accounting for all practical overheads (such as F(FPC+N)), including the transposed variants. We observe that NV-Net with future MTJs is still slower than the FPGAs overall, with roughly 10× the latency. However, NV-Net offers a considerable energy improvement, by approximately 15-50× across the board. The energy for transposed NV-Net is very similar to non-transposed, because both perform nearly the same number of operations. That said, transposed NV-Net typically has a lower latency due to a more favorable data layout, which allows for more efficient memory transfers.

BioNET: Using the same approach as for the previously described networks, we implement BioNET on NV-Net and report its latency and energy consumption in Table 7. Consistent with previous findings, the FPGA is faster but NV-Net is more energy efficient. As the application is not real-time, throughput is a more significant metric than latency. In the next section we show how we can achieve high throughput while benefiting from NV-Net's energy efficiency.
             FPGA [18]   F(I)T      F(FPC+N)T
Latency (s)  9.95e-8     1.73e-5    2.21e-5
Energy (J)   2.37e-7     1.07e-8    1.10e-8

Table 7: Latency and energy characterization for BioNET.
Pipelining and Scaling
In this section we show how the scalability of NV-Net can be used to drastically increase performance while still taking advantage of the inherent energy efficiency of MTJ-based computation. Since inference is performed across multiple arrays, an opportunity for pipelining exists: once data transfer between arrays is complete, computation in each array is independent and can proceed in parallel. Given that NV-Net is energy efficient, many additional arrays can be added to the neural network implementations while maintaining a modest power budget. NV-Net arrays are individual banks, many of which can reside on a single chip. Arrays that are dedicated to consecutive layers can be placed near each other to minimize the distance data must travel. The cost of this communication is the latency and energy required to read from the source array and to write to the destination array. Since different banks can be accessed simultaneously, the destination array can be written at the same time the source array is being read, nearly halving the latency. However, to be conservative in estimating the communication overhead, we assume that these memory transfers are entirely sequential.
The base pipeline configurations are those described in Sect.4: 9 arrays for the convolutional networks and 5 arrays for the fully-connected networks. With pipelining, however, each array can be computing a layer for a different input image at the same time. Effectively, convolutional networks are computed on a 9-stage pipeline and fully-connected networks on a 5-stage pipeline. Additional NV-Net arrays are then added incrementally.
Each time an array is added, it is dedicated to the layer where it increases the overall throughput the most (a sketch of this greedy allocation follows below). Adding NV-Net arrays is analogous to adding ALUs to a traditional pipeline. Throughput and power are reported for each number of arrays dedicated to the network; both scale with the number of arrays, while energy per image remains nearly constant. FP-BNN [18] does not report specific throughput numbers, but its Cifar-10 classifier is reported to have slightly less than 4× the throughput of the Cifar-10 classifier in FINN [27]. Thus, we estimate the throughput of [18] as 4× the throughput of [27] and use that as the comparison point for the NV-Net implementation. For the MNIST classifier, FINN uses a smaller network configuration than FP-BNN; thus, we compare the throughput of the NV-Net MNIST classifier with the FP-BNN topology to the throughput of FINN.
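The allocation policy can be sketched as follows, under the simplifying assumption that dedicating n arrays to a layer divides that stage's effective latency by n; the latencies in the example are made up for illustration.

```python
def allocate_arrays(layer_latencies, extra_arrays):
    """Greedily dedicate each additional array to the pipeline stage that currently
    limits throughput, i.e., the layer with the largest effective latency."""
    arrays = [1] * len(layer_latencies)               # base pipeline: one array per layer
    for _ in range(extra_arrays):
        stage_time = [lat / n for lat, n in zip(layer_latencies, arrays)]
        arrays[stage_time.index(max(stage_time))] += 1   # add where it helps the most
    throughput = 1.0 / max(lat / n for lat, n in zip(layer_latencies, arrays))
    return arrays, throughput

# Illustrative per-layer latencies (s) for a 5-stage fully-connected pipeline.
arrays, tput = allocate_arrays([2e-4, 8e-4, 8e-4, 8e-4, 1e-5], extra_arrays=20)
print(arrays, f"{tput:.0f} images/s")
```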
A few trends apply to all experiments, as captured by Figs. 10, 11, 12, and 13 . The x-axis represents the number of arrays used by NV-Net. The y-axis reports the corresponding throughput and power consumption when using a specific number of arrays.
When using modern MTJs, the power consumption quickly becomes excessive. The NV-Net implementation is still capable of achieving high throughput given a sufficient number of arrays, but the power cost makes it impractical: the power consumption is comparable to CPU/GPU implementations by the time the throughput matches that of the FPGA designs.
Future MTJs are much more promising. With a sufficient number of arrays, an NV-Net based implementation is capable of significantly higher throughput than the FPGA-based designs, using a fraction of the power. Restricting gate types to only NAND and NOT gates reduces the throughput slightly, due to the increased number of gates required to perform the same functions. Peripheral circuitry is more costly: predicted future peripheral circuitry typically reduces the throughput to less than half. Modern peripheral circuitry used with future MTJs, the F(PC) configuration, severely reduces the throughput and incurs a high power budget, revealing that the peripheral circuitry is critical. Given its poor performance, F(PC) is not shown in the graphs.
The impact of peripheral circuitry, however, is mitigated by the access patterns involved in computation within the array. During communication and the initiation of computation, the cost of peripheral circuitry is high, due to the specification of row addresses. However, the number of active rows does not need to be changed for long periods of time during computational phases. Additionally, the latency and energy cost associated with applying voltages to the bitlines is quite high relative to that of MTJ switching. However, each bitline is shared by many MTJs (and logic operations) and thus this cost is amortized. The peripheral circuitry has less impact during these phases.
F(I) and F(FPC) with the predicted future peripheral circuitry show significant energy efficiency. Even with the addition of hundreds of arrays, the power consumption is only a few Watts. Counter-intuitively, the addition of the future peripheral circuitry reduces the power consumption. However, this does not indicate increased energy efficiency. The peripheral circuitry increases the latency, so while the power is lower, the energy consumption is actually higher, as shown in Table 6 .
Restricting operations to NAND and NOT increases the power consumption. This is for two reasons. One is that it increases the length of the high power computation phases relative to the low power (due to high latency) communication phases. Second, NAND and NOT gates take more energy than higher fan-in gates due to a higher input resistance (and therefore higher input voltage). The implementations on transposed NV-Net outperform those on standard NV-Net in most cases. This is because a more optimal data layout configuration is possible in the transposed configurations. This results in a significant reduction in the communication delay between arrays. As they are performing the same computation as the standard NV-Net versions but at a higher rate, they also consume more power.
Fully-Connected FP-BNN & FINN:
The throughput and power consumption for FP-BNN are shown in Fig.10 and for FINN in Fig.11. The NV-Net networks achieve a throughput comparable to the FPGAs with the addition of anywhere between 25 and 250 arrays, depending on the configuration. Where the throughput breaks even, the power budget for NV-Net is only a few Watts. Thus, the FPGAs have a latency advantage, but NV-Net is much more energy efficient.

Convolutional FP-BNN & FINN: Throughput and power results for FP-BNN are shown in Fig.12 and for FINN in Fig.13. The FPGA implementations have a lower latency, but again, NV-Net is more energy efficient. Interestingly, the difference in latency between the FPGAs and NV-Net is smaller for convolutional networks. This is due to the scalability of NV-Net: the larger data sets enable a higher degree of computational parallelism within the array. Since NV-Net is memory based, it has a large capacity and can handle these larger datasets with less additional overhead. NV-Net is able to achieve the same throughput with typically fewer than 100 arrays.

BioNET: BioNET is a much smaller network than either the fully-connected or convolutional networks we covered so far. While genomic input datasets are huge, each binary network input is quite small. As we experiment with a fixed NV-Net array size, this translates into a significant amount of wasted space in memory. It is noteworthy that NV-Net could achieve similar performance on BioNET even if it were to use much smaller arrays. However, in the interest of uniformity, and since we are targeting large-scale, in-memory applications, we keep the configuration the same. Regardless, NV-Net is still capable of a high throughput at very low power consumption, as Fig.14 shows.

Putting It All Together: The most noticeable advantage of NV-Net is its energy efficiency. This makes it an ideal candidate for mobile applications where power consumption is critical. For example, the FP-BNN Cifar-10 classifier, which achieves an accuracy of 86.31%, could process 60 FPS at a power budget of 39.6 µW when accounting for predicted future peripherals and reliable NAND-only operation. However, NV-Net is not restricted to such target applications. If high performance is desired, it can easily be scaled up to achieve state-of-the-art throughputs; even at such throughputs, the power consumption remains only a few Watts. Such scalability is another key advantage of NV-Net. By performing computation in memory, the large penalties for data transfers are mitigated, and the capacity allows a large number of parameters to be stored entirely on chip. These properties make NV-Net particularly well suited for large-scale applications that operate on large volumes of data. This is reflected in the fact that NV-Net is slow relative to an FPGA for a single image inference, but is capable of a much higher throughput.
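The power figures in the scaling study follow directly from throughput and per-image energy; a quick back-of-the-envelope check of the quoted mobile operating point is shown below (the per-image energy is derived from the stated numbers, not an independently reported value).

```python
fps = 60                      # quoted frame rate
power_w = 39.6e-6             # quoted power budget, watts
energy_per_image = power_w / fps
print(f"{energy_per_image * 1e6:.2f} uJ per inference")   # ~0.66 uJ per image
```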
RELATED WORK
(B)NN Acceleration on Traditional Compute Substrates: Numerous NN accelerators have been proposed for forward propagation. For example, DaDianNao [5] uses a multi-chip system to implement high-precision networks, and Eyeriss [4] develops a dataflow that improves energy efficiency on a spatial architecture. Many FPGA accelerators also exist [16, 20, 24]. FPGA accelerators capitalizing on the benefits of binarization form the most relevant subset of this rich body of work, including [18, 27, 32], which achieve significantly higher throughput and energy efficiency than competing CPU- and GPU-based implementations. SRAM-based acceleration is proposed in [9]; however, the limited density of SRAM makes it ill-suited for large data sets.
(B)NN Acceleration on RRAM: When it comes to in- or near-memory NN acceleration using non-volatile memory, RRAM represents the most common substrate. RRAM, like STT-MRAM, is a resistive memory technology, where the state of a cell is stored in the resistivity of the material. Notably, RRAM cells offer multiple resistive states, and the ratio between the extreme states exceeds the TMR of STT-MRAM. However, RRAM degrades with use: the state can be switched only a limited number of times before the device begins to fail. A few reduced-precision and binary neural networks have been implemented in RRAM. Typically, these networks take on a different form than the structure proposed in this work. RRAM accelerators usually store the weights of a network in a crossbar; the inputs (neuron values) are the voltages applied to the wordlines, and the outputs of the operations are the currents on the bitlines, which are sent to an ADC (analog-to-digital converter) for multi-bit precision networks or to a sense amplifier for binary networks. Most implementations rely on external digital logic circuits for a significant amount of the computation, such as the addition and thresholding. Thus, the RRAM crossbar is typically used just as an accelerator for the matrix-vector multiplications. This is in stark contrast with NV-Net, where all operations are performed within the memory array itself.
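The following sketch models the crossbar scheme just described as a plain matrix-vector product in software. It is a schematic abstraction of the general approach, not of any specific published design; the conductance values, input encoding, and sense-amplifier reference are hypothetical.

```python
# Schematic software model of an RRAM crossbar: binary inputs drive the
# wordlines, weights are stored as conductances, and each bitline current is
# the analog dot product, which a sense amplifier (binary case) or an ADC
# (multi-bit case) then digitizes. Values are illustrative only.
import numpy as np

G_ON, G_OFF = 1.0, 0.01                               # low/high-resistance states, arbitrary units
weights = np.random.choice([-1, +1], size=(4, 8))     # 4 output neurons x 8 inputs
G = np.where(weights > 0, G_ON, G_OFF)                # map +1 / -1 weights to conductances

x = np.random.choice([0, 1], size=8)                  # binary inputs applied as wordline voltages
i_bitline = G @ x                                     # Kirchhoff summation of cell currents per bitline

# Binary network: a sense amplifier compares each bitline current against a
# reference; a multi-bit network would pass i_bitline to an ADC instead.
reference = 0.5 * G_ON * x.sum()
outputs = (i_bitline > reference).astype(int)
print(outputs)
```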
As representative examples, convolutional MNIST classifiers were proposed in [29] and [28]. In [28], the authors binarized the output from the RRAM crossbar to avoid ADCs and used the inputs as selection signals to avoid DACs (digital-to-analog converters). This is significant, as ADCs and DACs typically consume the majority of the energy and power in convolutional RRAM networks. Several networks were implemented this way; the most energy-efficient one that matched the accuracy of the networks evaluated in our study achieved 98.47% at 2.58 µJ/image. A similar approach was also used in [26] to classify MNIST and ImageNet data.
On the other hand, the design from [25] implements the fully-connected layers of a convolutional neural network. It activates multiple wordlines simultaneously, stores complementary weights in two columns, and uses a sense amplifier to compare the difference in current between the two bitlines. This allows multiplication, pop-count, and thresholding to occur at the same time within the array, and removes the need for an external adder and comparator. Their network has two 64-neuron layers followed by a 10-neuron output layer. An NV-Net implementation of this same network would be slightly more energy efficient if peripheral circuitry overheads could be minimized; considering practical overheads due to modern peripherals, however, NV-Net becomes less energy efficient and considerably slower.
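As a functional illustration of the complementary-column idea summarized above (a software abstraction, not the circuit of [25]), the sketch below shows how taking the sign of the difference between two bitline currents collapses multiplication, pop-count, and thresholding into a single access. The layer sizes and the {0,1} input encoding are assumptions for illustration.

```python
# Functional sketch of the complementary-column scheme: each weight occupies
# two cells, one per bitline, and the sense amplifier outputs the sign of the
# current difference between the two bitlines.
import numpy as np

def fc_layer_complementary(x, W):
    """x: binary inputs in {0,1}; W: binary weights in {-1,+1}, shape (out, in)."""
    pos = (W > 0).astype(int)            # cells on the 'positive' bitline (weight = +1)
    neg = (W < 0).astype(int)            # cells on the 'negative' bitline (weight = -1)
    i_pos = pos @ x                      # current contributed by the +1 weights
    i_neg = neg @ x                      # current contributed by the -1 weights
    return (i_pos >= i_neg).astype(int)  # sense amp: sign of the difference

x = np.random.choice([0, 1], size=64)
W = np.random.choice([-1, +1], size=(10, 64))
print(fc_layer_complementary(x, W))      # 10 binarized neuron outputs
```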
That said, since it covers only fully-connected layers, the advantage of this RRAM design [25] is limited to small networks: increasing the input size degrades the accuracy.
Scaling to the networks evaluated in our study, which are multiple orders of magnitude larger, would make a direct implementation of the analog computing approach of this RRAM design impossible. At the same time, it is subject to errors and noise due to the analog nature of computing, which may not always be masked by the implicit noise tolerance of NNs. In fact, as recent work [10] has demonstrated, NNs on similar RRAM arrays may suffer from significant accuracy loss due to noise.

(B)NN Acceleration on Spintronic PIM: Other spintronic substrates featuring computation capability within the memory array also exist. Pinatubo [17] proposes an architecture to perform general bitwise operations in non-volatile memory. The authors in [2] use an MTJ subarray as part of an accelerator for low bit-width convolutional NNs. Recent work also covers multi-level MRAM cells to implement BNNs [19]. Contrary to NV-Net, all of these platforms use sense amplifiers to perform computation. In addition, all use additional digital logic circuitry: Pinatubo [17] embeds digital logic circuits into the memory for inter-subarray computation; the authors in [19] use an auxiliary processing unit to perform batch normalization, multiplication, and pooling; and the design in [2] performs operations such as bit counting, summation, quantization, and batch normalization external to the array. This is contrary to NV-Net, which contains no additional digital logic circuitry and performs all layers of the network entirely within the memory array. Another recent paper [11] proposes MTJ-based computation without sense amplifiers in a crossbar topology. As opposed to [11], NV-Net can support true PIM semantics at scale: effectively, an NV-Net array is no different from standard STT-MRAM when not used for computation. Pinatubo [17] is not explicitly tailored to NN acceleration, and the convolutional networks implemented in our study are an order of magnitude larger than the network proposed in [19]. Our baseline spintronic PIM substrate, CRAM, was introduced in [6], and evaluated for simple and very small-scale (non-binary) NNs (limited to a single-neuron digit recognizer and 2D convolution, specifically) in [30]. As we cover in Sect.2.2, this basic memory cell and array structure cannot support BNN acceleration without modification.
CONCLUSION
In this paper, we explored the design space of binary neural network (BNN) forward-propagation acceleration on a spintronic processing-in-memory (PIM) substrate. The result, NV-Net, is a scalable, high-throughput, and, above all, highly energy-efficient solution. We demonstrated that NV-Net can efficiently perform all core building blocks required for BNN forward propagation (XNOR, pop-count, batch normalization, and thresholding) entirely within NV-Net arrays, without any need for external circuitry, be it analog or digital, to offload computation.
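For reference, the following minimal sketch spells out the arithmetic of these building blocks in plain Python: XNOR acts as the binary multiply, pop-count as the accumulate, and batch normalization folds into a per-neuron threshold. NV-Net performs these steps within the memory array itself; the sketch only checks the underlying arithmetic, and the 8-bit width and example threshold are assumptions.

```python
# Minimal functional model of a BNN neuron: XNOR multiply, pop-count
# accumulate, and batch norm folded into a threshold.
def bnn_neuron(x_bits: int, w_bits: int, n_bits: int, threshold: int) -> int:
    """x_bits, w_bits: n_bits-wide integers encoding {-1,+1} as {0,1}."""
    mask = (1 << n_bits) - 1
    xnor = ~(x_bits ^ w_bits) & mask      # 1 wherever input and weight agree
    popcount = bin(xnor).count("1")       # number of +1 products
    acc = 2 * popcount - n_bits           # signed sum of the +1/-1 products
    return 1 if acc >= threshold else 0   # batch norm folded into the threshold

# Example: weights agree with the inputs in 6 of 8 positions -> acc = 4 -> fires.
x, w = 0b11110000, 0b11110011
print(bnn_neuron(x, w, n_bits=8, threshold=0))
```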
Our analysis revealed that the key strength of NV-Net is energy efficiency. A single inference pass on NV-Net can take significantly longer than on representative FPGA-based hardware alternatives, which are among the best-performing solutions in this context. However, achieving competitive and even better throughput with NV-Net, while preserving the energy efficiency gains, is straightforward, because NV-Net lends itself very well to array-level parallelism, and to pipelining computation with the (relatively slower) communication, in addition to the inherent intra-array row-level parallelism. We demonstrated that a practically feasible number of NV-Net arrays can outperform the throughput of the FPGA counterparts while consuming only a fraction of their power. This makes NV-Net a candidate for both low-power mobile and high-performance computing applications.
Another key contribution of this study is novel memory cell and array architecture designs for spintronic PIM, which we incorporated in NV-Net arrays, and which enabled efficient BNN processing. NV-Net arrays are implicitly reconfigurable, as any row in the array can participate in any kind of computation, according to the algorithmic needs of the underlying problem. Therefore, the presented NV-Net designs can also be expanded to other acceleration problems similar in nature (in terms of computation and communication characteristics) to BNN.
