A new spintronic nonvolatile memory cell analogous to 1T DRAM with non-destructive READ is proposed. The cells can be used as neural computing units. A dual-circuit neural network architecture is proposed to leverage these devices against the complex operations involved in convolutional networks. Simulations based on HSPICE and MATLAB were performed to study the performance of this architecture when classifying images as well as the effect of varying the size and stability of the nanomagnets. The spintronic cells outperform a purely charge-based implementation of the same network, consuming ≈ 100-pJ total energy per image processed.
I. INTRODUCTION
A hardware implementation of feedforward neural networks must incorporate three basic functionalities: a dot-product engine that can compute convolution and fully connected layer operations, memory elements that can store intermediate inter- and intra-layer results, and components that can compute a nonlinear activation function. Many purely charge-based implementations of these, with varying levels of efficiency, have been proposed. The dot product has been successfully implemented in hardware in various ways [1]-[4]. Using these dot-product circuits, a cellular neural network (CeNN) accelerator for convolutional neural networks (CoNNs) was designed in [5]. There have been several proposals in recent years for spintronic implementations of CeNNs or CeNN-like structures in hardware, which could be tapped to produce an efficient spintronic CoNN accelerator. Among these are magnetic tunnel junction (MTJ)-based platforms using spin diffusion [6], current-driven magnetic domain walls, and spin-Hall torque [7]-[9]. Another platform that uses inverse spin-orbit magnetoelectric stacks instead of MTJs has also been proposed. The latter, though more exploratory than MTJ-based devices, was predicted to perform basic CeNN functions both more quickly and more efficiently than the others [10]. In this article, we propose to utilize this platform as a hybrid memory/CeNN cell with high energy efficiency that can be used as analog memory with a built-in activation function. The performance of these cells is simulated in a CeNN-accelerated CoNN performing image classification based on [5]. The spintronic cells significantly reduce the energy and time consumption relative to their charge-based counterparts, needing only ≈ 100 pJ and ≈ 42 ns to compute all but the final fully connected CoNN layer while maintaining high accuracy.
II. IRMEN

A. IRMEN NEURONS
The inverse Rashba-Edelstein magnetoelectric neuron (IRMEN) has been thoroughly described, including its relationship to standard CeNN cells, in [10]. A brief summary will be given here. The cell bears some resemblance to the standard cell of a CeNN in that it is based around a capacitor (see Fig. 1) [11]. However, the capacitor represents an input mechanism rather than the true state. A magnetoelectric material within the capacitor is coupled to the ferromagnet (FM) that makes up one of its electrodes, thereby allowing control of the FM via electric charge on the capacitor [10], [12]-[14]. Readout is accomplished by driving a charge current first through the FM to spin-polarize the current and then through an inverse spin-orbit stack that transduces from spin current to charge potential along an axis orthogonal to both current flow and spin orientation [15]-[17]. This is modeled as a voltage source with the inverse Rashba potential V_IR and internal resistance R_IR [10]. The IRMEN cells natively compute a nonlinear transfer function on their inputs by virtue of the M-H curve dictated by the anisotropy of the nanomagnet. This transfer function can be tuned by varying the anisotropy characteristics. Example transfer functions are shown in Fig. 2, assuming uniaxial crystalline anisotropy along the length dimension.
B. IRMEN NONVOLATILE MEMORY
The original proposal considered the IRMEN cells only as real-time computational machines [10]. We now propose that they can also operate as analog memory. Memory is a crucial component of any hardware-based implementation of a neuromorphic computing scheme. In anything more complex than a simple fully connected network, such as a CoNN, the input values to any given layer need to be referenced multiple times [18]. This necessitates the inclusion of some form of analog or digital memory accompanied by analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). The simplest solution is to use conventional 1T DRAM cells. However, one of the issues from which standard 1T DRAM suffers is the loss of charge upon READ. If the state is not refreshed, the leakage of repeated READ cycles will degrade the stored state. The IRMEN discussed earlier, besides being able to realize neuromorphic CeNN-like operations, can also function as a memory. Because it serves a dual purpose as a neural computing unit and analog memory, the IRMEN offers enhanced efficiency. Here, the state is stored in the capacitor. The IRMEN readout mechanism, crucially, does not involve accessing the charge stored in the cell capacitor. This charge may remain safely locked in place, while the cell state is read out through injection of current into the FM. Although this temporarily disturbs the electric field across the capacitor, and therefore the steady-state magnetization, a sufficiently swift READ pulse ensures that the READ completes while the FM is only beginning to move; how far it moves depends on the relative time scales of electrical and magnetic motion. More importantly, the cell will return to the same state after any perturbation caused by the READ without any loss of information. Subsequent READs will obtain the same value as the first. The perturbation during READ, even if non-negligible, will be consistent and can be accounted for.
We note that the intrinsic WRITE energy of the IRMEN memory is equal to the charging energy of the magnetoelectric capacitor. Using values from Table 1, we estimate the intrinsic WRITE energy to be no more than 10 aJ, putting it on par with typical 1T DRAM values. This value scales with the capacitor area, and it does not include the energy dissipated by external circuitry during the WRITE process. Including that circuitry, again using values from Table 1, we estimate the total WRITE energy to be between 0.24 and 0.79 fJ for one cell. While this is higher than some other nonvolatile memories such as STT-RAM, the higher energy is compensated for by the analog, as opposed to digital, nature of the IRMEN cell as well as its ability to function as a neural computing unit in addition to memory [19].
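As a rough guide to how the intrinsic estimate scales, the energy stored on an ideal parallel-plate capacitor can be written as below. The symbols C, V, ε_r, A, and d (capacitance, WRITE voltage, relative permittivity, plate area, and dielectric thickness) are introduced here for illustration only, and the parallel-plate idealization is an assumption rather than the device model of [10]; the actual values are those of Table 1.

```latex
E_{\mathrm{WRITE}} \approx \tfrac{1}{2}\, C V^{2},
\qquad
C = \frac{\varepsilon_0\, \varepsilon_r\, A}{d}
```

Under this idealization, at fixed voltage and thickness, the intrinsic WRITE energy grows linearly with the capacitor area A.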
A basic simulated demonstration of the functionality of the IRMEN memory cell is shown in Fig. 3. The magnetization is switched between two different levels, held for an extended time of 12 ns by the charge stored on the capacitor, and then switched again. The drive current, I_DRV, is pulsed for periods of 200 ps to generate a readout potential at various points throughout the simulation. The simulation results demonstrate the ability of the IRMEN to store a value, encoded as charge, and provide a transformed readout value on command. We note that there is some noise in the readouts shown in Fig. 3(d). In order to quantify this error, we performed more exhaustive simulations. For nanomagnets of several different sizes, Monte Carlo simulations of 12 300 iterations were performed. In each iteration, an input current was applied and the neuron was then prompted to provide a readout. The results approximated a saturated linear transfer function as in the ideal CeNN cell [11] with some error. After normalizing according to the base input current and output voltage values as described in Section IV, the error is calculated as the difference between an idealization of the transfer function and the actual readout. The average magnitude of the normalized error is shown in Fig. 4(a). Histograms indicating the range of errors for the largest and smallest nanomagnets are shown in Fig. 4(b), displaying the significant increase in thermal broadening for smaller, less stable magnets.
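The error metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in this work: the function names are hypothetical, and the Gaussian noise in `synthetic_readouts` is only a stand-in for the thermally broadened readouts that the Monte Carlo simulations produce.

```python
import random


def saturated_linear(x):
    """Ideal CeNN output: linear for |x| <= 1, clipped to +/-1 outside."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def mean_abs_error(inputs, readouts):
    """Average |readout - ideal| over normalized (input, readout) pairs."""
    errors = [abs(r - saturated_linear(x)) for x, r in zip(inputs, readouts)]
    return sum(errors) / len(errors)


def synthetic_readouts(inputs, sigma, seed=0):
    """Hypothetical noisy readouts: ideal value plus Gaussian noise.

    ASSUMPTION: Gaussian noise of width sigma is a placeholder for the
    thermal fluctuations of the nanomagnet, not a physical model.
    """
    rng = random.Random(seed)
    return [saturated_linear(x) + rng.gauss(0.0, sigma) for x in inputs]


if __name__ == "__main__":
    xs = [-2.0 + 4.0 * i / 999 for i in range(1000)]
    for sigma in (0.02, 0.2):
        print(sigma, mean_abs_error(xs, synthetic_readouts(xs, sigma)))
```

A larger noise width yields a larger mean error magnitude, which mirrors the qualitative trend reported for smaller nanomagnets.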
Section III describes a network of IRMEN cells for CoNN implementation.
III. CONVOLUTIONAL NETWORKS WITH IRMEN CELLS
Lou et al. [5] proposed using standard, purely charge-based CeNN cells with weighted, programmable operational transconductance amplifier (OTA) current sources to produce an in-hardware implementation of a CoNN [1]. The CoNN in question used the rectified linear unit (ReLU) activation function to transform the results of each dot product. The CoNN also used the local pooling function. Pooling reduces the size of the convolutional image maps and introduces some robustness to feature translation by collecting the largest numerical output from each of several subsets of the maps and disregarding all other data. Lou et al. [5] described how the ReLU activation and local pooling functions can be emulated by a sequence of CeNN operations so long as the output values are not too large. In this way, all but the fully connected output layer of a CoNN could be implemented via simple CeNN cells. We propose to improve on the charge-based CeNN/CoNN model in [5] by replacing the purely charge-based cells with more energy-efficient IRMEN cells. The CoNN structure to be emulated is shown in Fig. 5.
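For reference, the two operations that the CeNN templates must emulate can be written in software as follows. This sketch shows only the plain mathematical definitions of ReLU and non-overlapping max pooling; it does not reproduce the CeNN template sequences of [5], and the function names and the default 2 × 2 window are assumptions made for illustration.

```python
def relu(x):
    """Rectified linear unit: pass positive values, zero out the rest."""
    return x if x > 0.0 else 0.0


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def activate_and_pool(feature_map, size=2):
    """Apply ReLU element-wise, then max-pool the result."""
    activated = [[relu(v) for v in row] for row in feature_map]
    return max_pool(activated, size)


if __name__ == "__main__":
    fm = [
        [-1.0, 2.0, 0.0, -3.0],
        [4.0, -5.0, 1.0, 0.5],
        [0.0, 0.0, -1.0, -2.0],
        [-3.0, -4.0, -5.0, -6.0],
    ]
    print(activate_and_pool(fm))
```

Each 2 × 2 window of the 4 × 4 map collapses to its largest activated value, so the map shrinks by a factor of two in each dimension.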
A. COUPLED CeNNs FOR MEMORY AND COMPUTING
To fully leverage the IRMEN capabilities, we propose a dual-functionality operation mode. Each stage corresponds to one CeNN operation, represented by a specific CeNN template imposed upon the data as it is gradually transformed from the initial input to the final output. Each stage of computation is implemented by one of a pair of identical IRMEN CeNNs similar to the structure in [5]. One network of the pair holds the input, obtained from the previous stage and locally stored, and supplies it to the other network, which processes it and stores the result (see Fig. 6).
The proposed dual-CeNN design allows us to avoid using a dedicated state storage memory module and most of the associated peripheral circuits such as ADCs and DACs. Weight storage must still be implemented, although additional IRMEN memory cells could be included for this purpose. We expect significant savings in operational energy and possibly also speed as a result. Although the delay between the stages must be on the order of nanoseconds to allow sufficient time for the magnetizations to respond to the charge placed on the capacitors, each stage only needs to be powered for about 130 ps, which is the estimated combined delay of the OTAs [1] and other electrical components. The total operational delay between each stage is 1.5 ns, which is mostly due to the magnetic switching delay as previously mentioned.
In stage 1, network A acts as memory cells for B; in stage 2, network B acts as memory for A.
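The alternating roles of the two networks resemble a ping-pong buffer, which can be sketched as follows. This is a schematic illustration only: `run_stages` is a hypothetical name, and each template is reduced to an element-wise Python function, whereas a real CeNN template acts on a neighborhood of cells. Only the 1.5 ns per-stage delay is taken from the text.

```python
def run_stages(initial_state, templates, stage_delay_ns=1.5):
    """Alternate two buffers: one holds the inputs, the other computes.

    ASSUMPTION: each template is an element-wise function, a
    simplification of a CeNN template acting on a cell neighborhood.
    Returns the final state and the accumulated stage delay in ns.
    """
    buffers = {"A": list(initial_state), "B": [0.0] * len(initial_state)}
    source, target = "A", "B"
    for template in templates:
        # The source network is read (non-destructively) as memory,
        # and the target network stores the transformed values.
        buffers[target] = [template(v) for v in buffers[source]]
        # Swap roles for the next stage.
        source, target = target, source
    total_delay_ns = stage_delay_ns * len(templates)
    return buffers[source], total_delay_ns


if __name__ == "__main__":
    state, delay = run_stages([1.0, -2.0], [lambda v: 2 * v, lambda v: v + 1])
    print(state, delay)
```

At 1.5 ns per stage, a total delay of 42 ns would correspond to 28 stages; that stage count is inferred here from the ratio of the two reported figures and is not stated explicitly in the text.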
IV. SIMULATION
This section describes the methods by which we simulate IRMEN CeNNs emulating an image classification CoNN as previously described. The results of this simulation as measured by the predicted accuracy and energy cost are also presented herein.
A. SIMULATION SETUP
In order to evaluate the utility of the proposed IRMEN-based architecture, we tested it on the MNIST handwritten digits data set. The structure of the CoNN to be implemented via IRMEN CeNNs is shown in Fig. 5 and is identical to one of the test cases in [5]. In the interest of time, the training of the weights was implemented using TensorFlow with a custom transfer function representing a close numerical approximation to the hysteresis curve of the IRMEN nanomagnets. The trained weights were then given to a MATLAB simulator that used the fourth-order Runge-Kutta method to simultaneously solve the Landau-Lifshitz-Gilbert (LLG) equation and the electrical circuit equations associated with each neuron. The IRMEN magnetodynamical equations used were identical to those of [10]. The FM is modeled using the macrospin approximation. The LLG equation relates the motion of the unit magnetization m to the net field
where γ is the gyromagnetic ratio, µ_0 is the vacuum permeability, α is the Gilbert damping, and H_Eff is the effective field, calculated according to the methods in [10]. The simulation parameters are given in Table 1.
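For orientation, the textbook Gilbert form of the LLG equation, written with the symbols defined above, is

$$\frac{d\mathbf{m}}{dt} = -\gamma \mu_0 \left(\mathbf{m} \times \mathbf{H}_{\mathrm{Eff}}\right) + \alpha \left(\mathbf{m} \times \frac{d\mathbf{m}}{dt}\right).$$

This standard form is assumed here; the exact expression and the field terms used in [10] may differ. The sketch below integrates the algebraically equivalent explicit form with a fourth-order Runge-Kutta step in pure Python. It is not the MATLAB simulator of this work: the effective field is held constant, thermal fields and the circuit equations are omitted, and the damping value, field strength, and time step in the usage example are hypothetical.

```python
import math

GAMMA = 1.760859e11      # electron gyromagnetic ratio, rad s^-1 T^-1
MU0 = 4e-7 * math.pi     # vacuum permeability, T m A^-1


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def llg_rhs(m, h_eff, alpha):
    """Explicit form of the Gilbert equation for a unit macrospin m.

    dm/dt = -gamma*mu0/(1+alpha^2) * [m x H + alpha * m x (m x H)]
    """
    mxh = cross(m, h_eff)
    mxmxh = cross(m, mxh)
    pref = -GAMMA * MU0 / (1.0 + alpha ** 2)
    return tuple(pref * (mxh[i] + alpha * mxmxh[i]) for i in range(3))


def rk4_step(m, h_eff, alpha, dt):
    """One classical RK4 step, followed by renormalization of |m| to 1."""
    def shifted(k, s):
        return tuple(m[i] + s * k[i] for i in range(3))

    k1 = llg_rhs(m, h_eff, alpha)
    k2 = llg_rhs(shifted(k1, dt / 2), h_eff, alpha)
    k3 = llg_rhs(shifted(k2, dt / 2), h_eff, alpha)
    k4 = llg_rhs(shifted(k3, dt), h_eff, alpha)
    new = tuple(
        m[i] + dt / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
        for i in range(3)
    )
    norm = math.sqrt(sum(c * c for c in new))
    return tuple(c / norm for c in new)


def relax(m0, h_eff, alpha, dt, steps):
    """Integrate the macrospin for a fixed number of steps."""
    m = m0
    for _ in range(steps):
        m = rk4_step(m, h_eff, alpha, dt)
    return m


if __name__ == "__main__":
    # Hypothetical values: 1e5 A/m field along +z, alpha = 0.1, 0.1 ps step.
    print(relax((1.0, 0.0, 0.0), (0.0, 0.0, 1e5), 0.1, 1e-13, 20000))
```

With the field held along +z, the magnetization precesses and damps toward the field direction, which is the expected qualitative behavior of a damped macrospin.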
In order to map from the TensorFlow network to an IRMEN simulation, we must translate the arbitrary state and weight values into electrical values associated with the IRMEN cells. A neuron state value can vary between ±1 depending on the magnetization of the associated nanomagnet and is read by measuring V_IR. This value depends on the driving potential in addition to the state of the nanomagnet. We define the maximum output potential V_1 to be the neuron state of 1
The normalized output of a neuron is V_IR/V_1. This voltage is applied to the input of an OTA that consequently produces up to 1 µA of current to inject through the shunt resistor of a subsequent cell (see Fig. 1). The normalized input to a neuron is thus I_IN/(10^−6 A). We use a two-stage OTA design similar to that in [5]. The first stage in the OTA is a differential input pair. The second stage is a current mirror that subtracts two branches of current to obtain a single-ended output while providing a large output impedance. The ratio of the current mirrors between the two stages is set to 2 to save power in the first stage of the OTA. We use multiple OTAs to represent different numbers of bits for weights by power gating. By reprogramming these OTAs, the desired weights are achieved in each step.
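The normalization described above, together with a finite weight resolution, can be illustrated as follows. This is a sketch under stated assumptions: the 1 µA full-scale current and the ratios V_IR/V_1 and I_IN/(1 µA) come from the text, but the function names and the uniform sign-magnitude weight grid are hypothetical and do not describe the actual power-gated OTA encoding.

```python
I_MAX = 1e-6  # A, maximum OTA output current stated in the text


def normalized_output(v_ir, v_1):
    """Neuron output in [-1, 1]: readout potential over its maximum."""
    return v_ir / v_1


def quantize_weight(weight, bits):
    """Round a weight in [-1, 1] onto a uniform signed grid.

    ASSUMPTION: sign-magnitude encoding with 2**(bits-1) - 1 steps per
    sign; the real OTA bank may encode weights differently.
    """
    levels = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, weight))
    return round(clipped * levels) / levels


def ota_current(v_ir, v_1, weight, bits):
    """Current injected into the next cell by one weighted OTA."""
    return quantize_weight(weight, bits) * normalized_output(v_ir, v_1) * I_MAX


def normalized_input(currents):
    """Summed input current of a cell, normalized by the 1 uA full scale."""
    return sum(currents) / I_MAX


if __name__ == "__main__":
    i1 = ota_current(0.5e-3, 1e-3, 1.0, bits=4)
    i2 = ota_current(1e-3, 1e-3, -0.3, bits=4)
    print(i1, i2, normalized_input([i1, i2]))
```

Reducing `bits` coarsens the weight grid, which is the effect studied when the weight precision is varied.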
B. IMAGE CLASSIFICATION RESULTS
The simulated CoNN with IRMEN-based convolution, activation, and max-pooling templates was tested against 10 000 images withheld from the training stage and achieved high image classification accuracy (see Fig. 7). The energy required to perform each step involved in the classification of an image comes from four different terms, where N_A and E_A are the number and charging energy of the various access transistors, respectively; N_OTA, E_OTA, and P_OTA are the number, charging energy, and operating power of the OTAs, respectively [1]; N_DRV and P_DRV are the number and operating power of the IRMEN memory cells, respectively; and τ is the 130-ps powered operating time of an IRMEN memory cell required to apply charge to the input of the next cell (see Section III). Each layer of the network shown in Fig. 5 requires powering a different number of IRMEN cells and OTAs and accompanying access transistors in order to process the entire image. Table 2 compares the energy and delay of the IRMEN network to its charge-based counterpart. These values assume the largest version of the IRMEN cells at 60 × 60 nm². Compared to the previously implemented purely charge-based version, using the IRMEN cells provides significant energy and time savings. According to [5, Table 3], the charge-based CeNN requires about 12 nJ to compute all convolution, pooling, and activation stages, while the IRMEN CeNN needs less than 0.14 nJ. We also note that the overall IRMEN CeNN/CoNN delay is 42 ns, while the CeNN in [5] takes 240.5 ns to perform the same computations. The relationship between energy and FM length is shown in Fig. 7(a). The classification accuracy is plotted against the FM length in Fig. 7(b), showing that greater thermal stability improves overall network performance. However, this comes at the cost of increased energy and area. The correlation between accuracy and energy expenditure is shown in Fig. 7(c).
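One plausible reading of the four energy terms described above is sketched below, followed by the ratios implied by the reported totals. The grouping of the terms in `stage_energy` is an assumption based only on the symbol definitions in the text (charging energies counted once per component, operating powers multiplied by τ); it is not a transcription of the original expression. The energy and delay figures in the ratio calculation are the ones quoted above.

```python
def stage_energy(n_a, e_a, n_ota, e_ota, p_ota, n_drv, p_drv, tau):
    """Energy of one stage as a sum of four terms (ASSUMED grouping).

    n_a, e_a     : number and charging energy of access transistors
    n_ota, e_ota : number and charging energy of OTAs
    p_ota        : operating power of one OTA
    n_drv, p_drv : number and operating power of IRMEN cells
    tau          : powered operating time of a stage (130 ps in the text)
    """
    access = n_a * e_a
    ota_charging = n_ota * e_ota
    ota_operating = n_ota * p_ota * tau
    drive = n_drv * p_drv * tau
    return access + ota_charging + ota_operating + drive


# Ratios implied by the reported totals (charge-based CeNN vs. IRMEN CeNN).
ENERGY_RATIO = 12.0 / 0.14   # nJ / nJ; a lower bound, since IRMEN is "< 0.14 nJ"
DELAY_RATIO = 240.5 / 42.0   # ns / ns

if __name__ == "__main__":
    print(round(ENERGY_RATIO, 1), round(DELAY_RATIO, 2))
```

On these figures, the IRMEN network uses at least about 85 times less energy and is between five and six times faster than the charge-based implementation in [5].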
The origin of this increased accuracy is the reduced error in the saturated linear transfer function computed by each IRMEN cell (see Fig. 2 ). The classification accuracy is plotted directly against the mean transfer function error magnitude in Fig. 7(d) . We also consider the effect of using weights with limited resolution. Although the specific weight storage mechanism is not considered here, it is reasonable to assume a limited representation accuracy. Thus, in Fig. 8 , we show the image classification accuracy versus FM length with the number of bits used to encode a weight value as a parameter, starting with 4-bit precision. Although lowering the precision to 4 bits is detrimental, the performance is still quite good for lengths above 30 nm. For comparison, Fig. 8 also shows a dashed line indicating the performance of an idealized software model. This model achieved 97.55% accuracy, while the best performance of the simulated IRMEN network was 94.13%. We attribute this discrepancy to thermal noise, which causes fluctuations in the IRMEN readout. Even when very small, these fluctuations can occasionally cause changes in the final result as they propagate throughout the network.
We also compare the IRMEN CoNN to a similar spintronic implementation. He and Fan [28] proposed a CoNN based on the skyrmion neuron cluster (SNC). The best classification accuracy the SNC CoNN achieved was 98.74%, compared to 98.92% for an idealized software model. Part of this increased accuracy over the IRMEN CoNN stems from the greater complexity of the CoNN architecture used in [28]. The two convolutional layers of the SNC CoNN contained 6 and 12 templates, respectively [28], as opposed to the 4 templates per layer used in this article to reduce simulation complexity. This results in a 1.37% difference in the idealized software model accuracy (98.92% versus 97.55%) and a corresponding reduction in the simulated accuracy. He and Fan [28] also do not indicate that they have accounted for thermal noise, which we believe to be the main source of the difference between the maximum simulated accuracy and the ideal accuracy in the case of the IRMEN CoNN. The IRMEN devices have a direct advantage in size, as the SNC magnetic films have dimensions of 150 × 60 nm², while the IRMEN magnets can be made as small as 30 × 30 nm² without significant loss of function.
V. CONCLUSION
With the growing importance of neuromorphic computing and beyond CMOS computation, the search for new devices to fill these roles is crucial. We have proposed a novel magnetoelectric analog memory element with a built-in transfer function that also allows it to act as the cell in a CeNN. Simulations of a CeNN-friendly CoNN implemented with these IRMEN cells predict that highly accurate and low-power image classification can be achieved with this device. These results are superior to those predicted for purely charge-based implementations of the same network. This clearly demonstrates the benefits of applying spintronics to neuromorphic computing.
