Abstract-A spin-CMOS hybrid design for neural networks is presented. We employ spin torque switched nano-magnets to realize ultra low power, high speed neurons, and domain wall magnets for compact, programmable synapses. The spin based neuron-synapse units operate locally at an ultra low supply voltage of 30 mV, resulting in low computation power. CMOS based, longer distance inter-neuron communication achieves high integration. We corroborate circuit operation with physics based models developed for the spin devices. Simulation results for a benchmark application show a 95% improvement in power consumption as compared to a 45 nm CMOS design.
I. INTRODUCTION
Hardware implementation of computation architectures based on artificial neural networks (ANN) has always been challenging in terms of power consumption, level of integration and throughput. Prior work in this field involved the development of circuit models for neurons and synapses using CMOS, and in general employed a large number of transistors [1, 2, 3]. Apart from resulting in high power consumption, this limited the degree of integration.
The application of programmable resistive elements like TiO2 memristors [4, 5] and phase change memories [6] has led to compact synapse models and has taken neuromorphic hardware design a step forward. The resistive elements, however, dissipate a significant amount of power per computation. Also, integration and thresholding of analog charge current values, received through multiple resistive synapses, require complex analog circuitry for the neuron. This results in a power hungry implementation for high speed operation.
In order to tap the potential of neural network based computation at the hardware level, the device-circuit models for the neuron and the synapse, apart from being compact, should also achieve low power consumption. Thus, it would be highly desirable to have an energy efficient device structure that couples the operations of the neuron and the synapse into a compact, modular and homogeneous unit, which can be integrated with state of the art IC technology.
In this work we propose the application of spin devices in ANN hardware design. Theoretically, spin transfer torque (STT) induced switching of scaled nano-magnets requires orders of magnitude less energy than charge based capacitive switching [10]. A nano-magnet can therefore act as a low power and compact thresholding unit. All spin logic (ASL) design based on spin majority evaluation using the lateral spin valve structure has been proposed previously [9, 11, 15]. We show that, with an appropriate clocking scheme, a spin majority gate with weighted inputs mimics the neuron-synapse functionality. The programmable spin injection strength of a domain wall magnet can be used to implement a compact synapse. In the proposed neuron-synapse model, charge current flows through a low resistance path that consists of the nano-magnets and non-magnetic metal channels. This allows the application of ultra low terminal voltages, resulting in low power consumption.
Energy dissipation for spin mode computation increases steeply with the separation between nano-magnets. This is due to the limited spin diffusion length of non-magnetic channels [9, 10]. Hence, spin mode signaling between two neuron units would prove inefficient. Therefore, in this work we employ a CMOS based, charge mode inter-neuron signaling scheme in order to allow a high degree of on-chip integration. The programmable, spin-CMOS hybrid ANN architecture combines the benefits of localized, spin based ultra low energy computation and robust charge mode communication. The proposed neural network architecture employing spin based neuron-synapse units can be suitable for low power programmable hardware for both cognitive and Boolean computation.
The rest of the paper is organized as follows. Section 2 describes the operation of the spin majority gate based on the lateral spin valve (LSV). A detailed description of the proposed neuron-synapse model is given in section 3. Section 4 discusses system level integration. The device simulation framework employed in this work is briefly discussed in section 5. The performance of the spin based ANN design for a benchmark application (handwriting recognition), and its comparison with 45 nm CMOS analog and digital designs, is given in section 6. Summary and conclusions are given in section 7.
II. LATERAL SPIN VALVE
Two different methods of current induced STT switching of nano-magnets have been proposed in recent years. The first involves injection of spin polarized charge current into a nano-magnet through another magnet. This phenomenon has been widely explored for magnetic tunnel junction (MTJ) based memory applications [24, 25]. More recently, a second strategy has been demonstrated, which employs pure spin current injection to flip a nano-magnet [7, 8]. Fig. 1a shows the lateral spin valve (LSV) structure employed in this method. It consists of a transmitting magnet and a receiving magnet connected through a non-magnetic channel. The two stable states of a magnet (left and right spin) are determined by the magnet anisotropy (uniaxial anisotropy, K_u) [9, 10]. Electrons flowing into the channel through the transmitting magnet (which is pointing "right") are polarized in the "right" direction when they reach the magnet-channel interface. As a result, the channel below the input magnet is populated with electrons with right spin polarization. Spin polarized charge current is modeled as a four component quantity: one charge component and three spin components (Is_x, Is_y, Is_z) [9, 10]. The charge component of the input current flows into the ground lead. The output magnet-channel interface absorbs the transverse spin components of the current, which in turn exert spin torque on the output magnet and cause it to flip. Owing to the separation of the spin diffusion current responsible for nano-magnet switching from the charge current flow, spin transport in the lateral spin valve is often termed 'non-local'.
Fig. 1b shows the device structure for a five input spin majority gate based on the lateral spin valve. The majority function can be employed to perform non-Boolean computations [12, 17]. A clock synchronized operation of the spin majority gate can be compared to that of a neuron, if the output magnet's state is restored after every flip.
The two spin polarization states of the input magnets are analogous to bipolar, binary synapse weights with values +/-1. In this work we propose the use of domain wall magnets as input synapse to realize programmable, bipolar, multi-level weights for a spin based neuron model.
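The majority evaluation with binary ±1 inputs described above can be sketched in a few lines of Python. This is only a behavioural illustration: the function name `spin_majority` and the encoding of "right" spin as +1 and "left" spin as -1 are assumptions made here, and no device physics is modeled.

```python
def spin_majority(input_spins):
    """Return +1 or -1, whichever polarity most of the inputs carry.

    Each entry stands for one input magnet: +1 for "right" spin and
    -1 for "left" spin (an encoding assumed for this sketch).
    An odd number of inputs guarantees that there is no tie.
    """
    if len(input_spins) % 2 == 0:
        raise ValueError("use an odd number of inputs to avoid ties")
    if any(s not in (1, -1) for s in input_spins):
        raise ValueError("each input must be +1 or -1")
    net_spin = sum(input_spins)  # net spin delivered to the output magnet
    return 1 if net_spin > 0 else -1


# Five-input gate, as in the structure of fig. 1b
print(spin_majority([1, 1, 1, -1, -1]))
```

Replacing the fixed ±1 inputs with multi-level values is what turns this gate into the weighted neuron discussed in the following sections.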
To reduce the average current injection per synapse, we incorporate current mode Bennett clocking in the neuron model [9]. It involves switching the nano-magnet to an intermediate meta-stable state from which it can be switched back to one of its stable states with a very small current. In the proposed neuron model the output magnet is switched with non-local spin torque, i.e. with pure spin current. It will be shown that this technique helps in achieving ultra low voltage operation and hence low power consumption.
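The idea that a magnet preset to its meta-stable hard axis state can be tipped into either stable state by a very small spin current can be illustrated with a single-macrospin toy model. The sketch below integrates a Landau-Lifshitz-Gilbert-type equation in reduced units with forward Euler steps. It is not the simulation framework of this paper: the damping `alpha`, the spin torque strength `beta`, the step size, and the omission of thermal noise and demagnetizing fields are all assumptions chosen for illustration.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def relax_from_hard_axis(spin_sign, alpha=0.1, beta=0.01, dt=0.01, steps=20000):
    """Relax a macrospin that starts on the hard axis (y) of a uniaxial magnet.

    The easy axis is x. spin_sign (+1 or -1) sets the polarity of a weak
    damping-like spin torque of strength beta along x. All quantities are
    in reduced units (hypothetical values). Returns the final unit vector m.
    """
    m = (0.0, 1.0, 0.0)               # preset state along the hard axis
    p = (float(spin_sign), 0.0, 0.0)  # polarization of the incoming spin current
    for _ in range(steps):
        h = (m[0], 0.0, 0.0)          # uniaxial anisotropy field along x
        mxh = cross(m, h)
        mxmxh = cross(m, mxh)
        mxmxp = cross(m, cross(m, p))
        # precession + damping toward h + spin torque toward p
        dm = tuple(-mxh[i] - alpha * mxmxh[i] - beta * mxmxp[i] for i in range(3))
        m = tuple(m[i] + dt * dm[i] for i in range(3))
        norm = math.sqrt(sum(c * c for c in m))
        m = tuple(c / norm for c in m)  # keep |m| = 1 after the Euler step
    return m


print(relax_from_hard_axis(+1))  # settles along +x
print(relax_from_hard_axis(-1))  # settles along -x
```

In this toy model the spin torque only needs to break the symmetry at the hard axis; the anisotropy then completes the switching, which is why the final state follows the sign of a torque ten times weaker than the damping term.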
III. SPIN BASED NEURON-SYNAPSE MODEL
In this section we describe the spin based neuron-synapse model. First we discuss the application of the domain wall magnet as a synapse. Following this, we describe the neuron model, which is based on the lateral spin valve structure discussed in section 2.
A. Domain wall magnet as synapse
A domain wall magnet (DWM), shown in fig. 2a, consists of two ferromagnetic domains separated by a non-magnetic region or domain wall (DW). The domain wall is formed due to the balance between the anisotropy and exchange energies present in the nano-magnet.
A domain wall can be moved along a magnetic nano-strip by the application of a magnetic field [18] or by the injection of charge current along the nano-strip [19]. Fig. 2b shows the simulation plot for domain wall velocity vs. injected current density, benchmarked with experimental data in [20]. Application of DWM in the design of non-volatile memory [21] and logic [22] has been explored by several authors. In the present work, we propose the use of DWM as a synapse, where its programmable spin injection strength is used for implementing the spin mode weighting operation.
Fig. 3 shows a domain wall magnet interfaced with the non-magnetic channel of a neuron. In order to write the weight into the DWM, current is injected along the length of the domain wall magnet as shown in fig. 3. Under this condition the channel is kept in a floating state. During computation, the input current is injected into the channel through the domain wall magnet in the vertical direction. Fig. 4 shows the plot of the spin polarization of the current passing into the channel through the DWM vs. domain wall location for different charge current values. It can be observed that the spin polarization strength of the charge current reaching the channel is proportional to the offset of the domain wall location from the centre. For the extreme left location of the domain wall, the charge current reaching the metal channel is maximally left polarized, and vice versa. The net polarization is reduced to zero for the central location of the domain wall, as equal amounts of left and right spin electrons are injected into the channel in this case.
Fig. 4. Spin polarization strength of the current injected through the DWM as a function of DW location.
The thin MgO layer incorporated between the DWM and the channel serves a dual purpose. It enhances spin injection efficiency by reducing the spin resistance mismatch between the channel and the magnet [14, 16]. It also reduces the fringe current passing through the parallel path provided by the floating channel during the write operation. Hence the writing and computation modes are fully decoupled. Note that the interface resistance resulting from the magnet-oxide-metal tri-layer is significantly less than that of a magnet-oxide-magnet tri-layer in a magnetic tunnel junction [14]. Hence there is no significant increase of resistance in the current flow path during computation.
In the simplest case, the two extreme locations of the domain wall can be employed for implementing programmable binary weights. It has been shown that incorporation of nano-scale notches in the DWM strips can enhance the stability of the DW at the notch sites [23]. The incorporation of notches along the length of the DWM synapse can help in achieving a larger number of weight levels with higher writing accuracy. Fig. 5 shows the magnetization state of the DWM at equal time intervals after the application of a train of 100 ps voltage pulses.
The physical interface for the writing and computation modes is described in section 4. Physics based device modeling of these two operations is discussed in section 5.
B. Spin based neuron Model
The transfer function of an 'integrate and fire' neuron is given by eq. 1.
\[ y = f\left( \sum_{i} w_i I_i + b \right) \qquad (1) \]
Here, w_i and I_i are the weights and the corresponding inputs, and b is the neuron bias. The bias can be chosen to be zero; however, it aids training convergence and can easily be implemented by an additional synapse magnet that is driven by a clock. The function f(x) is given by eq. 2 and approximates a step transfer function for a sufficiently large N.
Here t denotes the threshold of the neuron. It can be inferred that a higher |t| would require a larger value of |x| to switch the neuron. For a given set of normalized weights w_i, this translates to larger levels of the input signals I_i. For the spin based neuron model, this implies a larger input current per synapse and hence higher power consumption. Therefore, the switching threshold of the output nano-magnet needs to be reduced. We incorporate current mode Bennett clocking to achieve this.
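The weighted-sum-and-threshold behaviour of the neuron can be sketched as follows. The hard threshold in `fire` follows the description above directly. The smooth function `soft_step` is only a stand-in: the exact form of eq. 2 is not reproduced in this text, so a tanh with a steepness parameter playing the role of N is assumed here purely to show how a smooth function approaches a step. All function names are hypothetical.

```python
import math


def neuron_input(weights, inputs, bias=0.0):
    """Weighted sum of the inputs plus bias: the argument x of f(x)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * i for w, i in zip(weights, inputs)) + bias


def soft_step(x, threshold=0.0, steepness=50.0):
    """Assumed smooth stand-in for f(x); approaches a step as steepness grows."""
    return math.tanh(steepness * (x - threshold))


def fire(weights, inputs, bias=0.0, threshold=0.0):
    """Ideal step neuron: +1 if the weighted sum exceeds the threshold, else -1."""
    return 1 if neuron_input(weights, inputs, bias) > threshold else -1


x = neuron_input([0.5, -0.25, 1.0], [1, 1, 1])
print(x, fire([0.5, -0.25, 1.0], [1, 1, 1]), soft_step(x))
```

Raising `threshold` in this sketch means a larger weighted sum is needed before the neuron fires, which mirrors the argument above for keeping the switching threshold of the output magnet low.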
The device structure for the neuron is shown in fig. 6. The firing magnet forms the free layer of an MTJ. The two anti-parallel, stable polarization states of a magnet lie along its easy axis (fig. 6). The direction orthogonal to the easy axis is an unstable polarization state for the magnet and is referred to as its hard axis [9, 12]. The preset magnet shown in fig. 6 has its easy axis orthogonal to that of the neuron magnet.
At the beginning of a clock period, a current pulse injected through the preset magnet forces the output magnet into the hard axis configuration (fig. 7). As soon as the hard axis biasing pulse goes low, the free layer makes a transition to the easy axis polarity determined by the polarity of the net spin polarized current delivered to it. As a result, the firing magnet, i.e., the free layer of the MTJ, acquires either parallel or anti-parallel polarization with respect to the fixed layer. When the clock is low, the MTJ unit is activated. For a parallel orientation of the free layer it generates a high output, whereas for the anti-parallel orientation it settles to a low value. Hence, the MTJ converts the spin mode information of the firing magnet's state into a charge mode signal. Thus, spin and charge mode evaluations occur in alternate clock phases. In the proposed neuron model, the use of non-local STT switching allows a low resistance path for static charge current flow that includes the DWM synapse and the non-magnetic channel. This allows the application of very small voltages, which in turn results in ultra low energy operation for the magneto-metallic neuron-synapse unit. The detection scheme involves a dynamic CMOS latch, discussed later, which prevents static current injection into the high resistance MTJ stack.
IV. SYSTEM LEVEL INTEGRATION
In this section we present the system level integration, which includes the inter-neuron signaling scheme, power gating of spin devices, and the domain wall writing interface.
A centre-surround layout for a neuron with 12 input synapses is shown in fig. 8. Spin polarized charge current inputs from the DWM synapses combine in the channel and flow into the ground lead (not shown in fig. 8) located below the neuron MTJ. The spin polarization strength of the charge current decays exponentially with the distance travelled along the non-magnetic channel. Thus, the channel length between the synapses and the neuron must be less than one to two times the spin flip length λ. This imposes a limit on the number of input synapses for the structure shown in fig. 8. For a copper channel (λ ~ 1 µm), up to 32 synapses can be combined directly. For a graphene channel (λ ~ 7 µm), this number can be higher.
The spin mode firing information is converted into a charge mode signal using the differential MTJ latch shown in fig. 9a. It compares the effective resistance of the MTJ units in its two load branches. The firing MTJ of the neuron unit connects to one of the loads. Transient simulation plots in fig. 9b show that the latch evaluates correctly for a resistance difference as small as ~5%.
In order to exploit the ultra low voltage operation of nano-magnets, the spin layer is operated between two DC levels with a difference of 30 mV. The DC biasing scheme is shown pictorially in fig. 10. The MTJ outputs of source neurons drive PMOS current source transistors, which in turn supply charge current to all the destination neurons via DWM synapses. The synapse current flows between two DC levels V_high and V_low, where V_high is connected to the source terminals of the PMOS current sources, and V_low is connected to the ground lead of the neuron-synapse units. The CMOS based detection units and clocking circuitry operate between V_low and V_ss. In the present work we chose V_high = 830 mV, V_low = 800 mV and V_ss = 0 V. Charge recycling between the two DC levels V_high and V_low can further reduce the spin mode computation power.
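The exponential decay of spin polarization along the channel, which sets the limit on channel length noted above, can be made concrete with a short calculation. The sketch assumes a simple exp(-d/λ) attenuation and uses the approximate spin flip lengths quoted in the text for copper and graphene; the function name and the choice of a 1 µm example distance are illustrative assumptions.

```python
import math

# Approximate spin flip lengths quoted in the text, in micrometres
COPPER_LAMBDA_UM = 1.0
GRAPHENE_LAMBDA_UM = 7.0


def remaining_polarization(distance_um, spin_flip_length_um):
    """Fraction of the injected spin polarization left after distance_um.

    Assumes plain exponential decay exp(-d / lambda) along the channel.
    """
    if distance_um < 0 or spin_flip_length_um <= 0:
        raise ValueError("distance must be >= 0 and spin flip length > 0")
    return math.exp(-distance_um / spin_flip_length_um)


for name, lam in (("copper", COPPER_LAMBDA_UM), ("graphene", GRAPHENE_LAMBDA_UM)):
    frac = remaining_polarization(1.0, lam)
    print(f"{name}: {frac:.3f} of the polarization remains after 1 um")
```

Under this assumption a synapse placed one spin flip length away from the neuron delivers only about 37% of its injected polarization, which is consistent with the need to keep the synapse-to-neuron distance within one to two times λ.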
Note that, owing to the comparatively large resistance of the source transistors, they account for most of the voltage drop and hence most of the power consumption for the spin mode computation.
As mentioned earlier, during DW writing, the channel is kept floating. For the computation mode, the synapse input current enters the channel through the DWM input lead. Thus, the area of the DWM under the input lead affects the polarization of the input charge current and hence can be regarded as the active synapse area. The writing scheme is shown in fig. 11. A pair of source-select and destination-select lines corresponds to a particular source and destination neuron pair, and hence to the interconnecting synapse weight. Thus, by the selection of the source and destination neuron units, the DWM synapse to be written is identified. Note that this scheme employs only two additional transistors per neuron for selecting and writing into a DWM synapse. The supply lines are driven by pulsed current sources, and the number of current pulses injected is determined by the weight to be written. The writing current pulses are injected into the DWM strip through the programming leads, whereas the input current pulse from the source transistor is delivered through the input lead.
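The selection and pulse-count logic of the writing scheme can be summarized in a small behavioural sketch. The text states only that a source-select and destination-select pair identifies the synapse and that the number of pulses is set by the weight. The rest is assumed here for illustration: that weights are stored as integer levels (for example notch positions), that one pulse moves the domain wall by one level, and that the pulse polarity sets the direction of motion. The function and field names are hypothetical.

```python
def plan_write(source, destination, current_level, target_level):
    """Describe the write operation for one DWM synapse.

    source, destination: indices of the source and destination neurons,
        i.e. which select lines are asserted.
    current_level, target_level: integer weight levels (assumed to map to
        notch positions, one pulse per level).
    Returns a dict with the select lines, pulse count and pulse polarity.
    """
    shift = target_level - current_level
    if shift > 0:
        polarity = 1
    elif shift < 0:
        polarity = -1
    else:
        polarity = 0  # weight already at target, no pulse needed
    return {
        "source_select": source,
        "destination_select": destination,
        "pulses": abs(shift),
        "polarity": polarity,
    }


print(plan_write(source=3, destination=7, current_level=0, target_level=2))
```

A controller built along these lines would iterate over all source and destination pairs of a trained network and issue one such write per synapse.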
V. DEVICE SIMULATION FRAMEWORK
In order to simulate the lateral spin valve structure shown in fig. 1a, we need to self-consistently solve both the transport and the magnet dynamics equations. In our model, the channel spin transport is based on the spin diffusion model developed by Valet and Fert [26]. The magnet-channel interface is modeled based on the interface model developed by Brataas et al. [27]. Both these models are well established and are used for spin transport in long channels [9, 10, 11]. The effect of thermal noise is captured by the stochastic Landau-Lifshitz-Gilbert (LLG) equation in the magnet dynamics and by a temperature dependent spin diffusion length in the channel spin transport. This simulation framework has been benchmarked with experimental results for lateral spin valves [9, 10]. The device model for the domain wall structure is derived from the aforementioned spin diffusion model. It consists of a 2-D grid of nano-magnets obtained by dividing the nano-strip into square grids (10 nm x 10 nm) as depicted in fig. 12. Each nano-magnet is modeled as a Π conductance network with shunt and series components G_0F and G_F, respectively (four component spin transport model), using the Valet-Fert diffusion model [26] and the interface model by Brataas [27]. The resulting spin circuit is shown in fig. 13a. It yields the spin current components at each lattice point for a given input voltage. These spin currents are used to evaluate LLG at each point to capture the nano-magnet dynamics. The conductance matrices depend on the magnetization state of the grid points and hence the spin diffusion transport is solved self-consistently with LLG at each grid point. We benchmarked our simulation framework for DWM with experimental data in [20]. The corresponding plot of DWM velocity as a function of charge current density is shown in fig. 2b.
The effect of the channel interface on the writing process is incorporated by including the nano-magnet-channel interface conductance matrix in series with the channel conductance matrix at each grid point, as shown in fig. 13a. The interface conductance matrix consists of a series combination of a spin dependent and a spin independent conductance component, as described in [16].
As discussed earlier, during computation the input current is injected into the channel through the domain wall magnet in the vertical direction. Hence, the writing and computation modes are fully decoupled. Therefore, for the computation mode, the DWM synapse can be modeled as two parallel nano-magnets with opposite polarities and with areas dependent on the domain wall location, i.e., the weight. The simplified model, shown in fig. 13b, has been used for network simulations.
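The simplified computation-mode model, two parallel nano-magnets of opposite polarity whose areas depend on the domain wall location, leads to a weight that is linear in the DW offset from the centre. The sketch below expresses that relation. The sign convention (negative for "left" polarization, with the DW at the extreme left giving the most negative value) follows the behaviour described for fig. 4; the function name, the one-dimensional treatment of area as length, and the `peak_polarization` parameter are assumptions for illustration.

```python
def dwm_weight(dw_position, active_length, peak_polarization=1.0):
    """Net spin polarization injected through a DWM synapse.

    dw_position: domain wall location, from 0 (extreme left) to
        active_length (extreme right), within the active synapse area.
    The two domains are treated as parallel magnets of opposite polarity
    whose areas are proportional to their lengths under the input lead.
    Returns a value in [-peak, +peak]; negative means "left" polarized.
    """
    if active_length <= 0:
        raise ValueError("active_length must be positive")
    if not 0.0 <= dw_position <= active_length:
        raise ValueError("dw_position must lie within the active length")
    right_polarizing = dw_position                  # portion to the left of the wall
    left_polarizing = active_length - dw_position   # portion to the right of the wall
    return peak_polarization * (right_polarizing - left_polarizing) / active_length


for pos in (0.0, 25.0, 50.0, 100.0):
    print(pos, dwm_weight(pos, 100.0))
```

With the wall at the centre the two contributions cancel and the weight is zero, matching the statement that equal amounts of left and right spin electrons are injected in that case.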
A. Benchmark Applications
We simulated character recognition as a benchmark application for the proposed spin-CMOS hybrid design. The overall process for character recognition can be divided into two steps, namely, edge extraction and pattern matching. For edge extraction, column-wise pixels from the binary image along four directions (horizontal, vertical and ±45°) are fed to the first stage neurons.
These neurons generate a high output if the number of non-zero pixels along a particular column (or equivalently the spin current input I_in to the neuron) is higher than the neuron threshold. Note that the desired threshold for a neuron is set by applying a bias input to it. The horizontal edge extraction process for different input characters is depicted in fig. 14a. Fig. 14b shows the effect of variation in the handwriting style for the numeral '3' on the horizontal bar code. It shows that significant variations in writing style translate to slight variations in the bar code pattern, which can be tolerated by an ANN. Variation tolerance can be enhanced by training with different styles of input characters. The resultant four binary patterns form a 1-D representation of the input character. This pattern is fed to the output stage of the network for classification. The output neurons correspond to the 36 alphanumeric characters. The output evaluation for numeric characters is shown in fig. 14c.
We also simulated Boolean logic blocks with the proposed ANN architecture. For arithmetic computation blocks like multipliers and adders, the required network size grows exponentially beyond an input dimension of 4x4 because of the large training set. Hence, for a larger number of input bits, the overall computation can be decomposed into 3x3 or 4x4 units in order to obtain maximum benefit.
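The edge extraction step, counting non-zero pixels along each scan line and thresholding the count, can be sketched in software as follows. Each thresholded count plays the role of one first stage neuron. The direction names, the mapping of the two diagonals to `diag_plus` and `diag_minus`, and the tiny 3x3 test image are assumptions made for this illustration, not details taken from the benchmark.

```python
def line_counts(image, direction):
    """Count non-zero pixels along each scan line of a binary image.

    image: list of equal-length rows containing 0/1 pixels.
    direction: "horizontal", "vertical", "diag_plus" or "diag_minus"
        (the two diagonal labels are a naming convention assumed here).
    """
    rows, cols = len(image), len(image[0])
    if direction == "horizontal":
        index, size = (lambda r, c: r), rows
    elif direction == "vertical":
        index, size = (lambda r, c: c), cols
    elif direction == "diag_plus":
        index, size = (lambda r, c: r + c), rows + cols - 1
    elif direction == "diag_minus":
        index, size = (lambda r, c: r - c + cols - 1), rows + cols - 1
    else:
        raise ValueError("unknown direction: " + direction)
    counts = [0] * size
    for r in range(rows):
        for c in range(cols):
            counts[index(r, c)] += image[r][c]
    return counts


def edge_barcode(image, direction, threshold):
    """One bit per scan line: 1 if its pixel count exceeds the threshold."""
    return [1 if n > threshold else 0 for n in line_counts(image, direction)]


image = [[1, 1, 1],
         [0, 0, 1],
         [1, 1, 1]]
print(edge_barcode(image, "horizontal", threshold=2))
```

Concatenating the four bar codes gives the 1-D pattern that is passed to the output stage for classification.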
B. Design Performance
In order to establish a comparison with state of the art CMOS technology, we implemented the same network architecture in IBM CMOS 45 nm SOI technology in two different ways, digital and analog. For the digital design, programmable latches were used to store the synapse weights and full adders were employed to implement the neuron. For the analog design, memristive synapses were employed. Resistance values in the range of 10 kΩ to 200 Ω were used to emulate memristors. In this design, analog integrators were employed for modeling the neuron. The area was estimated based on the cross-bar architecture for memristive neural networks [4]. Table 1 compares the two designs with the proposed spin based implementation. N_n denotes the number of neurons and F_s denotes the clock frequency. The digital implementation consumes large area as well as power due to bulky neuron and synapse units. The analog implementation with memristive synapses turns out to be the least efficient in terms of power. However, it achieves a large improvement in area as compared to the digital design due to compact synapses and the cross-bar architecture [4, 5].
The spin-CMOS hybrid implementation achieves both low power and small area. The power and area benefits of the proposed design can be ascribed to simple and compact spin devices that operate at ultra low supply voltages and mimic the neuron operation. Both low energy consumption and compactness are conducive to the integration of a large number of neurons in programmable computational networks for cognitive and Boolean computation. Table 2 and table 3 provide some relevant design details. Finally, table 4 lists some of the critical device parameters used in the simulation.
VII. SUMMARY
Spin device phenomena like majority evaluation, hard axis switching, and the adjustable spin polarization strength of domain wall magnets, combined with an appropriate clocking scheme, can lead to an energy efficient model for the neuron-synapse unit. The localized, ultra low voltage operation of neuron-synapse units, assisted by efficient circuit and architecture level design strategies for signaling, power gating and detection, can facilitate a high degree of integration. The proposed spin-CMOS hybrid ANN design can be suitable for low power, programmable computation architectures for cognitive as well as Boolean applications.
