The energy and delay reductions from CMOS scaling have stagnated, motivating the search for a CMOS replacement. Spintronic devices are one of the promising beyond-CMOS alternatives. However, they exhibit high switching error rates of 1% or more when operated at energy and delay comparable to CMOS, rendering them incompatible with the deterministic nature of digital implementations. In this paper, we employ a Shannon-inspired model of computation to enhance the tolerance of all-spin logic (ASL)-based implementations to gate-level switching errors. We develop logic-level path delay reallocation techniques to shape the output error statistics and propose a novel error compensation scheme to achieve 1000× higher tolerance to device-level switching errors while maintaining the classification accuracy of an ASL-based support vector machine (SVM) classifier.
I. INTRODUCTION
THE PAST few decades have seen tremendous improvement in computational efficiency, due in part to relentless CMOS scaling, which has increased transistor density while reducing switching energy and delay and preserving nearly error-free switching behavior. However, as channel lengths shrink below a few tens of nanometers, the energy and delay reductions have stagnated. Hence, it is of great interest to explore new computational devices and new models of computation that leverage the unique properties of such devices to enable continued computational scaling.
In particular, spin-based computational devices built with nanomagnets and spin-polarized transport have emerged as a viable beyond-CMOS option, due to the following favorable attributes: 1) nonvolatility; 2) higher logical efficiency; and 3) high integration density and compatibility with state-of-the-art back-end electronics manufacturing processes. These devices are a subset of the beyond-CMOS devices, which include devices based on electron spin [1], [2] and magnetoelectric [3], [4] phenomena.
However, spin-based devices are not competitive with CMOS [5] in terms of switching energy and delay, due to their high energy-delay requirements to achieve deterministic switching [6]-[9]. As the switching energy or delay is reduced, their switching error probability increases, rendering them incompatible with the determinism required by digital logic. Hence, multiple research efforts are underway to improve the energy efficiency of spin-based implementations.
Recent attempts at improving the energy efficiency of spin-based implementations focus particularly on exploiting unique attributes of spin-based devices to efficiently implement machine learning algorithms. Examples include exploiting domain wall magnets for analog multiplication [10]-[12], using racetrack memory structures to achieve reconfigurable precision [13], efficient logic operations and data conversion [14], [15], and the analog nature of spin currents for efficient dot-product computation [16]. Recently, researchers have also exploited nanomagnet stochasticity for efficient probabilistic inference implementations. Examples include efficient realization of restricted Boltzmann machines [17], stochastic optimization schemes [18], probabilistic spiking neural networks [19], and stochastic bit-stream computing [20], [21].
In this paper, we explore how one can significantly increase the switching error probability of spin-based logic gates in digital implementations of machine learning classifiers while maintaining their inference accuracy. This problem is akin to the classical problem of achieving reliable computation using unreliable components posed by von Neumann [22], where a reliable logic network was defined as one whose output exhibits a probability of error p_e < 0.5 when designed using ε-noisy logic gates, i.e., gates whose outputs are in error with probability ε. It was further demonstrated that a reliable logic network can be designed for any logic function provided ε ≤ 0.0073 and that it is impossible to do so if ε > 1/6. Later, tighter upper bounds on ε were obtained in a series of papers [23], [24], culminating with those of Evans and Schulman [25]. None of these works considers the fundamental tradeoff between ε, energy, and delay, and all of them assume an identical ε for all gates. Furthermore, they rely on gate-level replication to minimize the error probability of all intermediate binary signals in order to achieve a small p_e, leading to a prohibitive increase in overhead.
In this paper, we employ the Shannon-inspired model of computation [26] to enhance the tolerance of all-spin logic (ASL)-based classifier implementations to gate-level switching errors while maintaining their inference accuracy. In the Shannon-inspired framework, hardware errors are engineered and then efficiently compensated via the introduction of tailored redundancy, in the spirit of Shannon's theory for communications [27] . The contributions of this paper are as follows.
1) We characterize the ε-energy-delay tradeoff for ASL gates to enable nonuniform ε assignments across logic gates.
2) We propose logic-level path delay reallocation techniques to assign appropriate error rates to individual gates, such that the resulting output error distributions are shaped to facilitate error compensation.
3) We propose a novel maximum likelihood (ML) error compensation scheme that exploits these shaped output error statistics to compensate for the errors efficiently.
4) We demonstrate a 1000× higher average error rate tolerance and a 3× lower energy-per-decision for an ASL-based digital support vector machine (SVM) implementation while maintaining its system-level classification accuracy.
The rest of this paper is organized as follows. Section II describes the relevant background, while Section III describes a modified ε-noisy model to capture the gate-level tradeoff between ε, energy, and delay. Section IV describes the proposed Shannon-inspired ASL-based SVM implementation. Section V presents the simulation results, while Section VI concludes this paper.
II. BACKGROUND
A. ALL-SPIN LOGIC DEVICE
Fig. 1(a) shows a diagram of an ASL inverter. It consists of two nanomagnets separated by a conducting channel. The input nanomagnet polarizes the supply current passing through it. This creates a spin concentration gradient and propagates the spin current in the channel. This spin current, in turn, exerts a torque on the magnetization of the output nanomagnet, forcing it to switch.
Since the nanomagnets and the spin channel are metallic, the equivalent electrical resistance across the nanomagnet-channel stack is small (a few Ω), enabling these devices to operate at ultralow supply voltages. However, the electrical current through the input nanomagnet flows irrespective of the output activity, causing high static energy consumption. The nanomagnets, being nonvolatile, retain the magnetization vector state even when the supply current is switched OFF. Hence, [28] and [29] propose to clock these devices via a MOSFET operating in the linear region, which acts as a switch, turning ON the ASL device only when it needs to compute, as shown in Fig. 1(a). The ON duration T_g of the clock can be externally controlled for each gate. Thus, the energy consumption of a clocked ASL gate is completely determined by T_g and the ON current of the gating MOSFET. Fig. 1(b) and (c) shows the logical symbols for the clocked ASL inverter and the three-input majority gate, respectively. Reference [28] proposed to share a single MOSFET across multiple nanomagnets by electrically stacking their supply terminals in series to significantly amortize the clock pulse generation and MOSFET switching overheads. In this paper, we assume the amortization described in [28] and focus on the impact of gate-level switching errors on the final output.
B. SUPPORT VECTOR MACHINE
Linear SVM [30] is a simple and popular machine learning algorithm for binary classification. The SVM learns a hyperplane to separate the training feature vectors into two regions, each corresponding to one class, as shown in the following:

$$\hat{z} = \mathrm{sign}\left(\mathbf{w}^{T}\mathbf{x} + b\right) \tag{1}$$

where w and b denote the weight vector and the bias of the learned hyperplane, x denotes the N-dimensional input feature vector, and ẑ denotes the predicted label. If the true label is denoted by z, the accuracy of the SVM is characterized by the probability of classification error p_e = Pr{ẑ ≠ z}, which can be empirically estimated for a given data set.
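The linear decision rule and the empirical estimate of p_e described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: the function names `svm_predict` and `error_rate` are hypothetical, the labels are assumed to be encoded as {0, 1} (matching the p_TP and p_FA definitions in Section V), and the decision threshold is assumed to be zero.

```python
def svm_predict(w, x, b):
    """Predict a label from the sign of the dot product w.x + b.

    Assumption: labels are encoded as 1 (non-negative side of the
    hyperplane) and 0 (negative side).
    """
    y = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if y >= 0 else 0


def error_rate(predictions, labels):
    """Empirical estimate of p_e = Pr{z_hat != z} over a data set."""
    mismatches = sum(p != z for p, z in zip(predictions, labels))
    return mismatches / len(labels)


if __name__ == "__main__":
    w, b = [1.0, -2.0], -0.5
    samples = [[3.0, 1.0], [1.0, 1.0]]
    print([svm_predict(w, x, b) for x in samples])
```

The dot product is the only part of the classifier that scales with the feature dimension N, which is why the later sections concentrate on making that computation error tolerant.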
C. SHANNON-INSPIRED MODEL OF COMPUTATION
Statistical error compensation (SEC) [see Fig. 2(a)], one class of the design techniques within the Shannon-inspired framework [26], [31], introduces a statistical error compensator block as a decoder, which combines multiple unreliable outputs Y_1, ..., Y_n to compute the corrected output Ŷ. Algorithmic noise tolerance (ANT) [see Fig. 2(c)] is a special case of SEC, where the error compensator combines two unreliable outputs y_a and y_e. ANT consists of a main block (MB), designed using an unreliable/noisy device fabric, that accounts for 85%-90% of the total gate count complexity. It strives to compute the correct output y_o but ends up computing y_a due to the unreliability of the underlying device fabric. ANT augments the MB with a low-complexity estimator that computes an estimate y_e of the correct output y_o. Under the assumption of the additive noise model, the MB and estimator outputs are described as follows:

$$y_a = y_o + \eta, \qquad y_e = y_o + e \tag{2}$$

where η is the system-level hardware error observed at the MB output and e is the estimation error incurred due to the inherently lower complexity of the estimator. The estimator and the error compensator are designed using reliable, and hence energy-inefficient, circuits, constituting the error compensation overhead in ANT. Hence, their combined complexity (in terms of gate count) needs to be significantly (≈5-10×) smaller than that of the MB. Previously, it has been shown that the complexity of the error compensator can be reduced by shaping the distributions of η and e, P_η(η) and P_e(e), respectively, to be disparate from each other, as shown in Fig. 2(b) and (c) [32]-[34]. In particular, a dense P_e(e) is realized by introducing a reduced-precision estimator, while a sparse P_η(η) is realized by permitting MSB errors in LSB-first architectures [33]-[36]. Various design techniques to reduce the overhead of the estimator and the error compensator have been proposed [35]-[38].
D. MUTUAL INFORMATION
The mutual information (MI) I(X; Y) between two random variables X and Y quantifies the amount of information conveyed about X by knowing the value of Y, and vice versa. The MI I(X; Y) is defined as follows:

$$I(X; Y) = H(X) - H(X \mid Y) \tag{3}$$

where H(X) and H(X|Y) denote the entropy of X and the conditional entropy of X given Y, respectively. The entropy H(X) of a random variable X quantifies the uncertainty about the value of X and is a function of its probability distribution.
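The definition of MI as the difference between H(X) and H(X|Y) can be evaluated numerically from a joint probability mass function. The sketch below is illustrative only; the function names and the dictionary representation of the joint distribution are assumptions, and it uses the identity H(X|Y) = H(X, Y) − H(Y).

```python
import math
from collections import defaultdict


def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def mutual_information(joint):
    """I(X; Y) = H(X) - H(X|Y) for a joint pmf {(x, y): probability}."""
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    # Chain rule: H(X|Y) = H(X, Y) - H(Y)
    h_x_given_y = entropy(joint) - entropy(p_y)
    return entropy(p_x) - h_x_given_y


if __name__ == "__main__":
    noiseless = {(0, 0): 0.5, (1, 1): 0.5}
    independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    print(mutual_information(noiseless), mutual_information(independent))
```

A noiseless one-bit channel conveys one full bit, while independent variables convey none; the values of I(Y_a; Y_o) quoted later in the paper lie between these two extremes for the adder output.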
In this paper, we use the MI metric to show that the Shannon-inspired model of computation (see Fig. 2
III. MODELING STOCHASTICITY OF ASL DEVICES
In this section, we develop a gate-level model to capture the inherent device-level stochasticity of ASL at the circuit and architecture levels. Even after receiving the supply current I_ON (> I_crit) at the input nanomagnet, where I_crit denotes the minimum current required for nanomagnetic switching, the output nanomagnet of the ASL gate may not switch due to the presence of Langevin thermal noise [6]-[9]. In this paper, we refer to this probabilistic event as the switching error and to its probability ε as the switching error rate. In [6], an analytical expression for ε was derived by employing the Fokker-Planck equation for the magnetization vector switching dynamics governed by the fundamental LLG equation and was validated against Landau-Lifshitz simulations of a macrospin including an appropriate thermal field. This analysis indicates a gate-level tradeoff between the switching error rate ε, the switching energy E_g, and the switching delay T_g of the ASL gates. Fig. 3 shows the iso-error-rate delay versus energy contours of an ASL inverter at various error rates. As expected, the error rate decreases with increasing energy or delay. In fact, when I_ON ≫ I_crit, the expression for ε [6] can be simplified via the Taylor series approximation (as shown in Section I in the Supplementary Material) to
where β and ζ are the device-dependent constants described in Section I in the Supplementary Material. A three-input majority ASL gate operates with an error rate of ε(E_g, T_g) if all its inputs are equal and with a higher error rate of ε(E_g/3, T_g) otherwise. In this paper, we conservatively upper-bound the error rate of the three-input majority gate by ε(E_g/3, T_g). Equation (4) explains the observed linearity of the contours at higher values of E_g or T_g in Fig. 3. We further note that the ASL inverter consumes 8× more energy than the 20-nm CMOS FO4 inverter [2] at ε = 10^−14 and at identical switching delays. Hence, ASL-based conventional digital architectures remain noncompetitive with respect to present-day CMOS. As ε is increased beyond 1%, the ASL inverter becomes more energy efficient than CMOS, demonstrating the potential for achieving energy efficiency if one can tolerate such high gate-level error rates while maintaining the final system-level accuracy.
We develop a modified ε-noisy gate model [see Fig. 4(a)] to describe a clocked ASL gate, which comprehends its underlying stochastic behavior while being sufficiently abstract to permit the design and the analysis of complex ASL networks. The modified ε-noisy gate model captures:
1) the logic-level manifestation of device-level stochasticity;
2) the input dependence of ASL errors due to the nonvolatility of the nanomagnets, i.e., the ASL gate makes an error only when the output nanomagnet fails to switch when it should, implying a dependence of the error event on the input data; and
3) the role of the CLK terminal in the gate operation.
Fig. 4(b) shows the timing diagram for the modified ε-noisy model. The Boolean inputs A, B, and C are applied at time t. The ASL gate generates its output Y at time t + T_g, where T_g is the switching delay assigned to the ASL gate. The model comprises an ideal noise-free Boolean gate whose output M_t = maj{A_t, B_t, C_t} is EXORed with a Bernoulli random variable θ with parameter ε, i.e., Pr{θ = 1} = ε. The output selector [implemented using a multiplexer in Fig. 4(b)] computes the final output Y_{t+T_g} by choosing either the output of the EXOR gate M_t ⊕ θ or the error-free output M_t. The D flip-flop models the nonvolatility, i.e., the ability to retain the output when CLK = 0. The EXOR gate output is chosen only if Y_t ≠ M_t, capturing the fact that the switching error can occur only if the output nanomagnet is required to switch.
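The behavior of the modified ε-noisy majority gate described above can be simulated in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, one call models one clocked evaluation (CLK = 1 for a duration T_g), and ε is passed in directly rather than derived from E_g and T_g.

```python
import random


def maj(a, b, c):
    """Ideal three-input majority of Boolean values encoded as 0 or 1."""
    return 1 if a + b + c >= 2 else 0


def noisy_majority_step(y_prev, a, b, c, eps, rng):
    """One clocked evaluation of the modified eps-noisy majority gate.

    y_prev models the nonvolatile output held before the clock pulse.
    An error is possible only when the output has to switch.
    """
    m = maj(a, b, c)
    if y_prev == m:
        # Output nanomagnet already holds the right value: no switching,
        # hence no switching error.
        return m
    theta = 1 if rng.random() < eps else 0  # Bernoulli(eps) error event
    return m ^ theta  # theta = 1 means the magnet failed to switch


if __name__ == "__main__":
    rng = random.Random(1)
    trials = 10000
    fails = sum(noisy_majority_step(0, 1, 1, 0, 0.01, rng) != 1
                for _ in range(trials))
    print(fails / trials)
```

Because a failed switch leaves the output at its previous value, `m ^ theta` with θ = 1 returns `y_prev`, which is consistent with the data-dependent error behavior captured by the output selector.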
IV. SHANNON-INSPIRED ASL ARCHITECTURE
In this section, we describe how the Shannon-inspired approach can be applied to clocked ASL networks to increase their tolerance for switching errors. In Section IV-A, we propose the path delay reallocation techniques that exploit the gate-level tradeoff between , E g , and T g to shape the output error statistics and, thereby, ease error recovery. In Section IV-B, we propose a novel fusion block architecture to compensate for the switching errors.
A. SHAPING ERROR STATISTICS
In clocked digital ASL networks, random switching errors occur at the output of every logic gate, as modeled in Section III. The impact of such gate-level errors accumulates as the input propagates to the final output. For example, consider a clocked ASL-based 8-bit ripple carry adder (RCA) in which all ASL gates operate at identical switching delay T_g, identical switching energy per nanomagnet E_g, and, hence, identical ε(E_g, T_g), as shown in Fig. 5(a). The resulting distribution P_η(η) of the output error η for a 15-bit RCA is dense, as shown in Fig. 5(b) and (c), for ε(E_g, T_g) = 10^−2 and ε(E_g, T_g) = 10^−1, respectively. Brute-force compensation of errors having such distributions can be computationally expensive, as discussed in Section IV-B. We propose error statistics shaping techniques to impose a structure on P_η(η) to reduce the complexity of error compensation.
VOLUME 5, NO. 1, JUNE 2019
We exploit the error rate, energy, and delay tradeoff of the clocked ASL gates (shown in Fig. 3) to shape the distribution of the error η. In particular, we control the gate-level switching delay via clock pulsewidth modulation, as described in Section II-A [28], [29]. Exploiting this degree of freedom, we propose two logic-level delay assignment steps, namely, path delay balancing (PDB) and path delay redistribution (PDR). We begin with a logic gate network with all gate delays equal to T_g. Thus, the critical paths are those with the maximum number of gates N_cp and, therefore, have the path delay T_cp = T_g N_cp. In the PDB and PDR steps, the gate delays are reassigned at a constant switching energy (per nanomagnet) of E_g (moving vertically in Fig. 3) and at a constant throughput (identical critical path delay T_cp) as follows.
1) PDB
In PDB, the delays of gates lying on the shorter paths are increased, at a constant energy E_g, so that every gate lies on one or more critical paths. Thus, PDB reduces the error rate of the gates on shorter paths while leaving the original critical path unaltered, which now contains the gates with the highest error rates.
2) PDR
In PDR, the gate delays along all critical paths are redistributed to further enhance the sparsity of P_η(η) while keeping their path delay constant. In particular, the delays of a few gates in the middle of the critical path are increased (lowering ε) at the expense of a reduction in the delays (increasing ε) of the gates lying at the beginning and at the end of the critical path. Such delay redistribution increases the error rates of the top few MSBs and bottom few LSBs while reducing the error rates of the other bits in the middle. Doing so increases the probability of errors having extreme magnitudes (both very high and very low), leading to a highly sparse P_η(η).
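The idea behind the PDB step, lengthening the delays of gates on shorter paths without exceeding the critical path delay T_cp = T_g N_cp, can be illustrated with a simple heuristic. This sketch is not the PDB algorithm of the Supplementary Material: it assigns each gate the delay T_cp divided by the number of gates on the longest path passing through it, which never violates T_cp but, unlike PDB, does not guarantee that every gate ends up on a critical path. The function name and the fanout-dictionary netlist representation are assumptions.

```python
from functools import lru_cache


def balance_path_delays(fanout, t_g):
    """Conservative delay lengthening in the spirit of PDB.

    fanout: {gate: [successor gates]} describing a DAG; every gate
            must appear as a key (sinks map to an empty list).
    t_g:    initial uniform gate delay.
    Returns ({gate: new delay}, critical path delay T_cp).
    """
    fanin = {g: [] for g in fanout}
    for g, successors in fanout.items():
        for s in successors:
            fanin[s].append(g)

    @lru_cache(maxsize=None)
    def depth_to(g):
        # Gates on the longest path ending at g (g included).
        return 1 + max((depth_to(p) for p in fanin[g]), default=0)

    @lru_cache(maxsize=None)
    def depth_from(g):
        # Gates on the longest path starting at g (g included).
        return 1 + max((depth_from(s) for s in fanout[g]), default=0)

    longest_through = {g: depth_to(g) + depth_from(g) - 1 for g in fanout}
    n_cp = max(longest_through.values())
    t_cp = t_g * n_cp
    # Any path through g has at most longest_through[g] gates, each with
    # delay <= t_cp / (its own longest path), so no path exceeds t_cp.
    delays = {g: t_cp / longest_through[g] for g in fanout}
    return delays, t_cp


if __name__ == "__main__":
    netlist = {"a": ["b"], "b": ["c"], "c": [], "d": ["c"]}
    print(balance_path_delays(netlist, 1.0))
```

In the example netlist, gates a, b, and c form the critical path and keep the delay T_g, while gate d sits on a two-gate path and is slowed to 1.5 T_g, which lowers its error rate at constant energy.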
Section IV in the Supplementary Material describes the PDB and PDR algorithms in detail. We define the average device error rate of the clocked ASL network as ε_cp-avg = ε(E_g, T_cp-avg), where T_cp-avg = T_cp/N_cp. Note that T_cp-avg = T_g when all gates on the critical path have equal delay. Fig. 6(a) illustrates the spatial distribution of gate-level switching error rates (employing the color code from Fig. 3) for an 8-bit clocked ASL-based RCA after applying both PDB and PDR. The resulting P_η(η) for a 15-bit RCA subject to PDB and PDR is shown in Fig. 6(b) and (c) for ε_cp-avg = 10^−2 and ε_cp-avg = 10^−1, respectively. Compared to the distributions in Fig. 5(b) and (c), the distributions in Fig. 6 ) = 13.98 bits, which drops to 6.18 bits when all gates operate at an identical error rate of ε_cp-avg = 10^−1. The resulting P_η(η) in Fig. 5(c) is dense. The shaped error statistics in Fig. 6(c) enhance the MI I(Y_a; Y_o) to 11.15 bits. Note that there exist multiple methods of shaping P_η(η) to increase the MI. Furthermore, a high value of I(Y_a; Y_o) only guarantees the existence of an error compensation scheme that reliably recovers y_o from y_a. However, such a scheme need not be efficient. In Section IV-B, we derive a near-optimal low-complexity error compensation scheme that exploits the sparsity of P_η(η).
B. MAXIMUM LIKELIHOOD ERROR COMPENSATOR
The role of the fusion block in SEC is to compute the estimate ŷ of the correct output y_o, as a function of two error-prone observations y_a and y_e [see Fig. 2(a)
and we employ parametric models for P_η(η) and P_e(e) [35], as shown in Fig. 7(a) and (b), respectively, to simplify (6) to
with η̂ given as
where
and f_e denotes a functional description of P_e when |e| < L. A detailed derivation of (7) and (8) is given in Section I in the Supplementary Material. Given y_a and y_e, a brute-force computation of the ML estimate ŷ requires evaluating (7) by calculating the RHS of (8) for every η_i and selecting the η_i = η̂ that maximizes it. Fig. 7(c) plots f_c(y_ae, η) as a function of y_ae for all values of η. It can be observed that η̂ can be approximately computed via comparisons of y_ae with thresholds τ_i. Thus, the ML error compensator has a decision tree structure, as shown in Fig. 7(d), and is henceforth referred to as the TreeCompensator. The thresholds τ_i in the TreeCompensator are functions of the error distributions P_η and P_e. For a given implementation, these distributions can be characterized once, either during simulations or during a one-time calibration phase of the prototype chip. Once the thresholds are computed offline and stored, the TreeCompensator can be implemented efficiently using only a few subtracters.
C. DIGITAL CLOCKED ASL-BASED DOT-PRODUCT IMPLEMENTATIONS
Fig. 8(a) shows the conventional serial architecture of a 120-D SVM classifier. It employs 8-bit signed Baugh-Wooley multipliers (BWMs) and a carry save adder (CSA). All gates in this architecture operate at identical error rates. The Shannon-inspired architecture in Fig. 8(b) employs the conventional serial architecture as the MB and applies PDB and PDR to shape its output error distribution. Since the PDB and PDR techniques make some gates operate at a lower error rate, a few reliable intermediate signals in the BWMs can be employed as estimates of the BWM outputs, indicated via the green reduced-precision embedded estimator (RPE-EST) blocks in the BWMs, similar to the techniques discussed in [36] to reduce the estimator overhead. The additional overhead consists of a CSA and a digital clocked ASL implementation of the TreeCompensator derived in Section IV-B to compute the error-compensated output ŷ.
The bit precisions in the estimator and the compensator blocks are primarily dictated by the number of dominant peaks in the sparse η distribution of the MB. The CSA and the compensator overhead amount to 11% of the gate complexity of the MB. We assume a low error rate ε = 10^−4 ε_cp-avg for all the gates in the CSA and the TreeCompensator [marked green in Fig. 8(b)]. We assume that the TreeCompensator computation can be pipelined since it operates only on the final outputs of the MB and the estimator. This allows the gates in the TreeCompensator to operate at lower energy since its critical path is shorter than that of the MB. More details of the Shannon-inspired architecture are described in Section II in the Supplementary Material.
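The brute-force ML compensation that the TreeCompensator of Section IV-B approximates can be sketched under the additive model y_a = y_o + η, y_e = y_o + e. This is an illustration under explicit assumptions, not the parametric form used in this paper: η and e are assumed independent, all values of y_o are assumed equally likely, P_η is given as a sparse table, and P_e is given as a function. The names `ml_compensate` and `uniform_pe` are hypothetical. Since y_ae = y_a − y_e = η − e, each candidate η implies e = η − y_ae, and the candidate maximizing P_η(η) P_e(η − y_ae) is subtracted from y_a.

```python
def ml_compensate(y_a, y_e, p_eta, p_e):
    """Brute-force ML estimate of y_o from y_a and y_e.

    p_eta: {error value: probability}, the sparse MB error pmf.
    p_e:   function returning the estimator error probability P_e(e).
    Assumes independent eta and e and a uniform prior on y_o.
    """
    y_ae = y_a - y_e  # equals eta - e under the additive model
    eta_hat = max(p_eta, key=lambda eta: p_eta[eta] * p_e(eta - y_ae))
    return y_a - eta_hat


def uniform_pe(limit):
    """Hypothetical dense estimator error pmf, uniform over |e| < limit."""
    width = 2 * limit - 1
    return lambda e: 1.0 / width if abs(e) < limit else 0.0


if __name__ == "__main__":
    p_eta = {0: 0.9, 256: 0.05, -256: 0.05}  # illustrative sparse pmf
    p_e = uniform_pe(16)
    y_o = 100
    print(ml_compensate(y_o + 256, y_o + 3, p_eta, p_e))
```

When the peaks of P_η are far apart relative to the spread of P_e, only one candidate η has nonzero likelihood for a given y_ae, which is why comparing y_ae against a few stored thresholds can replace the search.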
V. SIMULATION RESULTS
We demonstrate the benefits of the Shannon-inspired model of computation for a digital clocked ASL architecture of an SVM classifier used for electroencephalogram (EEG)-based seizure detection. The accuracy of the classifier is captured in terms of the true positive (TP) rate p_TP and the false alarm (FA) rate p_FA, where p_TP = Pr{ẑ = 1|z = 1} and p_FA = Pr{ẑ = 1|z = 0}, and the probabilities are estimated empirically (via leave-one-out cross validation) [39] for the MIT-CHB EEG data set [40] by running extensive Monte Carlo simulations. We compare the Shannon-inspired architecture [see Fig. 8(b)] with: 1) the clocked ASL-based conventional serial architecture [see Fig. 8(a)] consisting of 54 332 gates; 2) a clocked ASL-based 3-MR architecture that replicates the conventional serial architecture thrice and takes a bitwise majority vote on their outputs; and 3) a 20-nm LV CMOS architecture that consists of the exact same full adder-level logic network as that of the serial architecture. We compare the p_TP versus energy per decision and ε_cp-avg tradeoffs at a fixed decision delay of 9.7 ns and p_FA = 1%. The detailed simulation methodology is described in Section III in the Supplementary Material.
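Two of the quantities used in this comparison are simple enough to state in code: the bitwise majority vote of the 3-MR baseline and the empirical p_TP and p_FA. The sketch below is illustrative only; the function names are hypothetical, outputs are assumed to be non-negative integers for the vote, and labels are assumed to be encoded as {0, 1}.

```python
def bitwise_majority(a, b, c):
    """Bitwise majority vote over three replica outputs (3-MR)."""
    return (a & b) | (b & c) | (a & c)


def tp_fa_rates(predictions, labels):
    """Empirical p_TP = Pr{z_hat=1 | z=1} and p_FA = Pr{z_hat=1 | z=0}."""
    positives = [p for p, z in zip(predictions, labels) if z == 1]
    negatives = [p for p, z in zip(predictions, labels) if z == 0]
    p_tp = sum(p == 1 for p in positives) / len(positives)
    p_fa = sum(p == 1 for p in negatives) / len(negatives)
    return p_tp, p_fa


if __name__ == "__main__":
    print(bin(bitwise_majority(0b1100, 0b1010, 0b0110)))
    print(tp_fa_rates([1, 0, 1, 1], [1, 1, 0, 0]))
```

The vote corrects a bit only when at most one replica is wrong in that bit position, which is one way to see why replication alone tolerates a much lower ε_cp-avg than compensation that exploits the error statistics.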
A. ACCURACY VERSUS ε_cp-avg AND ENERGY TRADEOFF
We observe in Fig. 9(a) that the Shannon-inspired architecture [see Fig. 8(b)] can tolerate a 1000× higher ε_cp-avg than the conventional serial architecture [see Fig. 8(a)] while maintaining p_TP close to that of the fixed-point ideal error-free architecture. In particular, p_TP for the Shannon-inspired architecture is close to 93% even when ε_cp-avg is as high as 1%. The 3-MR architecture tolerates an ε_cp-avg of up to 0.01%. This is higher than what the serial architecture tolerates but 100× lower than what the Shannon-inspired architecture tolerates. Furthermore, we show that the intermediate estimator-only output [y_e in Fig. 8(b)] achieves lower accuracy, emphasizing the requirement to combine the two erroneous outputs [y_a and y_e in Fig. 8(b)] to achieve close-to-ideal accuracy.
The Shannon-inspired architecture achieves a 3× lower energy compared to the conventional serial architecture [see Fig. 9(b) ] while maintaining p TP = 93%. The 3-MR architecture, however, consumes 2.3× more energy than the serial architecture even though it operates at higher device error rate. This is because the energy overhead of replication offsets the energy reduction achieved by operating at a higher device error rate. However, despite its high error tolerance, the Shannon-inspired architecture still requires 1.7× more energy compared to the 20-nm LV CMOS architecture, pointing to the need to explore devices with improved energy versus error rate tradeoffs and/or the use of increasingly powerful SEC techniques [41] - [43] . We also note in Fig. 9(b) that the estimator block [consisting only of the green CSA block in Fig. 8(b) ] consumes 20% of the total energy [''Estimator Only'' curve in Fig. 9(b) ].
The reason for the effectiveness of the Shannon-inspired model in compensating for errors is the enhancement in MI I (Y o ; Y a ) due to error statistics shaping via PDB and PDR, as shown in Fig. 9(c) 
B. IMPACT OF NONIDEALITIES AND PROCESS VARIATIONS
Next, we evaluate the tolerance of the proposed Shannon-inspired architecture to various practical nonidealities, such as a finite number of distinct clock pulsewidths, process variations, and clock pulsewidth variations. While PDB and PDR can potentially assign a unique delay to each gate, in practice, those delays need to be further quantized to take one value out of the finite set of available distinct clock pulsewidths. Fig. 10(a) shows the p_TP versus ε_cp-avg curves for the Shannon-inspired architecture after quantizing the ideal clock pulsewidths to 46 distinct pulsewidths for the SVM implementation [see Fig. 8(b)] consisting of 54 332 gates. The number of distinct clock pulsewidths is of the same order as the number of gating domains explored in [28]. We observe negligible deterioration in the accuracy of the Shannon-inspired architecture (in the ε_cp-avg < 1% regime). Such gate clock pulsewidth quantization enables amortization of the clock pulse generation circuitry, including the sharing of the clocking transistors across different nanomagnets [28]. The clock network design is further simplified, since the quantized clock pulsewidths are integer multiples of the shortest reference clock, and multiple parallel dot products (in applications such as filter banks and neural networks) can share a single clock generation circuitry.
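The pulsewidth quantization described above can be sketched as rounding each ideal gate delay to an integer multiple of a reference clock period. This is a minimal illustration with assumptions that the source does not specify: the available pulsewidths are taken to be the consecutive multiples 1 to `n_levels` of the reference period, rounding is to the nearest level, and the function name is hypothetical.

```python
import math


def quantize_pulsewidths(delays, t_ref, n_levels):
    """Round each delay to the nearest k * t_ref with 1 <= k <= n_levels.

    Assumption: levels are consecutive integer multiples of t_ref and
    nearest-level rounding is used.
    """
    quantized = []
    for t in delays:
        k = math.floor(t / t_ref + 0.5)   # round half up
        k = min(max(k, 1), n_levels)      # clip to the available levels
        quantized.append(k * t_ref)
    return quantized


if __name__ == "__main__":
    ideal = [1.0, 1.4, 2.6, 100.0]
    print(quantize_pulsewidths(ideal, t_ref=1.0, n_levels=46))
```

Rounding a delay down raises that gate's error rate and rounding up lengthens its paths, so a practical flow would re-check the critical path delay after quantization; that check is omitted here.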
Process variations present an additional challenge in beyond-CMOS systems. We evaluate the tolerance of the Shannon-inspired approach to static within-die variations in three device parameters, namely, the energy barrier E_b and the damping coefficient α of the nanomagnets and the clocking transistor ON current I_ON. We observe in Fig. 10(b) that the Shannon-inspired architecture with quantized clock pulsewidths can tolerate 3σ/µ variations of up to 24% in each of the three device parameters. When dynamic variations in the clock pulsewidths are included in addition to their quantization and process variations, we find in Fig. 10(c) that the Shannon-inspired architecture can tolerate a maximum deviation (β) of 20% of the minimum clock pulsewidth (T_g,min).
VI. DISCUSSION
In this paper, we demonstrated the benefits of employing the Shannon-inspired model of computation to enhance the tolerance of digital clocked ASL implementations to random gate-level switching errors. While it improves the energy efficiency of digital clocked ASL-based implementations, the same approach can be applied to many other spintronic devices, such as MESO [3] and CoMET [4], as long as they use nanomagnet switching for information processing. The Shannon-inspired techniques have previously been applied to CMOS implementations to further reduce their energy consumption via voltage overscaling [32], [42]. In contrast, ASL/spintronics provides a new way of trading off stochasticity against energy by realizing this energy-accuracy tradeoff at the device level. The Shannon-inspired approach can enhance the ability to perform reliable computation on stochastic device fabrics, enabling the use of highly error-prone but scalable physical devices.
