Abstract-Mixed-signal machine-learning classification has recently been demonstrated as an efficient alternative to classification with power-hungry digital circuits. In this paper, a high-COnfidence high-REsolution (CORE) mixed-signal classifier is proposed for classifying high-dimensional input data into a multi-class output space with less power and area than state-of-the-art classifiers. High-resolution multiplication is performed within a single MOSFET by feeding the features and feature weights into, respectively, the body and gate inputs. A high-resolution classifier that considers the confidence of the individual predictors is designed at the 45 nm technology node and operates at 100 MHz in the subthreshold region. To evaluate the performance of the classifier, a reduced MNIST dataset is generated by downsampling the MNIST digit images from 28 × 28 features to 9 × 9 features. The system is simulated across a wide range of process, voltage, and temperature (PVT) variations, exhibiting a nominal accuracy of 90%, energy consumption of 6.2 pJ per classification (over 45 times lower than state-of-the-art classifiers), area of 2,179 µm² (over 7.3 times lower than state-of-the-art classifiers), and a stable response under PVT variations.
Index Terms-machine learning hardware, mixed-signal classifiers, confidence-level, high resolution, high-dimensional data, multi-class classification, linear classifiers, logistic regression, subthreshold.
I. INTRODUCTION
Low-power machine learning (ML) classifiers play an important role in enabling edge computation of ML algorithms. A wide variety of applications benefit from these compact classifiers, such as Internet of Things (IoT) devices, wireless sensor networks (WSNs), smart home technologies, autonomous transportation, and security systems.
Existing on-chip classifiers can be categorized into two major domains: digital and mixed-signal [1]. A digital classifier is typically fed with binary inputs (i.e., features) and uses binary feature weights, all obtained by sampling and quantizing the corresponding analog signals. The classification accuracy of digital classifiers increases with the number of bits assigned to the features and weights. These highly accurate digital classifiers, however, exhibit significant power consumption and physical size and are often not suitable for power-limited applications, such as battery-powered sensors and other IoT devices that are powered wirelessly or from harvested energy. Alternatively, mixed-signal classifiers aim to reduce the area and power consumption of conventional digital classifiers by directly using the analog input data for classification [2]. The need for data conversion with power-hungry analog-to-digital converters (ADCs) is therefore mitigated with mixed-signal classifiers [2].
Another concern in modern classifiers is the high dimensionality of data. Classifying data in a high-dimensional space often results in prohibitively high data movement between memory and computing circuit components. This, in turn, significantly increases power consumption in both mixed-signal and digital classifiers [3]-[15]. To reduce the data movement, several approaches for in-memory ML computation have recently been proposed. Recent state-of-the-art in-memory classifiers typically exhibit accuracy between 90% and 96% and energy dissipation in the range of 210 pJ/decision to 879 pJ/decision [16]-[19] for typical image recognition datasets. Emerging device technologies are also being considered as power and area efficient alternatives to conventional CMOS based classifiers. Accuracy of 90% and energy of 25 pJ per decision have recently been reported in [20], [21].
To enable high-resolution feature-weight multiplication, a theoretical framework that comprises circuits, models, and algorithms is proposed. To the best of the authors' knowledge, this paper is the first to report a mixed-signal high-resolution classifier that utilizes MOSFET body terminals. The schematic of the proposed configuration is shown in Fig. 1. With this approach, the body bias of each MOSFET is controlled by an individual ML feature, the gate input is fed by the absolute value of the corresponding feature weight, and the sign of the weight is taken into account by using separate lines for the positive and negative feature-weight products. A K-class classification with N features is therefore realized with an N-row, K-column multiplication and accumulation (MAC) array, where each column serves as an independent binary classifier. These individual binary classifiers are combined using the one-versus-all technique [22], requiring (K − 1)/2 times fewer binary classifiers than the state-of-the-art classifier in [16] and orders of magnitude fewer transistors than other existing mixed-signal classifiers [17], [19].
Another primary contribution is in the decision making domain. With analog classifiers, binary predictions are typically made based on the relative magnitude of the signals on the positive and negative sensing lines. With this traditional approach, the sensing line with the highest voltage drop is assumed to indicate the correct classification, and the confidence level of the decision is not considered. A small difference between these voltage drop values, however, indicates a low confidence level of the individual binary prediction and thus a higher probability of an erroneous final decision. To address this ML integrity issue, a confidence driven classification is proposed in this paper. With this approach, the difference between the magnitudes of the sensing line voltage drops is used to capture the confidence of the individual predictions.
Finally, the proposed system is designed in the subthreshold region, providing a power efficient alternative to traditional classifiers. The classifier is demonstrated on the Modified National Institute of Standards and Technology (MNIST) dataset [23] of 10-class digit images. Based on SPICE circuit level simulation results, MNIST data is classified with 90% accuracy using 81 × 10 transistors, with an energy consumption of only 6.2 pJ per decision.
The rest of the paper is organized as follows. In Section II, the proposed high-resolution binary classifier and linearization technique are described. Fabrication considerations are also discussed in this section. Based on the proposed binary classifier, a multi-class high-resolution classifier is designed and demonstrated with the MNIST dataset, as described in Section III. Confidence driven classification is also explained in Section III. Circuit design and simulation results of the multi-class high-COnfidence high-REsolution (CORE) classifier using the one-versus-all technique are presented in Section IV. The paper is summarized in Section V.
II. THE PROPOSED LINEAR BINARY CLASSIFIER
In this section, the proposed linear binary classifier is described. The software level design framework is provided in Section II-A. The circuit level and fabrication level considerations are presented in, respectively, Sections II-B and II-C.
A. Design Framework
Reliability, power consumption, and physical size of on-chip classifiers are all primary concerns in modern ML ICs. The proposed framework is designed to meet the accuracy specifications of modern classification problems in a cost effective manner. Linear algorithms are exploited in this paper for training a supervised binary classifier, optimizing the system for linearly separable input data. With a multivariate linear classifier, the system response Z is a linear combination of N input features $x = (x_1, x_2, ..., x_N)$ and model weights $w = (w_1, w_2, ..., w_N)$,
$$Z = \sum_{i=1}^{N} w_i x_i. \tag{1}$$
The model weights are determined during supervised training by minimizing the prediction error between the system response, Z, and the corresponding true value in the labeled training dataset. Logistic regression (LR), a common supervised linear ML model, is used for training the proposed classifier based on the gradient descent algorithm [24]. LR is preferred due to its simple implementation and superior performance on the MNIST dataset as compared with other classifiers. In inference, a probability threshold of 0.5 is used for predicting the system response to input data, exhibiting a simple on-chip implementation,
The described logistic regressor with the probability threshold of 0.5 is referred to as the logistic classifier. The accuracy of the proposed logistic classifier is evaluated as the percentage of correct predictions out of the total number of test predictions. The proposed ML flow and the preprocessing steps are explained below.
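The logistic classifier described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training code: the function names, the learning rate, the number of epochs, the absence of a bias term, and the toy data are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    # System response Z: linear combination of weights and features.
    return sum(wi * xi for wi, xi in zip(w, x))

def train_logistic(X, y, lr=0.1, epochs=300):
    # Batch gradient descent on the mean cross-entropy loss (no bias term, by assumption).
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = sigmoid(dot(w, xi)) - yi
            for j in range(n):
                grad[j] += err * xi[j]
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict(x, w, threshold=0.5):
    # Probability threshold of 0.5, as used in inference.
    return 1 if sigmoid(dot(w, x)) >= threshold else 0

# Toy, linearly separable data (hypothetical, for illustration only).
random.seed(0)
w_true = [1.5, -2.0, 0.5, 1.0]
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [1 if dot(w_true, x) > 0 else 0 for x in X]

w = train_logistic(X, y)
accuracy = sum(predict(x, w) == t for x, t in zip(X, y)) / len(y)
```

Here `accuracy` is the fraction of correct predictions, matching the accuracy definition given above.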
1) Dataset:
The MNIST database is a large image dataset, commonly used for evaluating the effectiveness of ML hardware. MNIST contains images of 70,000 handwritten digits, ranging from 0 to 9. Each digit comprises 784 (28 × 28) image pixels. The training and test datasets comprise, respectively, 60,000 and 10,000 digits. Out of the 60,000 training observations, 45,000 and 15,000 digits are used for, respectively, training and validating the proposed system.
2) Feature selection and downsampling: Each image pixel of the individual digits in the training set is considered as an ML feature and used for training the classifier. To reduce the power and area overheads, redundant features that are not essential for digit classification are eliminated. To determine the preferred number of observed features, the dataset is downsampled to N ≤ 784 features (N = 6², 7², ..., 28²) and the classification accuracy is obtained for the downsampled data in Python. To efficiently classify MNIST digits with the proposed classifier, N = 81 (9 × 9) is preferred, corresponding to 90% accuracy. The original and downsampled (N = 81) digit images are shown in Fig. 2.
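The downsampling step can be illustrated with the sketch below. The text does not state which resampling method is used, so block averaging is an assumption here, as are the function names and the synthetic input image.

```python
def ceil_div(a, b):
    # Integer ceiling division for positive integers.
    return -(-a // b)

def downsample(image, out_size):
    # Average each output pixel over the input block it covers (assumed method).
    in_size = len(image)
    out = []
    for r in range(out_size):
        r0 = (r * in_size) // out_size
        r1 = max(r0 + 1, ceil_div((r + 1) * in_size, out_size))
        row = []
        for c in range(out_size):
            c0 = (c * in_size) // out_size
            c1 = max(c0 + 1, ceil_div((c + 1) * in_size, out_size))
            block = [image[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Synthetic 28 x 28 image standing in for an MNIST digit (hypothetical data).
image = [[(i + j) % 256 for j in range(28)] for i in range(28)]
small = downsample(image, 9)

# Flatten to the N = 81 feature vector used by the classifier.
features = [p for row in small for p in row]
```

Because 28 is not a multiple of 9, neighboring blocks overlap by one pixel in this sketch. Sweeping `out_size` from 6 to 28 reproduces the range of N considered in the text.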
B. Circuit Level Considerations
The primary goal in a linear binary classification is to accurately and efficiently perform the dot product operation of the features and feature weights, as described in (2). To simplify the circuit level design, the result shown in (2) is formulated as the signed addition of positive, $Z^{+}$, and negative, $Z^{-}$, feature-weight products,
The individual positive and negative feature-weight products are accumulated within the positive, $V_{sen}^{+}$, and negative, $V_{sen}^{-}$, sensing lines, yielding the basic ML multiplication and accumulation (MAC) operation, as shown in Fig. 1. For each feature-weight multiplication, a single MOSFET is connected to either the positive or negative sensing line, as determined by the sign of the corresponding feature weight. For example, for $w_1 < 0$, the corresponding multiplier MOSFET is connected to the negative sensing line.
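The split accumulation can be mimicked in software as follows. The sketch assumes the convention that both partial sums are non-negative magnitudes and that the overall response is their difference; the function name and the numeric values are hypothetical.

```python
def split_mac(features, weights):
    # Accumulate |w| * x on the positive or negative line according to the sign of w.
    z_pos = sum(abs(w) * x for x, w in zip(features, weights) if w > 0)
    z_neg = sum(abs(w) * x for x, w in zip(features, weights) if w < 0)
    return z_pos, z_neg

# Hypothetical features and signed weights.
x = [0.2, 0.8, 0.5, 0.1]
w = [0.5, -1.0, 0.25, -2.0]

z_pos, z_neg = split_mac(x, w)
z = z_pos - z_neg  # equals the signed dot product of w and x
```

Each term in `z_pos` or `z_neg` corresponds to one multiplier MOSFET attached to the respective sensing line.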
To capture the voltage drops across the sensing lines, sensing capacitors (i.e., $C_{sen}$) are connected to the individual sensing lines. The size of a sensing capacitor is determined in proportion to the sensing line current, which increases with the number of transistors (i.e., features) connected to the line. To classify tasks with a higher number of features, larger sensing capacitors are therefore required, limiting the scalability of the system. To provide a power efficient and scalable solution, the transistors are biased in the near/subthreshold operation region, significantly limiting the current through the sensing lines. In the case of MNIST classification with 9 × 9 features, a capacitance of only 50 fF is utilized per sensing line.
A primary concern with near/subthreshold operation is the exponential dependence of the drain current on the body and gate biases [25],
where $I_t$ is the subthreshold current at $V_{gs} = V_{th}$, n is the subthreshold slope, and $V_T$ is the thermal voltage. Note that the body voltage dependence is embedded in the threshold voltage,
Considering that $V_{DS} \gg V_T$, the expression in (4) can be simplified as,
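For reference, a commonly used textbook form of the subthreshold drain current that is consistent with the symbols defined above, together with its simplification for $V_{DS} \gg V_T$, can be written as follows. This is a sketch under the stated symbol definitions and may differ in detail from the expressions numbered (4) and (6).

```latex
I_D = I_t \exp\!\left(\frac{V_{gs} - V_{th}}{n V_T}\right)
      \left(1 - \exp\!\left(-\frac{V_{DS}}{V_T}\right)\right)
\quad\xrightarrow{\;V_{DS} \gg V_T\;}\quad
I_D \approx I_t \exp\!\left(\frac{V_{gs} - V_{th}}{n V_T}\right)
```

The second factor approaches unity once $V_{DS}$ exceeds a few thermal voltages, which leaves only the exponential dependence on the gate bias and, through $V_{th}$, on the body bias.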
To mitigate the non-linear dependence of the drain current on the weight-feature dot product (see (1)), a novel training flow is proposed (see Fig. 3). To account for the non-linear dependence of the drain current on the body bias voltage, the model is trained with the square root values of the default features ($x_i \to \sqrt{x_i}$). Thus, the extracted feature weights, w, are optimized for classifying the MNIST dataset transformed into a half-order polynomial space. Similarly, to account for the non-linear dependence of the drain current on the gate voltage, the feature weights are logarithmically adjusted, which is expressed as,
In inference, the current model is exploited for making predictions based on the square root values of the original features, as trained offline, yielding 90% accuracy across the MNIST test set, as detailed in the following sections.
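The feature transformation used in both training and inference can be sketched as below. Normalizing the pixel intensities to the range [0, 1] before taking the square root is an assumption of this example, as are the function names and sample values.

```python
import math

def normalize(pixels, max_value=255):
    # Scale raw pixel intensities to [0, 1] (assumed preprocessing step).
    return [p / max_value for p in pixels]

def sqrt_transform(features):
    # Map each feature x_i to sqrt(x_i), i.e., into half-order polynomial space.
    return [math.sqrt(x) for x in features]

# Hypothetical raw pixel values.
pixels = [0, 64, 255, 16]
x = sqrt_transform(normalize(pixels))
```

The same transformation has to be applied to the training set and to every input seen in inference, since the weights are optimized for the transformed features.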
C. Fabrication Costs
In the proposed linear binary classifier, the body and gate terminals are fed by, respectively, the input features and the corresponding feature weights. Each multiplication is, therefore, executed by a single MOSFET, significantly reducing the power and area costs (despite the overhead of the triple-well technology) and the complexity (as determined by the number of transistors) of the classifier in comparison to the existing state-of-the-art mixed-signal classifiers [2], [16].
The conventional twin-well fabrication process is illustrated in Fig. 4(a). This process is designed to provide a single voltage connection to all the n-type and p-type body terminals. To connect the body terminals of the individual multiplier transistors to different voltage levels, a specialized fabrication process is required. One way to individually bias numerous body terminals is to fabricate with a triple-well process (see Fig. 4(b)), which is commonly used in high-performance, low-power ICs [26], [27] and for reducing substrate noise in mixed-signal circuits [28]. In a p-substrate triple-well process, an additional deep n-well is used to isolate the p-well of each MOSFET from the p-substrate, allowing an independent body terminal connection for each MOSFET. The triple-well structure has been demonstrated to provide better noise characteristics as compared with the traditional twin-well structure, without increasing the gate leakage [29]. The triple-well structure, however, incurs additional fabrication costs and area overheads that need to be considered. The layout of a four-transistor block in the twin-well and triple-well processes is shown in, respectively, Fig. 4(c) and Fig. 4(d). With the triple-well configuration, the area is increased 3 times as compared with the twin-well process in 45 nm CMOS technology.
III. CORE BASED MULTI-CLASS CLASSIFIER
A multi-class classifier is designed based on multiple linear binary classifiers, as presented in Section II. A confidence driven approach that addresses the integrity of multi-class classification is described in Section III-A. The transistor level implementation of the proposed CORE classifier is presented in Section III-B.
A. Confidence Driven Classification: OVO versus OVA
Two typical approaches for designing a multi-class classifier based on multiple binary classifiers are one-versus-one (OVO) and one-versus-all (OVA) [30]. With the OVO approach, all pairwise combinations of the output classes are evaluated with the individual binary classifiers. Thus, a K-class classification with the OVO approach requires K(K − 1)/2 binary classifiers, increasing the system complexity and the power and area costs quadratically with the number of classes. For example, for classifying 10 digits with the OVO approach, 45 i-versus-j (i, j ∈ {0, 1, 2, ..., 9}) binary classifiers are required. Similarly, in the case of a 1,000-class classification, about half a million (499,500) binary classifiers are required. The final decision with the OVO technique is extracted using the majority voting approach [31]: each binary classifier votes independently for a certain class and the final decision is made based on the class with the highest number of votes. Alternatively, with the OVA approach, each binary classifier discriminates between a single class and the rest of the classes. The required number of binary classifiers with OVA increases linearly with the number of classes, facilitating a more power and area efficient classification of high-dimensional data. In addition, a probability score of correct discrimination is inherently provided with OVA and can be extracted for the individual binary classifiers. Thus, a K-class OVA classifier can be designed with as few as K binary classifiers, seamlessly accounting for the confidence level of the individual predictions. The confidence driven decision, d, with the OVA classifier is determined as,
$$d = \arg\max_{i} \, p_i, \tag{8}$$
where $p_i$ is the confidence level of the i-th binary classifier. From a circuit level perspective, the confidence level with the OVA approach can be determined as the difference of the voltage drops across the positive and negative sensing lines, e.g., $p_i \propto \Delta V_{sen}^{+}(i) - \Delta V_{sen}^{-}(i) \equiv \Delta V_{sen}(i)$.
To extract the class with highest confidence level, a lightweight comparator is designed, as described in the next subsection, yielding a power and area efficient alternative for the traditional voter circuits.
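The classifier counts and the confidence driven OVA decision discussed in this subsection can be checked with a short sketch. The function names and the voltage drop values (in mV) are hypothetical, and the confidence is taken to be exactly the difference of the two voltage drops rather than merely proportional to it.

```python
def num_ovo_classifiers(k):
    # One-versus-one needs one binary classifier per pair of classes: K(K - 1)/2.
    return k * (k - 1) // 2

def num_ova_classifiers(k):
    # One-versus-all needs one binary classifier per class: K.
    return k

def confidence(dv_pos, dv_neg):
    # Confidence proxy: difference of the positive and negative sensing-line voltage drops.
    return dv_pos - dv_neg

def ova_decision(dv_pos, dv_neg):
    # Select the binary classifier (class index) with the highest confidence.
    p = [confidence(a, b) for a, b in zip(dv_pos, dv_neg)]
    return max(range(len(p)), key=p.__getitem__)

# Hypothetical voltage drops in mV for a 4-class example.
dv_pos = [120, 90, 200, 150]
dv_neg = [100, 130, 80, 160]
decision = ova_decision(dv_pos, dv_neg)

ovo_10, ova_10 = num_ovo_classifiers(10), num_ova_classifiers(10)
ovo_1000 = num_ovo_classifiers(1000)
```

For K = 10 the ratio of OVO to OVA classifiers is 45/10 = 4.5, which is the (K − 1)/2 reduction factor quoted for the OVA approach.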
B. Circuit Level Design and Simulation Results
The proposed multi-class classifier is designed in SPICE and demonstrated based on the reduced MNIST dataset. The OVA circuits and architecture of the MOSFET array are described in this section.
1) MAC Array: To classify the downsampled 9 × 9 MNIST digits, ten binary classifiers (see Section II) are co-designed in SPICE, yielding an 81 × 10 MOSFET array. Each of the transistors within the MAC array is exploited for generating a single feature-weight product. During inference, the $V_{sen}^{+}$ and $V_{sen}^{-}$ lines are precharged to $V_{DD}$ prior to each prediction. All the input features and feature weights are connected simultaneously to, respectively, the body and gate terminals of the individual multiplier transistors, facilitating a parallel classification process within all ten binary classifiers. As a result, 20 different voltage drop values (i.e., $V_{sen}^{+}(i)$, $V_{sen}^{-}(i)$, i = 0, 1, ..., 9) are generated on the individual sensing lines, as shown in Fig. 5. The voltage waveforms of the positive and negative sensing lines are illustrated by, respectively, the blue dotted and solid red lines. These voltage drops are exploited by the confidence level extractor, generating the final classifier decision. The confidence level of each binary classifier (as determined based on (8)) is shown in Fig. 5, as noted on top of each plot. For example, for digit 7, the 7th (7-vs-all) binary classifier has, as expected, the highest confidence level. The final decision is generated by the confidence level extractor.
2) Confidence Level Extractor: The schematic of a single confidence driven selector for multi-class classification is presented in Fig. 6(a). For a K-class classification, K(K − 1)/2 confidence driven selectors are required, yielding a total of 45 selectors for the MNIST dataset (K = 10). The circuit is designed to compare the confidence levels of two binary classifiers (the i-th and j-th classifiers) and determine the classifier with the higher confidence level,
$$\Delta V_{sen}^{+}(i) - \Delta V_{sen}^{-}(i) \lessgtr \Delta V_{sen}^{+}(j) - \Delta V_{sen}^{-}(j). \tag{9}$$
To simplify the circuit level implementation, the subtraction in (9) is replaced with summation by moving the negative terms to the opposite sides of the comparison,
$$\Delta V_{sen}^{+}(i) + \Delta V_{sen}^{-}(j) \lessgtr \Delta V_{sen}^{+}(j) + \Delta V_{sen}^{-}(i). \tag{10}$$
Each summation in (10) is captured with two parallel NMOS transistors, as shown in Fig. 6(a). To capture the result of $\Delta V_{sen}^{+}(i) + \Delta V_{sen}^{-}(j)$, the gate terminals of the transistors $M_1$ and $M_2$ are connected to, respectively, the positive sensing line of the i-th classifier, $V_{sen}^{+}(i)$, and the negative sensing line of the j-th classifier, $V_{sen}^{-}(j)$. As a result, the drain current at $M_1$ and $M_2$ is proportional to $\Delta V_{sen}^{+}(i) + \Delta V_{sen}^{-}(j)$. Similarly, the gate terminals of the transistors $M_3$ and $M_4$ are connected to, respectively, $V_{sen}^{+}(j)$ and $V_{sen}^{-}(i)$, generating a drain current proportional to $\Delta V_{sen}^{+}(j) + \Delta V_{sen}^{-}(i)$. To determine which side exhibits the higher confidence level (i.e., sinks lower current), two back-to-back inverters are utilized. The voltage waveforms of the sensing lines, the EN signal that enables the back-to-back inverters, the Reset signal that resets the voltages stored on both sides of the back-to-back inverters, and the output signals are illustrated in Fig. 6(b) for six consecutive classification periods. During the second, third, and fifth classifications, the left side of the inverters sinks higher current than the right side. The left and right sides of the confidence selector are, therefore, forced to, respectively, the low (i.e., D(i) = 0) and high (i.e., D(j) = 1) output voltage. Alternatively, during the first, fourth, and sixth classifications, the right side sinks higher current than the left side. Thus, the left and right sides are forced to, respectively, the high (i.e., D(i) = 1) and low (i.e., D(j) = 0) output voltage.
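At the behavioral level, the pairwise selectors can be modeled as below. The sketch only implements the algebraic comparison of the two summations and abstracts away the current polarity of the circuit; the convention that the output of the higher-confidence side is 1, the rule that the class with the most pairwise wins is selected, and the voltage drop values are assumptions of the example.

```python
from itertools import combinations

def selector(dv_pos, dv_neg, i, j):
    # Compare the summations dV+(i) + dV-(j) and dV+(j) + dV-(i).
    # Returns (D_i, D_j), with 1 marking the higher-confidence classifier.
    left = dv_pos[i] + dv_neg[j]
    right = dv_pos[j] + dv_neg[i]
    return (1, 0) if left > right else (0, 1)

def extract_decision(dv_pos, dv_neg):
    # Run all K(K - 1)/2 pairwise selectors and pick the class with the most wins.
    k = len(dv_pos)
    wins = [0] * k
    for i, j in combinations(range(k), 2):
        d_i, d_j = selector(dv_pos, dv_neg, i, j)
        wins[i] += d_i
        wins[j] += d_j
    return wins.index(max(wins))

# Hypothetical voltage drops in mV for K = 10 binary classifiers.
dv_pos = [(37 * i) % 101 + 100 for i in range(10)]
dv_neg = [50] * 10

n_selectors = len(list(combinations(range(10), 2)))
decision = extract_decision(dv_pos, dv_neg)
```

Since the comparison of the summations is algebraically identical to comparing the differences, `decision` coincides with the index that maximizes `dv_pos[i] - dv_neg[i]`.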
The correct functionality of the confidence level extractor depends on its symmetric structure and is highly sensitive to process variations. To mitigate process variations, larger pull-down ($W = 5W_{min}$), pull-up ($W = 15W_{min}$), and confidence extractor ($W = 5W_{min}$) transistors are utilized for the confidence level extractor. With upsized transistors, the average accuracy degradation of the classifier is limited to 2% under process variations, as described in the next section. Note that conventional methods, such as extracting the final classification results with K analog-to-digital converters (ADCs), can also be leveraged, trading off power efficiency for scalability (i.e., (K − 1)/2 times fewer units than with confidence extractors).
3) Resistive Voltage Divider: The trained, quantized feature weights are generated using a resistive voltage divider (see Fig. 7). In this configuration, the preferred voltage range ($V_{low}$, $V_{high}$) is divided into $2^n$ levels (i.e., $2^n - 1$ equal steps), where n is the preferred quantization resolution in bits. The gate bias range (i.e., 300 mV to 610 mV) is quantized with 5-bit resolution into 32 levels separated by 31 equal steps of 10 mV, as illustrated in Fig. 7. Poly resistors with a sheet resistance of 7.8 Ω/□ are utilized. The resilience of the voltage divider to process variations is evaluated with a 1K-run Monte-Carlo simulation. Based on the simulation results, the circuit is highly resilient to process variations, exhibiting an average deviation of 0.2 mV from the nominal weight values. The results of the 1K-run Monte-Carlo simulation are also shown in Fig. 7.
With this topology, no memory and data conversion units are required for storing and quantizing the weights. The weights provided by the voltage divider are, however, not reconfigurable. Using a 32-to-1 multiplexer and a memory unit (e.g., SRAM), the circuit can be updated to provide reconfigurable feature weights [32]. The overall area of the classifier with reconfigurable weights is expected to increase by a factor of 4.2.
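The tap voltages of the divider follow directly from the numbers given above (300 mV to 610 mV, 5-bit resolution). The sketch below lists them and selects the tap closest to a target gate bias; the function names and the target value are hypothetical.

```python
def divider_levels(v_low_mv, v_high_mv, bits):
    # 2**bits tap voltages separated by 2**bits - 1 equal steps.
    n_levels = 2 ** bits
    step = (v_high_mv - v_low_mv) / (n_levels - 1)
    return [v_low_mv + k * step for k in range(n_levels)]

def nearest_tap(v_target_mv, levels):
    # Index of the tap closest to the requested gate bias, i.e., the 5-bit select code
    # that a 32-to-1 multiplexer would use in a reconfigurable version.
    return min(range(len(levels)), key=lambda k: abs(levels[k] - v_target_mv))

levels = divider_levels(300, 610, 5)
step_mv = levels[1] - levels[0]
code = nearest_tap(437, levels)  # hypothetical target of 437 mV
```

With these values, the 32 taps are spaced 10 mV apart and a target of 437 mV maps to the 440 mV tap.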
IV. RESULTS

A. System Characteristics
A schematic of the integrated system is illustrated in Fig. 8, comprising the voltage divider, MOSFET array, and confidence level extractor. [...] Fig. 9(a) and Fig. 9(b), exhibiting equal accuracy of 90%. The ML classifier generates predictions at a frequency of 100 MHz, exhibiting an average energy consumption of 6.2 pJ per classification of a single digit. To maintain high prediction accuracy, 5 bits and 6 bits are assigned for quantizing, respectively, the feature weights and input features. By increasing the dimensionality of the proposed classifier, lower power and area overheads can be traded off for higher prediction accuracy, approaching the theoretical limit of 92% for MNIST classification with linear ML algorithms and the OVA decision scheme. The existing tradeoffs between the dimensionality of the data and the accuracy, power, and area overheads are shown in Fig. 10. The knee point of N = 81 is selected in this paper to provide satisfactory accuracy in a power and area efficient manner.
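As a quick consistency check on these figures, the average power implied by 6.2 pJ per classification at 100 MHz can be worked out as follows. The calculation assumes, for illustration only, that one classification completes in every clock cycle with no idle periods; it is not a reported measurement.

```latex
P_{avg} = E_{class} \cdot f_{clk}
        = 6.2\,\mathrm{pJ} \times 100\,\mathrm{MHz}
        = \left(6.2 \times 10^{-12}\,\mathrm{J}\right) \times \left(10^{8}\,\mathrm{s^{-1}}\right)
        = 6.2 \times 10^{-4}\,\mathrm{W}
        = 0.62\,\mathrm{mW}
```

At a lower prediction rate the average power scales down proportionally, while the energy per classification is the quantity reported in the text.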
B. Simulation Results
Performance characteristics are listed in Table I for the proposed system along with the existing state-of-the-art mixed-signal classifiers [2], [16], [17]. Note that a different dataset, MIT-CBCL, is used in [17]. Classification accuracy is a strong function of the data. The accuracy comparison between [17] and the other classifiers in Table I is therefore less meaningful, despite the excellent accuracy demonstrated in [17]. Benefiting from the high-resolution multiplications and confidence driven predictions, the CORE classifier exhibits a significantly lower transistor count, and thus lower power consumption and smaller IC area, as compared with the other state-of-the-art classifiers. For fair comparison, current time per decision [...] Table I. Note that the operational frequency is scalable and can be adjusted based on the application needs and constraints.
To evaluate the effect of voltage and temperature variations on the CORE classifier, the supply voltage is varied between 0.6 V and 1.2 V and the temperature is varied between −30 °C and 125 °C. The effect of process variations on the classifier performance is evaluated based on a 1K-run Monte-Carlo simulation on a randomly selected, balanced 100-observation test set (10 images per digit) with a nominal accuracy of 90%. Note that a 1K-run Monte-Carlo simulation on the whole test set takes approximately 1,000 × 2.5 hours (i.e., 2,500 hours) on an Intel Core i7-7700 CPU. The results of the simulations are shown in Fig. 11. The classifier exhibits no sensitivity to supply voltage variations within the wide range from 0.8 V to 1.2 V. Less than 2% accuracy degradation is observed at low temperatures, −30 °C ≤ T ≤ 0 °C. No sensitivity to temperature variations is observed for 0 °C ≤ T ≤ 125 °C. An average of 2% accuracy degradation is exhibited due to process variations, as extracted from the 1K-run Monte-Carlo simulation.
Confidence histograms of the correct and incorrect classifications are shown in, respectively, Fig. 12(a) and Fig. 12(b). With the proposed confidence driven approach, incorrect classifications often exhibit lower confidence as compared with the typically confident, correct predictions. The odds of an incorrect classification being corrected under process variations are therefore high, favorably affecting the resilience of the system to variations. Based on the simulation results, the accuracy is improved for nearly one-third of the Monte-Carlo runs.
V. CONCLUSION
Several state-of-the-art mixed-signal classifiers have recently been demonstrated for power efficient classification. Accurate classification of multi-dimensional data under tight power and area constraints is the primary objective in modern on-chip classifiers. A novel circuit topology is proposed in this paper for high-COnfidence and high-REsolution (CORE) classification. With this topology, the body terminals of the MOSFETs are exploited to encode input features, enabling high-resolution classification.
To enhance the ML integrity in multi-class classifiers, the OVA technique is exploited for efficiently extracting a final decision based on the confidence level of the individual predictors. For a K-class classification, (K − 1)/2 times fewer binary classifiers are required with the OVA approach as compared with the traditional OVO method [16]. To further reduce the area and power consumption of the OVA-based CORE classifier, a lightweight confidence extractor is designed, generating the final decision based on the confidence level of the individual binary classifiers. To the best of the authors' knowledge, the proposed CORE classifier is the first integrated system to successfully classify the MNIST dataset in the subthreshold region using a single-MOSFET MAC. Biasing transistors in the subthreshold region significantly decreases the leakage and dynamic currents as well as the overall load on the sensing lines.
The proposed CORE classifier is designed in SPICE and simulated in a 45 nm standard CMOS process. The performance and functionality of the proposed approach are validated with simulation results, exhibiting 90% classification accuracy with 6.2 pJ energy consumption per prediction across the MNIST dataset. Each prediction is finalized within a single clock cycle of 10 ns. The unique topology of the CORE classifier supports ML integrity under a wide range of PVT variations, as well as system scalability across technology nodes.
