This letter presents an energy-efficient VLSI architecture for SVM classification. Instead of exact arithmetic, cost-reduced computing elements based on approximation techniques carry out the computation-intensive operations of the SVM-based classifier, saving energy and resources. In addition, a partial parallel structure removes the dimensional constraint on classifier inputs and balances classification speed against energy consumption. The proposed design is implemented in a 55-nm CMOS process. It occupies an area of 0.0901 mm² and consumes 15.9 mW at an operating frequency of 100 MHz and an operating voltage of 1 V. Experiments show that the design reduces area by 41.5% and energy consumption by 61.8% compared with the baseline model.
1 Introduction
Machine learning is a popular topic in this data-intensive era, because machine learning algorithms can uncover regularities hidden in big data across many domains. Among these algorithms, the support vector machine (SVM) has been widely used owing to its computational efficiency and robustness [1]. However, designing an energy-efficient SVM-based classification system is difficult for many real-world problems: non-linear classifiers are needed to achieve high classification performance on complex datasets, which produces more support vectors and increases energy and resource consumption.
Several schemes have been proposed to address these problems. Some methods obtain a similar but easier-to-implement model by adjusting its parameters. For example, classifier parameters are rounded to the nearest power of two so that multipliers in the corresponding hardware design can be replaced with shift operations, reducing both area and power [2]. However, this degrades classifier performance in many applications. Other methods look for a simple classifier suited to a specialized application. For example, two linear SVM-based classifiers are used in place of one non-linear SVM-based classifier to improve sensitivity and specificity simultaneously in a patient-specific application [3]. Still other methods improve the structure of classifiers. For example, cascaded classifiers have been proposed for applications in which the data distribution is imbalanced between classes [2, 4, 5]. In this case, multiple SVM-based classifiers with different performance are arranged in order of computational complexity and accuracy. A large share of the data can be classified with little computation in the early stages, which yields significant energy savings compared with monolithic SVM classification [5]. Nevertheless, the complex computation in the later stages becomes a bottleneck unless the complexity of the hardware architecture is reduced.
In this letter, we propose a flexible hardware architecture with high accuracy and low energy consumption for the non-linear SVM-based classifier. A new cost-reduced computing component based on approximation techniques replaces the computation-intensive logic of the architecture. Furthermore, to improve the flexibility and applicability of the architecture, a partial parallel structure with optimal architecture parameters is adopted to remove the dimensional constraint on the classifier's inputs.
2 Overview of support vector machine
SVM is widely applied to many real-world classification applications [6] and has become an extremely successful discriminative classifier for two-class problems, where it is common to label the class with fewer samples as positive and the class with more samples as negative. The goal of the SVM algorithm is to find a hyperplane that has the maximum margin from the two classes and separates their data samples efficiently. To obtain the classification function of an SVM, the informative samples that determine the shape of the hyperplane must be found; these samples are called support vectors (SVs). In addition, kernel functions and slack variables are applied to solve non-linear classification problems. The most commonly used kernel functions are the linear, polynomial, and radial-basis function (RBF) kernels. In this letter, the RBF-based SVM is chosen for its strong applicability and generality. The final classification decision function is as follows:

$$f(x) = \operatorname{sgn}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right) \qquad (1)$$

where $\alpha_i$ is the Lagrange multiplier, $y_i$ is the class label of an SV, $x_i$ represents an SV, $x$ is the input vector, $K(x_i, x)$ is the kernel function, and $b$ is the bias. For the RBF kernel, $K(x_i, x) = \exp(-\gamma \lVert x_i - x \rVert^2)$.
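As a rough software illustration of the decision function described above (a signed sum of kernel evaluations over the SVs plus a bias), the following minimal sketch evaluates an RBF-based SVM in floating point. The function names and the toy model values are hypothetical and are not taken from the letter; the hardware architecture replaces the exact square and exponential used here with approximations.

```python
import math

def rbf_kernel(sv, x, gamma):
    """RBF kernel: exp(-gamma * ||sv - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(sv, x))
    return math.exp(-gamma * sq_dist)

def svm_decide(x, support_vectors, alphas, labels, bias, gamma):
    """Return +1 or -1 from the sign of the kernel expansion plus bias."""
    score = sum(a * y * rbf_kernel(sv, x, gamma)
                for a, y, sv in zip(alphas, labels, support_vectors))
    score += bias
    return 1 if score >= 0 else -1

# Toy model with made-up numbers: one positive and one negative support vector.
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
print(svm_decide((0.1, 0.2), svs, alphas, labels, bias=0.0, gamma=0.5))
print(svm_decide((1.9, 2.1), svs, alphas, labels, bias=0.0, gamma=0.5))
```

Each input is assigned the label of the support vector it lies closest to in this toy setting, because the RBF kernel decays quickly with squared distance.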
3 Proposed hardware architecture
The main concern in implementing an energy-efficient hardware architecture for the RBF-based SVM classifier is the low-energy realization of the RBF. As Eq. (1) shows, there are two difficulties: square-intensive vector operations and a hardware-unfriendly exponential function. In this letter, we design cost-reduced yet adequate components based on approximation techniques to overcome these difficulties.
Squaring function approximation
The squaring function is a widely used fundamental arithmetic operation. Its exact result is commonly obtained with either a look-up table (LUT) or a multiplier. However, circuit complexity grows rapidly as the bit-width of the inputs increases. Recently, squaring function approximations have been presented that do not significantly affect algorithm performance. Linear approximations that use only simple operations such as shift, concatenation, and addition are proposed in [7]. A set of recursive Boolean equations is put forward to approximate squaring functions in [8]. An approximation based on a simple logarithmic interpolation is proposed in [9]. However, these existing methods are not suitable for the hardware implementation of SVM classifiers, as their errors cause a heavy drop in classification performance. Therefore, we put forward a new compensated approximation approach with lower error for n-bit two's complement data. The data A and its corresponding squaring function can be represented as
where $s$ represents the sign of $A$, and $A_s$ represents the result of a bitwise exclusive OR of each bit of $A$ with $s$; its representation is
The basic idea of the approach is to extract one basic part and two compensation parts from the representation of the squaring function. The general outputs of the approximate squarer are simplified into a set of Boolean equations for various input bit-widths, as follows:
where $n$ is greater than or equal to 7. When $n$ is less than 5, only $P_{basic}$ is used; otherwise, $P_{com1}$ and $P_{com2}$ are used to approximate the squaring function.
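The sign handling described above, where each bit of A is XORed with the sign bit s, can be checked in software. The sketch below is only an exact identity, not the approximate squarer itself: it relies on the fact that for two's complement data |A| = A_s + s, so A² = A_s² + 2·s·A_s + s. The function name is hypothetical, and the basic and compensation parts of the proposed approximation are not modelled.

```python
def sign_xor_square(a, n):
    """Exact square of an n-bit two's complement value via A_s = A xor s."""
    assert -(1 << (n - 1)) <= a < (1 << (n - 1)), "value does not fit in n bits"
    mask = (1 << n) - 1
    word = a & mask                     # n-bit two's complement encoding of a
    s = (word >> (n - 1)) & 1           # sign bit
    a_s = word ^ (mask if s else 0)     # XOR every bit of the word with s
    # |a| = a_s + s, hence a^2 = a_s^2 + 2*s*a_s + s (because s*s == s)
    return a_s * a_s + ((a_s << 1) if s else 0) + s

print(sign_xor_square(-5, 8), sign_xor_square(5, 8))
```

Working on A_s means the squarer only ever sees a non-negative operand, which is presumably what makes a compact set of Boolean equations for the partial products practical.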
Exponential approximation
Dedicated hardware for the exponential function is required to meet the energy and resource constraints. A number of approaches have been proposed in the literature. The Taylor series expansion is one of the oldest and most widely used. However, higher accuracy requires higher-order terms and their factorials, which results in more energy and resource consumption. An alternative is to use a LUT, but it is limited by the input range and the accuracy of the results.
In this section, we approximate the exponential function in two steps. First, the exponential term is converted to a power of 2. Second, the power of 2 is approximated piecewise linearly: the input range is split into several segments, each of which is approximated by a line. The approximation works well because all inputs are negative in SVM classification. The detailed representations are as follows:
where $y$ is equal to $1.5x$, floor($y$) is the largest integer not greater than $y$, and $delt$ is the fractional part of $y$.
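The two steps above can be sketched in software as follows. This is a minimal sketch under stated assumptions: exp(x) is taken as 2^y with y = 1.5x (1.5 approximates log2(e) ≈ 1.4427), 2^y is split into 2^floor(y) · 2^delt, and 2^delt is interpolated linearly on equal-width segments. The segment count of 4, the chord interpolation, and the function names are hypothetical choices; the letter's own segment boundaries and coefficients are not reproduced here.

```python
import math

SEGMENTS = 4  # hypothetical segment count, not taken from the letter

def pow2_frac(delt, segments=SEGMENTS):
    """Piecewise-linear estimate of 2**delt for 0 <= delt < 1 (one chord per segment)."""
    k = min(int(delt * segments), segments - 1)   # index of the segment containing delt
    lo, hi = k / segments, (k + 1) / segments
    y_lo, y_hi = 2.0 ** lo, 2.0 ** hi             # segment end points (constants in hardware)
    return y_lo + (y_hi - y_lo) * (delt - lo) / (hi - lo)

def approx_exp(x):
    """Approximate exp(x) for x <= 0 as 2**floor(y) * 2**delt with y = 1.5 * x."""
    y = 1.5 * x                  # x + x/2: one shift and one add in hardware
    n = math.floor(y)            # integer part of y
    delt = y - n                 # fractional part of y
    return math.ldexp(pow2_frac(delt), n)   # multiplying by 2**n is a shift

for x in (0.0, -0.5, -1.0, -4.0):
    print(x, approx_exp(x), math.exp(x))
```

Most of the remaining error in this sketch comes from replacing log2(e) with 1.5 rather than from the linear segments, so adding segments alone would not make it arbitrarily accurate.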
Approximation-based hardware architecture
A feature-based partial parallel architecture (PPA) is presented in Fig. 1 to complete the computation between a testing sample and an SV. The main computational units are a partial parallel vector unit (PPVU) and an exponential function unit (EFU), which are described in detail below:
1) PPVU: this unit is responsible for calculating $-\lVert x - y \rVert^2 \times \gamma$, where $x = (x_1, x_2, \ldots, x_n)$ represents an n-dimensional testing vector and $y = (y_1, y_2, \ldots, y_n)$ denotes a support vector. The squares in the norm are computed by approximate square units instead of multipliers, which saves energy and resources. An adder tree then sums these squares. To handle datasets of various dimensions, a partial parallel architecture is used to remove the dependence on data dimension. The final sum of squares is therefore obtained after several iterations, and the number of iterations is given by Eq. (9), where $N_f$ and $n_p$ represent the dimension of the inputs and the feature-based parallelism of the proposed architecture, respectively. As the parameter $\gamma$ is a power of two, the remaining operations can be completed with a shifter and an adder to obtain the target value.
2) EFU: this unit is responsible for calculating $\exp(x)$. As described above, the exponential function is transformed and represented by a new approximation. Implementing the exponential approximation occupies little hardware area, as only an adder and a shifter are used.
The classification speed of SVM-based classifiers is dominated by the number and dimension of SVs, and only the influence of dimension has been eased so far. Therefore, a sample-based parallel architecture is built on the feature-based architecture to ease the effect of the number of SVs; it is shown in Fig. 2. The architecture can handle several SVs simultaneously, which further improves the classification speed of SVM-based classifiers.
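The iterative behaviour of the PPVU can be modelled in a few lines. The sketch below is a behavioural model only and makes several assumptions that are not confirmed by the text: the iteration count is taken as ceil(N_f / n_p), exact integer squares stand in for the approximate square units, and γ is assumed to equal 2^(-gamma_shift) so that the scaling becomes a right shift. The function and parameter names are hypothetical.

```python
import math

def ppvu_model(x, sv, n_p, gamma_shift):
    """Behavioural model of the PPVU: returns (-gamma * ||x - sv||^2, iterations).

    x, sv       : integer feature vectors (fixed-point codes) of equal length N_f
    n_p         : number of features processed in parallel per iteration
    gamma_shift : assumes gamma == 2**(-gamma_shift), so scaling is a right shift
    """
    n_f = len(x)
    iterations = math.ceil(n_f / n_p)       # assumed form of the iteration count
    acc = 0
    for it in range(iterations):
        lanes = range(it * n_p, min((it + 1) * n_p, n_f))
        # Exact squares stand in for the approximate square units.
        squares = [(x[i] - sv[i]) ** 2 for i in lanes]
        acc += sum(squares)                 # adder tree followed by an accumulator
    return -(acc >> gamma_shift), iterations

x = [3, 1, 4, 1, 5, 9, 2, 6]
sv = [2, 7, 1, 8, 2, 8, 1, 8]
print(ppvu_model(x, sv, n_p=6, gamma_shift=1))
```

With eight features and six parallel lanes, the second iteration uses only two lanes; this idle capacity is one of the trade-offs to weigh when choosing the degree of feature parallelism.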
4 Experiment
This section evaluates the effect of the approximation techniques on classification performance, explores the optimal architecture parameters of the proposed design, and finally presents the corresponding hardware implementation details.
Experiment for the approximate model
To evaluate the classification performance, twenty datasets from the KEEL dataset repository [10] are selected. These datasets are chosen according to the dimension of the input vector and the imbalance ratio. Some datasets are multi-class and are therefore transformed into two-class problems for this study. Table I shows the detailed characteristics of the datasets: the total number of samples (#Sample), the dimension of inputs (#Dim), and the negative-to-positive imbalance ratio (#IR).
We adopt the geometric mean (G-mean), the area under the receiver operating characteristic curve (AUC) [11], and sensitivity as the metrics to assess the classification model obtained from the SVM algorithm. In addition, to be consistent with the hardware implementation, all parameters of the model are quantized according to Table II. These quantization parameters are chosen according to the influence of quantization on classification accuracy, and the quantized model is evaluated on the various datasets. Fig. 3 shows the classification error between the classifiers with and without the approximation techniques for the metrics above. The classification performance is affected to some degree by the approximation techniques. However, the maximum error occurs for G-mean and is only 4.4%. Moreover, performance on some datasets even improves: the fluctuation of classification results caused by approximation depends on the dimension and number of SVs, so the direction of the performance change varies with these two factors.
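For reference, sensitivity and G-mean can be computed from confusion-matrix counts as sketched below, using the standard definitions: sensitivity is the true-positive rate, specificity is the true-negative rate, and G-mean is the square root of their product. The counts in the example are invented to mimic an imbalanced dataset and are not results from this letter.

```python
import math

def sensitivity(tp, fn):
    """True-positive rate: fraction of positive samples classified as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: fraction of negative samples classified as negative."""
    return tn / (tn + fp)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))

# Invented counts: 10 positive and 90 negative samples (imbalance ratio 9).
tp, fn, tn, fp = 8, 2, 81, 9
print(sensitivity(tp, fn), specificity(tn, fp), g_mean(tp, fn, tn, fp))
```

Unlike plain accuracy, G-mean drops sharply when the minority (positive) class is poorly recognized, which is why it is a common choice for imbalanced data.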
Architecture parameter exploration
Different degrees of parallelism affect hardware implementation costs, which include latency, energy, and resource consumption. Therefore, architectures with various degrees of parallelism are implemented and evaluated experimentally. The power-delay product (PDP) is used to represent energy consumption. Fig. 4 shows the results of the feature-based parallel architecture for different degrees of feature parallelism. Based on these results, we choose a degree of feature parallelism of 6, as the energy consumption reaches its optimum with minimal resources and nearly stable latency. For the improved parallel architecture, Fig. 5 shows the trends of its hardware implementation costs. According to these trends, we choose the improved architecture with a sample-based parallel degree of 4, as the energy consumption is relatively low and the latency is reduced at the same time. Once the degrees of parallelism are determined, our energy-efficient architecture is determined as well.
The proposed architecture is described in Verilog HDL and synthesized using a commercial 55-nm CMOS standard cell library. We follow the typical ASIC design flow for synthesis, floorplanning, placement, and routing. Parasitic extraction is performed after layout generation. Fig. 6 shows the layout and implementation details of our design. Table III compares the implementation details of the proposed architecture with those of architectures in earlier published papers. The baseline design in the table is implemented with multipliers in the PPVU and a third-order Taylor expansion combined with region constriction in the EFU. According to the comparison, our proposed architecture saves resources by 41.5% and reduces energy consumption by 61.8% compared with the baseline design.
Although the designs in [12] consume less power than ours, our architecture has a higher degree of parallelism and completes the classification with lower latency.
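As a small worked example of the power-delay product used as the energy measure, the sketch below multiplies average power by latency. The power (15.9 mW) and clock frequency (100 MHz) are the figures reported for the proposed design; the helper name is hypothetical, and any latency in cycles for a full classification would have to come from the implementation results, so only the per-cycle figure is computed here.

```python
def pdp_joules(power_w, latency_cycles, freq_hz):
    """Power-delay product: average power (W) times latency (s)."""
    return power_w * latency_cycles / freq_hz

POWER_W = 15.9e-3   # reported power consumption
FREQ_HZ = 100e6     # reported operating frequency

per_cycle = pdp_joules(POWER_W, 1, FREQ_HZ)
print(f"{per_cycle * 1e12:.1f} pJ per clock cycle")
```

At these figures, each 10 ns clock cycle costs about 159 pJ, so the energy per classification scales directly with the number of cycles the chosen degrees of parallelism require.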
