This paper presents a hardware accelerator for sparse decision trees intended for FPGA applications. To the best of the authors' knowledge, this is the first accelerator of this type. Besides the hardware accelerator itself, a novel algorithm for the induction of sparse decision trees is also presented. Sparse decision trees are attractive because they require fewer memory resources and can be processed more efficiently by specialized hardware than traditional oblique decision trees. This is of particular interest in edge-based applications, where memory and compute resources as well as power consumption are severely constrained. The performance of the proposed sparse decision tree induction algorithm and of the developed hardware accelerator is studied using standard benchmark datasets from the UCI Machine Learning Repository. The results of the experimental study indicate that the proposed algorithm and hardware accelerator compare very favourably with existing solutions.
I. INTRODUCTION
Until the recent rise of Convolutional Neural Networks and other Deep Learning architectures, decision trees (DTs) were recognized as one of the three most popular predictive models in the machine learning field, together with Artificial Neural Networks and Support Vector Machines.
The decision tree predictive model was first presented in the literature more than 30 years ago [1], while axis-parallel DTs were introduced only a few years later [2]. Assuming that the classification problem is described by a set of n attributes, Ai (i = 1, …, n), an axis-parallel DT tests, in each tree node, a single attribute Ai of the test instance against a threshold ai: Ai > ai. Inducing an axis-parallel (also called orthogonal) DT amounts to selecting the attribute to be tested in each DT node (Ai) as well as the threshold value used in the comparison (ai). ID3 and C4.5, the two most commonly used algorithms for inducing axis-parallel DTs, are presented in [2].
Manuscript received 26 November, 2018; accepted 30 June, 2019. This work was partially supported by the Serbian Ministry of Education and Science (Project title: "Innovative electronic components and systems based on inorganic and organic technologies embedded in consumer goods and products", No. TR32016).
Although proposed long ago, axis-parallel DTs remain a topic of interest for the academic community [3]-[6].
Oblique decision trees are a generalization of axis-parallel DTs that allows multiple attributes to be tested in every DT node. As a result, oblique DTs are usually much smaller and provide higher classification accuracy than axis-parallel DTs. In oblique DTs, this multivariate test has the following form:

$$\sum_{i=1}^{n} a_i A_i + a_{n+1} > 0, \qquad (1)$$
where ai, i = 1, …, n + 1, are called hyperplane coefficients. The most important oblique DT induction algorithms are CART, proposed in [1], and OC1, presented in [7]. After the authors in [8] proved that finding the best oblique DT is an NP-complete problem, many oblique DT induction algorithms resorted to some kind of heuristic in order to find suboptimal hyperplane coefficients [9]-[12]. The authors in [13] use the HereBoy evolutionary algorithm [14] to solve this hard oblique DT induction problem. In our research, we use this approach as the starting point and modify it in order to support the induction of sparse oblique DTs.
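To make the node test from (1) concrete, the following minimal sketch evaluates it in software (illustrative code written for this paper's notation, not part of the proposed accelerator; all names are ours):

```python
import numpy as np

def oblique_node_test(a, x):
    """Evaluate the oblique node test (1): sum_i a_i * A_i + a_{n+1} > 0.
    a: hyperplane coefficients a_1..a_{n+1} (the last one is the free term)
    x: attribute values A_1..A_n of the test instance
    Returns True if the instance lies above the hyperplane."""
    return float(np.dot(a[:-1], x) + a[-1]) > 0.0

# An axis-parallel test A_2 > 0.5 is the special case of (1) with a single
# non-zero attribute coefficient: a = (0, 1, 0, -0.5) for n = 3.
a = np.array([0.0, 1.0, 0.0, -0.5])
x = np.array([1.0, 0.2, 7.5])
print(oblique_node_test(a, x))  # False, since 0.2 - 0.5 <= 0
```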
Recently, a huge effort in the machine learning community has been devoted to the compression and size reduction of prediction models, mainly Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs) [15]-[19]. The authors in [20], [21] exploit the benefits of minimizing the number of non-zero hyperplane coefficients in oblique DTs as well. However, the focus of these papers is on feature/attribute selection and the elimination of irrelevant or noisy features/attributes, while the percentage of non-zero elements after DT induction is not reported.
In this paper, we present the SDTI algorithm for the induction of sparse DTs and the SDTA hardware accelerator for sparse DTs. The proposed accelerator, intended for implementation in FPGAs, is particularly interesting for embedded and edge applications, where storage capacity, memory consumption, and bandwidth as well as computing capabilities are severely constrained. As a result of DT sparsification and the increased number of zero-valued hyperplane coefficients, SDTA requires less memory for storing DT parameters and provides a significant speedup by skipping zero-valued product terms in (1) and avoiding unnecessary operations.
The problem of the hardware implementation of DTs has been a focus of the research community for more than 15 years, resulting in a number of proposed architectures [22]-[28]. However, all previously proposed architectures are concerned with the acceleration of dense univariate, oblique, or non-linear DTs, or of ensembles composed of dense DTs. To the best of the authors' knowledge, the hardware architecture for the acceleration of sparse DTs presented in this paper is the first of its kind.
The rest of the paper is organized as follows. In Section II, we present the SDTI algorithm, which induces sparse DTs by forcing a high percentage of zero hyperplane coefficients in each DT node (1) without decreasing DT accuracy. In Section III, we introduce the SDTA hardware accelerator for sparse oblique decision trees intended for FPGA applications. The proposed hardware accelerator benefits from the sparsity in DTs and provides faster classification by computing only operations with non-zero operands. In Section IV, we give the experimental results of benchmarking the performance of our SDTI algorithm and SDTA accelerator using datasets from the UCI machine learning repository. Conclusions are given in Section V.
II. SPARSE DECISION TREE INDUCTION (SDTI) ALGORITHM
In this section, we present the application of the HereBoy evolutionary algorithm to the sparse DT building problem. In our SDTI algorithm, the hyperplane coefficients from (1) are masked and set to zero one by one in each DT node, while the HereBoy algorithm is used to fine-tune the remaining non-masked coefficients in order to improve the position of the hyperplane. This incremental sparsification is repeated until the desired percentage of hyperplane coefficients is set to zero. The DT itself is built using the classical DT building algorithm, with the iterative sparsification explained above performed at each DT node.
In order to evaluate the position of the hyperplane during the incremental sparsification procedure performed in each DT node, the fitness function from [13] is used, which measures the information gain of the split induced by the hyperplane:

$$\mathrm{fitness} = -\sum_{i=1}^{k}\frac{N_i}{N}\,\mathrm{ld}\,\frac{N_i}{N} + \frac{N_1}{N}\sum_{i=1}^{k}\frac{N_{1i}}{N_1}\,\mathrm{ld}\,\frac{N_{1i}}{N_1} + \frac{N_2}{N}\sum_{i=1}^{k}\frac{N_{2i}}{N_2}\,\mathrm{ld}\,\frac{N_{2i}}{N_2}, \qquad (2)$$

where ld stands for the binary logarithm, log2. The symbols in the fitness function are: k, the number of classes of the classification problem; N, the number of training instances associated with the current node; Ni, the number of training instances that belong to class i (out of the total of N instances); and N1i, the number of instances of class i located above the current hyperplane. The derived quantities are N1 = Σi N1i, the total number of instances above the hyperplane, N2 = N − N1, the number of instances below it, and N2i = Ni − N1i.
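Assuming the fitness in (2) is the standard information gain of the split, as reconstructed above, it could be computed as in the following sketch (illustrative code; the names are ours):

```python
import numpy as np

def fitness(Ni, N1i):
    """Information-gain fitness of a candidate hyperplane, cf. (2).
    Ni:  per-class counts of the N instances at the current node (length k)
    N1i: per-class counts of the instances above the hyperplane
    ld is the binary logarithm; empty partitions contribute zero entropy."""
    Ni = np.asarray(Ni, dtype=float)
    N1i = np.asarray(N1i, dtype=float)
    N2i = Ni - N1i                  # per-class counts below the hyperplane
    N, N1, N2 = Ni.sum(), N1i.sum(), N2i.sum()

    def entropy(counts, total):
        if total == 0.0:
            return 0.0
        p = counts[counts > 0] / total
        return -np.sum(p * np.log2(p))

    # gain = parent entropy - weighted entropies of the two half-spaces
    return entropy(Ni, N) - (N1 / N) * entropy(N1i, N1) \
                          - (N2 / N) * entropy(N2i, N2)
```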
In the SDTI implementation, a binary encoding scheme in which HereBoy works with a single chromosome, similar to [13], is also adopted. Given n attributes of a dataset, the chromosome consists of n + 1 values encoding the coefficients ai from (1), where each coefficient is encoded with l bits (a typical value for l being 20). Also, as in [13], the initial chromosome for the evolutionary algorithm is generated so that the corresponding hyperplane divides the training set and ensures that at least one instance from the training set lies on each side of the hyperplane.
The basis of our SDTI algorithm is the HBDT algorithm presented in [13]. Similar to the HBDT algorithm, in every step we use the HereBoy algorithm to find a sub-optimal hyperplane that splits instances from different classes into two disjoint regions. This is repeated recursively until the current region contains only instances of a single class, at which point a leaf is created and its output class label is set to that class value. The modification of the HBDT algorithm introduced in our SDTI algorithm concerns the incremental sparsification of the hyperplane in each DT node. For each hyperplane obtained by the evolutionary algorithm, the worst hyperplane coefficient is found: the coefficient whose replacement with zero minimally decreases the fitness value calculated using (2). Once found, the worst coefficient is masked and set to zero. This procedure is repeated with the remaining non-masked hyperplane coefficients until the desired percentage of sparsification is reached.
Algorithm 1. Sparse Decision Tree Induction (SDTI) algorithm.
SDTI(TI, sparsity)
  TI - set of instances used to build a DT; each instance is a vector of n numerical attributes plus the output class value
  sparsity - sparsification level for each DT node (percentage of hyperplane coefficients that will be masked and set to zero)
→ Create node root
→ If the values of output_class for all instances from the TI set belong to the same class, make node root a leaf, with the output label matching the appropriate class value
→ Otherwise {
  → Find the dominating class in the TI set, and set the class label of node root to that value
  → Repeat {
    → Using the HereBoy algorithm and fitness function (2), find the optimal position of the dividing hyperplane
    → Find, among the non-masked hyperplane coefficients, the coefficient hc_worst that, when replaced with zero, minimally reduces the fitness of the hyperplane, calculated by fitness function (2)
    → Mask and set hc_worst to zero
    → Calculate cs (current sparsity) as the percentage of zero elements among the hyperplane coefficients
  → } until cs is greater than or equal to sparsity
  → Using the best hyperplane, divide the TI set into two subsets: TIabove, containing the instances located above the hyperplane, and TIbelow, containing the instances located below the hyperplane
  → Create a new branch descending from node root, and create a sub-tree by calling SDTI(TIabove, sparsity)
  → Create a new branch descending from node root, and create a sub-tree by calling SDTI(TIbelow, sparsity)
→ }
→ Return root

Since the iterative sparsification explained above is repeated for each DT node, the resulting DT should have exactly the percentage of zero values per DT node defined by the sparsity argument. As shown in Section IV, for all 10 benchmarked datasets from the UCI repository, we are able to reach sparsification percentages above 60 % (in some cases, even close to 80 %) without a loss in accuracy.
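As an illustration, the inner sparsification loop of Algorithm 1 could be modeled in software as follows (a minimal sketch under our own naming, not the authors' code; fitness_of is a hypothetical helper evaluating (2), and the HereBoy fine-tuning of the remaining coefficients between masking steps is only indicated by a comment):

```python
def sparsify_node(coeffs, fitness_of, sparsity):
    """Incremental masking pass of Algorithm 1 (illustrative sketch).
    coeffs:     hyperplane coefficients a_1..a_{n+1} (list of floats)
    fitness_of: hypothetical helper evaluating fitness (2) of a coefficient
                vector against the node's training instances
    sparsity:   target fraction of zeroed coefficients, e.g. 0.6 for 60 %"""
    mask = [c == 0.0 for c in coeffs]
    while sum(mask) / len(coeffs) < sparsity:
        best_fitness, worst_idx = float("-inf"), None
        for i, masked in enumerate(mask):
            if masked:
                continue
            trial = list(coeffs)
            trial[i] = 0.0                      # tentatively zero coefficient i
            f = fitness_of(trial)
            if f > best_fitness:                # "worst" coefficient = the one
                best_fitness, worst_idx = f, i  # whose removal hurts least
        coeffs[worst_idx] = 0.0                 # mask and set hc_worst to zero
        mask[worst_idx] = True
        # SDTI would now let HereBoy fine-tune the non-masked coefficients
    return coeffs
```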
Similar to the HBDT algorithm, pruning of the DT is performed at the end in order to reach even better classification results. As a result, the size of the DT is decreased through the removal of redundant nodes. For this purpose, the Prune_DT algorithm reported in [13] is used.
III. HARDWARE ACCELERATOR FOR SPARSE DTS
The architecture for the hardware acceleration of sparse decision trees, called SDTA (Sparse Decision Tree Accelerator), is an evolution of the SMpL architecture introduced in [24]. Let us consider the DT shown in Fig. 1(a). A straightforward approach to hardware realization would be to implement the complete DT in hardware, as originally presented in [22]. However, this approach is not optimal because of the following key property of DTs. While an instance is being processed by a DT, only a subset of the nodes of the complete DT is visited, because each instance takes exactly one path from the root node to one of the DT leaves. For example, if the current input instance will eventually be classified into leaf node L4 from Fig. 1(a), it is processed only by DT nodes 1, 3, and 4, while nodes 2 and 5 remain inactive during the processing of that instance. Therefore, to classify an instance, there is no need to evaluate every node in the DT, but only a selected subset of nodes, at most one from each DT level. The SMpL architecture, originally presented in [24], which evaluates only the visited DT nodes during the classification of an instance, is more efficient than the architecture from [22] in terms of the number of separate node modules that need to be implemented in hardware. For example, Fig. 1(b) shows the implementation of the DT from Fig. 1(a) based on the SMpL architecture. The SMpL architecture consists of three separate modules, called Universal Nodes, each capable of evaluating any DT node located at the same depth within the DT. Three such modules are needed since the depth of the DT from Fig. 1(a) is three. The operation of the SMpL architecture is pipelined, so the instance processing throughput is not influenced by the depth of the implemented DT and depends only on the time required to evaluate a single DT node, which, in the case of the SMpL architecture, is proportional to the number of attributes, n, of the underlying classification problem the DT was created to solve.
As shown in Section II, for any classification problem, instead of learning a dense oblique DT, in which all n attributes would have to be processed in every DT node, we can learn a sparse oblique DT, which processes only a fraction p of the n problem attributes in each node. Using a sparse DT can be beneficial, particularly where instance processing speed is concerned, since the computation of each sparse DT node is 1/p times faster than the computation of a dense DT node. However, none of the previously proposed architectures for DT acceleration is able to benefit from this optimization opportunity.
The SDTA architecture that we propose is explicitly designed to take advantage of this optimization. The SDTA architecture is closely related to the SMpL architecture presented in [24], but is modified to skip all unnecessary computations arising from the fact that some hyperplane coefficients ai become equal to zero as a result of the sparsification introduced in each DT node.
The top-level architecture of the SDTA sparse DT accelerator is presented in Fig. 2. The SDTA sparse DT accelerator consists of M identical pipeline stages, where M is the maximum depth of a sparse DT that it can accelerate. Each pipeline stage is capable of evaluating the sparsified tests of all nodes found at the same depth in the DT and consists of three major modules: the attribute memory (module M1), the Decision Tree or Pass Through (DTPT) node (module M2), and a memory storing the relevant information about the nodes found at the same depth in the DT (module M3). The attribute memories of the pipeline stages form a chain, shifting the attribute values of the different instances being processed by the SDTA accelerator through the system. Each attribute memory is also connected to the corresponding DTPT node. The DTPT nodes form a pipeline chain as well.

Figure 3 presents the detailed architecture of one pipeline stage. The attribute memory, M1, is used to store the attribute values of the current instance. Module M2 normally calculates the position of the instance relative to the hyperplane associated with the selected node, skipping all product terms from (1) whose coefficients ai are equal to zero, and calculates the address of the node at the next level that should be visited next. This happens when the current instance has not yet been classified, i.e., when module M2 operates in the "Decision Tree" mode. If the current instance has already been classified by some previous pipeline stage, module M2 simply transfers the input class value to the next pipeline stage, operating in the "Pass Through" mode. Module M3 stores the necessary information about all nodes at the same level of the DT and consists of two memory units.
The first memory unit, SHCM, stores the sparse hyperplane coefficients of all nodes. For each non-zero hyperplane coefficient ai, its value is stored together with an attribute address increment. The address increment is used to fetch from the M1 memory the appropriate attribute, which should be multiplied by the current hyperplane coefficient. For example, if there is a total of 10 problem attributes and the non-zero hyperplane coefficients of the current DT node are the 1st, 3rd, 6th, 7th, and 10th coefficients, the corresponding attribute address increments would be 0, 2, 3, 1, and 3, respectively.
The second memory unit stores the following data for each node associated with the current pipeline stage: the number of non-zero hyperplane coefficients, NNZ; the base address, Base_Addr, at which the block of non-zero coefficients of the current node begins in the SHCM memory; the addresses of the left and right child nodes of the current node, AdrL and AdrR; and the class values of any child nodes that are leaves, ClassL and ClassR. If a child node is not a leaf, its corresponding class value is set to zero.
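In software terms, one entry of this node memory could be modeled as follows (field names follow the text; the Python types are illustrative, not the actual RTL widths):

```python
from dataclasses import dataclass

@dataclass
class NodeRecord:
    """One entry of the second memory unit of module M3 (a sketch)."""
    nnz: int        # number of non-zero hyperplane coefficients, NNZ
    base_addr: int  # Base_Addr: start of this node's block in SHCM
    adr_l: int      # AdrL: address of the left child node
    adr_r: int      # AdrR: address of the right child node
    class_l: int    # ClassL: class label if the left child is a leaf, else 0
    class_r: int    # ClassR: class label if the right child is a leaf, else 0
```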
During the evaluation of a DT node, module M2 computes (1), multiplying and accumulating only the attribute/hyperplane-coefficient product terms for which the hyperplane coefficient ai differs from zero. All product terms with a zero hyperplane coefficient are skipped, effectively speeding up the hyperplane computation.
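Putting the two memory units together, the sparse evaluation performed by module M2 in the "Decision Tree" mode can be sketched as follows (a software model under assumed interfaces, not the actual RTL; how the free coefficient a_{n+1} is stored is not specified in the text, so it is omitted here — it could, for instance, be folded in as one more product term with a constant input of 1):

```python
def evaluate_sparse_node(attrs, shcm, node):
    """Software model of module M2's sparse hyperplane evaluation (a sketch).
    attrs: instance attribute values from attribute memory M1
    shcm:  list of (coefficient, address_increment) pairs, cf. SHCM
    node:  a NodeRecord describing the current DT node"""
    acc, addr = 0.0, 0
    for j in range(node.nnz):
        coeff, incr = shcm[node.base_addr + j]
        addr += incr                 # jump directly to the next used attribute
        acc += coeff * attrs[addr]   # zero-valued product terms never appear
    return acc > 0                   # side of the hyperplane, cf. (1)

# Example from the text: 10 attributes, non-zero coefficients at positions
# 1, 3, 6, 7, and 10, stored with address increments 0, 2, 3, 1, and 3.
shcm = [(0.5, 0), (-1.0, 2), (0.7, 3), (2.0, 1), (-0.4, 3)]
node = NodeRecord(nnz=5, base_addr=0, adr_l=0, adr_r=1, class_l=0, class_r=2)
attrs = [1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 1.5, 0.0, 0.0, 0.2]
above = evaluate_sparse_node(attrs, shcm, node)
```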
Compared to SMpL architecture from [24] , the node processing time of the SDTA architecture is measured in clock cycles, equals: 
where n is the number of problem attributes and NNZ is the number of non-zero hyperplane coefficients ai. Clearly, since NNZ < n, the node processing time of the SDTA architecture is always strictly shorter than that of the SMpL architecture. The amount of node processing time reduction depends on the number of hyperplane coefficients set to zero during the induction of the sparse DT using the SDTI algorithm. The more coefficients are set to zero, i.e., the more DT node tests are sparsified, the greater the DT processing speedup that can be reached when using the SDTA instead of the SMpL architecture.
Since the instance processing throughput of both the SMpL and SDTA architectures directly depends on the individual node processing time, it is obvious from the previous discussion that the instance processing throughput of the SDTA architecture is higher than that of the SMpL architecture.
The instance processing throughput of the SMpL architecture is given by the following equation:

$$Throughput_{SMpL} = \frac{f_{clk}}{n}, \qquad (4)$$

where f_clk is the operating clock frequency of the accelerator.
The instance processing throughput of the SDTA architecture is more difficult to calculate, since it depends on the amount of sparsity present in the DT nodes along the path that the current instance takes through the DT while being processed. However, the worst-case instance processing throughput of the SDTA architecture can be calculated as

$$Throughput_{SDTA}^{worst} = \frac{f_{clk}}{\max\limits_{node \in DT} NNZ_{node}}. \qquad (5)$$
In the special case when all hyperplanes of a DT are sparsified by removing an identical number of coefficients (not necessarily located at the same positions), the instance processing throughput of the SDTA architecture is also easy to calculate:

$$Throughput_{SDTA} = \frac{f_{clk}}{NNZ}. \qquad (6)$$
In this case, we can easily calculate the expected speedup of the SDTA over the existing SMpL architecture as

$$Speedup = \frac{Throughput_{SDTA}}{Throughput_{SMpL}} = \frac{n}{NNZ}. \qquad (7)$$
From (7), it can be concluded that the speedup reachable when using the SDTA architecture directly depends on the amount of DT sparsification achieved during the DT induction phase.
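As an illustration with assumed numbers (not taken from Table I): for a problem with n = 40 attributes, a sparsification that leaves NNZ = 10 non-zero coefficients per node gives, according to (7), a speedup of

$$Speedup = \frac{n}{NNZ} = \frac{40}{10} = 4,$$

which is of the same order as the 2.75 to 4.67 times range reported in Section IV.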
IV. EXPERIMENTAL RESULTS
To compare the performance of the SDTI algorithm with the HBDT algorithm, the following datasets from the UCI machine learning repository [29] are used: Glass Identification (gls), Vehicle Silhouettes (veh), Statlog Heart Disease (hrts), Hepatitis Domain (hep), Wine Recognition (wine), Pima Indians Diabetes (pid), Page Blocks Classification (page), Waveform 40 (wav40), Heart Disease Cleveland (hrtc), and Wisconsin Diagnostic Breast Cancer (wdbc).
The instances with missing values are removed from the datasets, while all reported results are averages of five ten-fold cross-validation experiments. This assumes dividing the original dataset D into 10 non-overlapping subsets, D1, D2, …, D10, consisting of uniformly selected instances from D. During each cross-validation iteration, a DT is built using the instances from the D\Di set and tested on the Di set (i = 1, ..., 10). By repeating this procedure 5 times, 50 DTs are constructed in total for each dataset. The average classification accuracy is then calculated as the percentage of correctly classified instances. Additionally, the average DT size is expressed as the average number of leaves. Both the average classification accuracy and the average DT size are reported along with the corresponding 95 % confidence intervals. Similar to the approach from [13], 30 % of the instances from the training set are selected randomly to form the pruning set, since the DT pruning algorithm requires a separate pruning set.

Table I presents the results of experiments designed to compare the accuracy and size of dense DTs created using the HBDT algorithm [13] with those of sparse DTs of increasing sparsification created using the SDTI algorithm proposed in this paper. In Table I, for each UCI dataset, the results of the HBDT algorithm correspond to the results of the SDTI algorithm with a sparsity value of 0 %, since the two algorithms are identical when sparsification is not applied. The SDTA architecture is designed and implemented using the Xilinx Vivado Design Suite with default synthesis and implementation parameters, while the experimental measurements are performed on the Zynq UltraScale+ MPSoC ZCU102 Evaluation Board [30].
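The 5 × 10-fold cross-validation protocol described above can be summarized by the following sketch (build_dt and classify are hypothetical stand-ins for the SDTI algorithm with its pruning step and for DT evaluation, respectively; stratification details are simplified):

```python
import random

def five_times_ten_fold_cv(dataset, build_dt, classify):
    """Sketch of the evaluation protocol; dataset is a list of
    (attributes, class_label) pairs. Returns mean accuracy over 50 DTs."""
    accuracies = []
    for _ in range(5):                               # five repetitions
        data = dataset[:]
        random.shuffle(data)
        folds = [data[i::10] for i in range(10)]     # subsets D1..D10
        for i in range(10):                          # ten CV iterations
            test = folds[i]                          # Di
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            random.shuffle(train)
            cut = int(0.3 * len(train))              # 30 % -> pruning set
            prune_set, build_set = train[:cut], train[cut:]
            dt = build_dt(build_set, prune_set)
            hits = sum(classify(dt, a) == c for a, c in test)
            accuracies.append(100.0 * hits / len(test))
    return sum(accuracies) / len(accuracies)
```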
The first column of Table I identifies the UCI dataset. For each dataset, multiple sparsification percentages are reported in the second column. The given percentage refers to the percentage of zero hyperplane coefficients in each DT node after running the SDTI algorithm. The fact that all nodes use the same sparsification percentage is convenient for pipelining in the SDTA hardware accelerator, since each node then requires exactly the same number of computations, maximizing throughput. The third column shows the number of attributes of each dataset, which directly determines the maximum number of hyperplane coefficients in each DT node: without sparsification, the number of hyperplane coefficients in each DT node equals the number of attributes plus one, according to (1). The fourth column presents the average DT size, equal to the average number of leaves in the DT. Please note that the DT size increases as a consequence of the sparsification of the DT nodes: the DT size grows as the DTs become sparser.
The fifth column, Mem, shows the relative change in the memory required to store the complete DT, calculated as the average number of internal nodes multiplied by the number of hyperplane coefficients. Negative percentages stand for decreased memory requirements, while positive ones represent the opposite. It is interesting to notice that, in some cases, even at high sparsification percentages, there is no reduction in the memory requirements for storing the DT, which is a consequence of the DTs growing larger after sparsification. In most of the tested cases, however, the memory requirements decrease after sparsification. The column Spd presents the speedup in classification throughput expressed with respect to the DT with 0 % sparsification. This is effectively a comparison between the instance classification throughput achievable when the DT is accelerated by the SMpL dense DT accelerator presented in [24] and by the SDTA sparse DT accelerator presented in this paper. The speedup is calculated from the sparsification percentage given in the second column of Table I, according to (7). The higher the sparsification percentage, the smaller the number of computations required in each pipeline stage, and the higher the speedup of the SDTA architecture over the SMpL architecture.
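As an illustration of the Mem calculation (with made-up numbers, not taken from Table I, and ignoring the per-coefficient address increments): a dense DT for a 10-attribute problem with 8 internal nodes stores 8 × 11 = 88 hyperplane coefficients. If sparsification leaves 3 of the 11 coefficients per node non-zero but grows the tree to 12 internal nodes, the sparse DT stores 12 × 3 = 36 coefficients, a reduction of about 59 %; a tree that grew to 30 internal nodes, however, would store 30 × 3 = 90 coefficients and need slightly more memory than the dense one.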
Finally, the last column of Table I shows the average accuracies of the DTs built for the different sparsification percentages and UCI datasets. In the conducted tests, only sparsifications resulting in an accuracy drop of less than 1 % compared to the accuracy of the non-sparse DT for the same dataset are accepted. The highlighted rows mark the highest sparsification percentages achieved with an acceptable accuracy drop for each dataset.
As it can be observed from the Table I , it is possible to achieve the significant sparsification DT levels for each of used UCI datasets without degrading the original, dense DT accuracy. For all presented UCI datasets, the sparsification levels greater than 60 % can be achieved. For some of them, the sparsification levels reach the value of almost 80 %.
As for the DT storage memory requirements, it can be observed that, for most of the used UCI datasets, a significant reduction in memory size can be achieved compared to the memory required to store a dense DT. The memory size reduction reaches up to 60 %. For some datasets and sparsification levels, however, the output DTs are significantly larger than the dense DTs, resulting in slightly increased requirements for storing the hyperplane coefficients, even after the majority of them have been eliminated during the SDTI algorithm run.
However, if we analyse the throughput speedup from Table I, it is clear that, for all used datasets, DT sparsification helps improve the instance classification throughput. The instance processing speedup when using sparse DTs over traditional dense DTs ranges from 2.75 up to 4.67 times, which is a significant improvement.
V. CONCLUSIONS
In this paper, a novel algorithm for inducing sparse DTs, SDTI, and a hardware accelerator architecture for sparse DTs, SDTA, are presented. Using sparse DTs instead of standard dense oblique DTs can be beneficial, since sparse DTs usually require fewer memory resources for storing their parameters and can be processed faster than dense oblique DTs.
To validate the performance of SDTI, the sparse DT building algorithm, and SDTA, the sparse DT hardware accelerator, a set of experiments using standard UCI machine learning repository datasets was conducted. The results of the experiments indicate that sparse DTs usually require significantly less memory, up to 60.4 % less storage capacity than dense oblique DTs, without any loss in DT accuracy.
The speedup experiments also indicate that the sparse DT hardware accelerator achieves an instance processing speedup of up to 4.67 times compared to the previously proposed dense DT hardware accelerator.
