Abstract. Decision trees and decision tree ensembles are popular machine learning methods, used for classification and regression. In this paper, an FPGA implementation of decision trees and tree ensembles for letter and digit recognition in Vivado High-Level Synthesis is presented. Two publicly available datasets were used at both the training and testing stages. Different optimizations for tree code and tree node layout in memory are considered. Classification accuracy, throughput and resource usage for different training algorithms, tree depths and ensemble sizes are discussed. The correctness of the module's operation was verified using C/RTL cosimulation and on a Zynq-7000 SoC device, using the Xillybus IP core for data transfer between the processing system and the programmable logic.
Introduction
Decision trees and decision tree ensembles have been successfully applied to a wide range of tasks, from body part recognition [13] to intrusion detection [16] and packet filtering in computer networks [11]. This versatility, combined with fast learning and classification, makes them the first-choice algorithm for many problems [2, 6]. Because of the simplicity of the model, decision trees and tree ensembles can be easily understood by humans (a so-called white-box model).
The structure of an ensemble of decision trees is presented in Fig. 1. Each non-trivial decision tree has a single root node, which is always a split node. Each split node has two child nodes, which can be either leaf nodes or subsequent splits. In each split node, a single condition is evaluated: if it is true, the sample goes to the left subtree; otherwise it goes to the right subtree. Depending on the type of the attribute on which the split is performed, the condition can be a numerical comparison or a test of the item's membership in a set. Finally, leaf nodes hold values that correspond to class labels in case of classification and to estimated values in case of regression.
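The two kinds of split condition can be sketched in plain C++ as follows. This is an illustrative fragment only, not the code used in the paper: the names `SplitKind`, `Split` and `goesLeft`, the strict `<` comparison and the encoding of the set as a 32-bit mask are all assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical split description (all names are illustrative).
enum class SplitKind { Numerical, Categorical };

struct Split {
    SplitKind kind;
    float threshold;          // used by numerical splits
    std::uint32_t classMask;  // categorical splits: bit k set => category k is in the set
};

// Returns true when the sample should continue in the left subtree.
// Assumptions: a numerical condition is "value < threshold"; a categorical
// condition is "category belongs to the set encoded by classMask", where the
// category is passed as a non-negative index smaller than 32.
inline bool goesLeft(const Split& s, float value) {
    if (s.kind == SplitKind::Numerical) {
        return value < s.threshold;
    }
    const std::uint32_t category = static_cast<std::uint32_t>(value);
    return category < 32u && ((s.classMask >> category) & 1u) != 0u;
}
```

With a mask of `0xA` (categories 1 and 3), for instance, a sample of category 1 goes left and a sample of category 2 goes right.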
Random forests were introduced in 2001 by L. Breiman [3]. They are based on the bagging technique and random selection of features [9, 10]. Random forests consist of multiple decision trees, trained according to the ... algorithm briefly described in section 2. In case of classification, the final class label is the one that received the most votes from individual trees; averaging is used in regression forests. Using multiple trees does not increase bias, while it decreases variance. In comparison to single decision trees, random forests require smaller depth to achieve comparable classification or regression accuracy.

Fig. 1: Structure of a random forest consisting of n decision trees. Nodes were numbered using the depth-first pre-order tree traversal algorithm.
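The aggregation step of a forest (majority voting for classification, averaging for regression) can be sketched as below. The function names and the tie-breaking rule (the lowest class index wins) are assumptions of this sketch; the source does not specify how ties are resolved in general.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the class with the most votes.
// Assumption: ties are resolved in favour of the lowest class index.
// Returns 0 for an empty vector.
inline std::size_t majorityVote(const std::vector<unsigned>& votes) {
    std::size_t winner = 0;
    unsigned winnerVotes = 0;
    for (std::size_t i = 0; i < votes.size(); ++i) {
        if (votes[i] > winnerVotes) {
            winner = i;
            winnerVotes = votes[i];
        }
    }
    return winner;
}

// Regression forests average the per-tree estimates instead.
// Returns 0.0 for an empty vector.
inline double averagePrediction(const std::vector<double>& treeOutputs) {
    if (treeOutputs.empty()) {
        return 0.0;
    }
    double sum = 0.0;
    for (double v : treeOutputs) {
        sum += v;
    }
    return sum / static_cast<double>(treeOutputs.size());
}
```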
Random forests can be effectively parallelized. In recent years, many parallel implementations of both random forest learning and prediction have been presented, targeted at multicore CPUs, GPGPUs and FPGAs [11, 13, 14, 15, 17]. Each sample from the input data set is classified independently of the others, which makes it possible to exploit data-level parallelism by processing different samples with the same classification/regression tree simultaneously. In random forests, samples are independently classified by all trees, creating another opportunity for parallelization. As already mentioned, trees in random forests require fewer levels to achieve acceptable performance; it is therefore possible to reduce the processing time of a single sample in such implementations.
Tree and ensemble training algorithms
Code generation for decision trees and tree ensembles presented in this paper is based on the assumption that a trained ensemble of decision trees is available. The machine learning module of the OpenCV library [2] was used to provide data structures representing individual trees and tree ensembles, together with the following training routines:
• In standard single decision tree training [4], one attribute and a class mask or threshold that best divide the data into classes are found at each level of the tree. The procedure is performed recursively and is stopped when: 1) the maximal tree depth is reached, 2) no change in predictive power can be achieved by adding further splits, 3) all samples at the node belong to the same class or variation is smaller than a specified threshold, 4) the number of samples at the node is too small.
• Random forest training builds a forest with a specified number of trees [3]. The maximal depth of the trees is also specified. Each tree in the forest is trained using a different, randomly selected dataset containing the same number of samples. At a given node, a random subset of features is used to find the best split.
• Extremely randomized forests are trained using an algorithm similar to that for random forests [8]. However, there are two important differences. Firstly, all trees are trained using the same dataset. Secondly, attributes and thresholds for split nodes are selected randomly.
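The two sources of randomness mentioned in the list above can be illustrated with a short sketch. It assumes that the per-tree dataset of a random forest is a bootstrap sample drawn with replacement (the usual form of bagging), and that a random threshold is drawn uniformly from a value range supplied by the caller. Both function names are hypothetical, and this is not the OpenCV training code.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draws n sample indices uniformly at random, with replacement, from [0, n).
// The result has the same size as the original dataset but usually contains
// repeated indices.
inline std::vector<std::size_t> bootstrapIndices(std::size_t n, std::mt19937& rng) {
    std::vector<std::size_t> indices;
    indices.reserve(n);
    if (n == 0) {
        return indices;
    }
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    for (std::size_t i = 0; i < n; ++i) {
        indices.push_back(pick(rng));
    }
    return indices;
}

// Draws a random split threshold between lo and hi (requires lo <= hi).
inline float randomThreshold(float lo, float hi, std::mt19937& rng) {
    std::uniform_real_distribution<float> dist(lo, hi);
    return dist(rng);
}
```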
Recognition rates for different ensemble configurations and training algorithms for the datasets used are given in section 5.
Design process in Vivado HLS
First high-level synthesis tools were developed for experimental and research purposes from the 1970s to the 1990s [5] . They mainly used custom languages and be- 
Design simulation and verification
Before the code is synthesized, C simulation is usually used to verify its operation at the behavioral level. This is done by comparing the output returned from the function with a golden reference, be it a precomputed set of values read from a file or the output from the synthesized function before changes were applied to it in order to ensure synthesizability. In Vivado HLS, test benches are C/C++ programs and the result of the simulation is signalled by the returned value.
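The comparison against a golden reference can be sketched as two small helpers that a test bench's `main()` might call. The names `countMismatches` and `testBenchStatus` are hypothetical; the only tool convention relied upon is that a return value of 0 signals a passed simulation and a non-zero value a failed one.

```cpp
#include <cstddef>
#include <vector>

// Counts positions at which the function output differs from the golden
// reference. A size mismatch is treated as a failure: the larger of the two
// sizes is returned so that the result is non-zero.
inline std::size_t countMismatches(const std::vector<int>& output,
                                   const std::vector<int>& golden) {
    if (output.size() != golden.size()) {
        return output.size() > golden.size() ? output.size() : golden.size();
    }
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < output.size(); ++i) {
        if (output[i] != golden[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}

// Value a test bench's main() would return: 0 on success, 1 on failure.
inline int testBenchStatus(const std::vector<int>& output,
                           const std::vector<int>& golden) {
    return countMismatches(output, golden) == 0 ? 0 : 1;
}
```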
After the C simulation is successfully finished and the code is synthesized, C/RTL cosimulation is used to ver- 
Performance and resource usage analysis
Apart from latency, each module is characterized by initiation interval, which specifies how often the module is 
Related work
Parallel implementations of random forests for URL classification on multicore CPU, GPGPU and FPGA are presented in [17] . Publicly available dataset is used by the au- 
Used datasets
Two publicly available datasets [1] for character recognition were used to verify the correctness and performance of the decision tree ensemble implementation. For both datasets a classification task is considered, i.e. a class label (a letter or a digit) is assigned to a sample based on the values of its numerical attributes.
For optical recognition of handwritten digits, dataset [12]
Tree nodes are represented by TreeNode structure instances, which consist of the following fields:
• Type of the node (all nodes): 2 bits are required to encode the type of the node: split or leaf, and categorical or numerical.
• Attribute index (split nodes): used to determine on which attribute the split is performed. The required number of bits for this field is equal to $\lceil \log_2 n \rceil$, where n is the number of attributes per sample.
• Right node offset (split nodes): this field occupies a number of bits equal to the depth of the tree when nodes are arranged using the depth-first pre-order algorithm.
• Node value (all nodes): in case of categorical splits or leaves it contains a class mask or a class label, respectively. For ordered nodes, this field stores an estimated value or a threshold.
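The field widths listed above can be checked with a small compile-time helper. The sketch assumes that the attribute index needs the smallest integer number of bits able to address n attributes. The function names are hypothetical, and the offset and value widths are passed in as parameters (9 and 32 bits in Alg. 1) rather than derived.

```cpp
#include <cstdint>

// Smallest number of bits able to index n distinct values, i.e. ceil(log2(n)).
// Returns 0 for n <= 1. Written recursively so that it is valid C++11 constexpr.
constexpr unsigned bitsFor(std::uint32_t n) {
    return n <= 1u ? 0u : 1u + bitsFor(n / 2u + n % 2u);
}

// Total width of one node: 2 type bits, the attribute index, the right node
// offset and the node value.
constexpr unsigned treeNodeWidth(std::uint32_t attributesPerSample,
                                 unsigned offsetBits,
                                 unsigned valueBits) {
    return 2u + bitsFor(attributesPerSample) + offsetBits + valueBits;
}

static_assert(bitsFor(64) == 6, "64 attributes need a 6-bit index");
```

For 64 attributes per sample, a 9-bit offset and a 32-bit value, `treeNodeWidth(64, 9, 32)` evaluates to 49 bits, which matches the sum of the field widths declared in Alg. 1.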
The ensemble function from Alg. 1 is pipelined with the lowest possible initiation interval to increase the throughput. The code that implements individual decision trees is placed within a for loop: the maximal number of iterations needed to reach a leaf node is equal to the number of levels in the tree. However, a leaf may be reached sooner; this case is handled by the break statement, preceded by incrementing the vote counter for the corresponding label.
• When a single array storing attributes is shared among all trees, the throughput of the module is reduced due to the limited number of read/write ports.
Results of the described optimizations are presented in section 9.
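The bounded traversal loop with an early exit, as described for Alg. 1, can be sketched in standard C++ as follows. This is a simplified stand-in rather than the synthesized code: `FlatNode` and `classifyTree` are hypothetical names, ordinary integer and float types replace the narrow `ap_uint` fields, only numerical splits are handled, and the condition `attribute < threshold` sending a sample to the left is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical plain-C++ stand-in for the TreeNode structure.
struct FlatNode {
    bool isLeaf;
    std::uint8_t attributeIdx;  // split nodes: attribute to test
    std::uint16_t rightOffset;  // split nodes: distance to the right child
    float value;                // split nodes: threshold; leaves: class label
};

// Walks one tree stored in depth-first pre-order: the left child directly
// follows its parent, the right child lies rightOffset entries ahead.
// The loop is bounded by maxLevels so that its trip count is known statically.
// Returns the class label, or -1 if no leaf is reached within maxLevels steps.
inline int classifyTree(const FlatNode* nodes, const float* attributes, int maxLevels) {
    std::size_t idx = 0;
    for (int level = 0; level < maxLevels; ++level) {
        const FlatNode& node = nodes[idx];
        if (node.isLeaf) {
            return static_cast<int>(node.value);  // early exit once a leaf is found
        }
        const bool left = attributes[node.attributeIdx] < node.value;
        idx += left ? std::size_t{1} : std::size_t{node.rightOffset};
    }
    return -1;
}
```

In an ensemble, the label returned for each tree would be used to increment the matching entry of a vote array before the winning class is selected.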
R. Kułaga, M. Gorgoń
    #define TREE_DEPTH 10
    #define ATTRIBUTES_PER_SAMPLE 64

    struct TreeNode {
        ap_uint<2> type;
        ap_uint<6> attributeIdx;
        ap_uint<9> rightOffset;
        ap_uint<32> value;
    };

    int32_t attributeTemp = attributes[currentNode.attributeIdx];
    float attribute = *reinterpret_cast<float*>(&attributeTemp);
    float split = *reinterpret_cast

    // code for remaining trees 1...7 is placed here

    ap_uint<4> winnerClass = 0;
    ap_uint<4> winnerVotes = 0;
    for (uint8_t i = 0; i < 10; i++) {
        ap_uint<4> currentVotes = votes[i];
        if (currentVotes > winnerVotes) {
            winnerClass = i;
            winnerVotes = currentVotes;
        }
    }
    *out++ = winnerClass;
    }

Alg. 1: Basic C++ implementation of an ensemble of 8 decision trees for letter and digit recognition with single attribute memory and floating-point attributes

is at level i = 0), therefore the total number of nodes is smaller than or equal to $\sum_{i=0}^{N-1} 2^i$, where N is the number of tree levels. However, the amount of available block RAM is usually a limiting factor only in case of low-throughput implementations of decision trees. In other cases, the number of available LUTs is usually more limiting.
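Since level i of a binary tree holds at most 2^i nodes, a tree with N levels has at most 2^N - 1 nodes in total. The sketch below turns this bound into a worst-case storage estimate for one tree. The function names are hypothetical, and the node width is passed in by the caller (49 bits is the sum of the field widths in Alg. 1).

```cpp
#include <cstdint>

// Upper bound on the node count of a binary tree with `levels` levels:
// sum_{i=0}^{levels-1} 2^i = 2^levels - 1. Valid for 0 <= levels < 64.
constexpr std::uint64_t maxNodes(unsigned levels) {
    return (std::uint64_t{1} << levels) - 1;
}

// Worst-case storage for one tree, in bits, given the width of a single node.
constexpr std::uint64_t treeStorageBits(unsigned levels, unsigned nodeBits) {
    return maxNodes(levels) * nodeBits;
}

static_assert(maxNodes(10) == 1023, "a 10-level tree has at most 1023 nodes");
```

For a 10-level tree with 49-bit nodes the bound is 1023 x 49 = 50127 bits; trained trees are often far from full, so actual usage may be much lower.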
Distributed memory can be instantiated in blocks of a specified size, whereas BRAM is available only in blocks of fixed size. Therefore, to ensure efficient usage of resources, distributed memory should be used for trees with a small number of nodes. When ensembles with a considerable number of nodes are implemented in reconfigurable logic, external static or dynamic memory can be used. In such cases care must be taken when defining the layout of nodes in memory in order to avoid frequent memory page misses, which can severely reduce the performance of the implementation [13].
Verification of results
Verification of letter and digit recognition was performed on three levels, according to the typical workflow for the Vivado HLS environment described in more detail in section 3.
1. In C simulation, recognition results obtained for the test set using OpenCV's implementations of decision trees, random forests and extremely randomized forests were saved to a file, then read in a test bench and used as a golden reference.
2. C/RTL cosimulation was performed directly after the C simulation was completed to ensure correct operation of the synthesized RTL description. 
Resource usage
Resource usage and performance parameters for different ensemble configurations for digit recognition are presented in Tab. 5. It is notable that utilization estimates provided by Vivado HLS were not accurate in most cases: LUT and FF usage figures from the C synthesis reports were considerably higher than the actual usage reported after Verilog RTL implementation. Moreover, after some optimizations the estimated resource utilization increased, yet implementation revealed that it was actually lower. It is worth mentioning that both C synthesis and RTL implementation took considerable amounts of time (in some cases reaching a few hours) on a workstation with an Intel i7-4770K CPU.
The highest resource utilization and classification throughput are achieved when wide sample input is used.
It is notable that in most systems this design will be I/O
It is notable that all tested software implementations were single threaded, but decision tree ensembles can be effectively implemented as multithreaded programs running on multicore processors [17].
Conclusions
In this paper, implementations of decision trees and tree ensembles for letter and digit recognition on Xilinx reconfigurable devices supported by Vivado High-Level Synthesis were presented. They can be used either as coprocessors for ARM and MicroBlaze processors or as part of an image processing and recognition pipeline implemented fully in hardware.
In contrast to software implementation, when tree ensembles are implemented in hardware they are subject to much stricter restrictions -both when it comes to maxi- 
