Random forest training on reconfigurable hardware by Cheng, Chuan
Imperial College London
Department of Electrical and Electronic Engineering
Random Forest Training on Reconfigurable
Hardware
Chuan Cheng
Submitted in part fulfilment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering Research of
Imperial College London
and the Diploma of Imperial College London, July 2015

Abstract
Random Forest (RF) is one of the most widely used supervised learning methods available. An
RF is an ensemble of decision tree classifiers with injection of several sources of randomness. It
demonstrates a set of improvements over single decision and regression trees and is comparable
or superior to major classification tools such as support vector machine (SVM) and adaptive
boosting (Adaboost) with respect to accuracy, interpretability, robustness and processing speed.
RF can generally be divided into a training process and a predicting process.
Recently, with the emergence of large-scale data mining applications, the RF training process
implemented in software on a single computer can no longer induce a complex RF model within
a reasonable amount of time. Alternative solutions involving computer clusters and GPUs usually
come with disadvantages with respect to Performance/Power ratio and are not feasible for
portable/embedded applications.
In this work a set of FPGA-based implementations of the RF training process is proposed.
FPGA devices allow construction of efficient custom hardware architectures and feature lower
power consumption than typical GPPs or GPUs, and are therefore suitable for portable/embedded
applications. The proposed hardware training architectures take advantage of different types
of inherent parallelism in the RF training algorithm and distribute the workload to a set of
parallel workers. Combining the parallel processing techniques with custom hardware designs
featuring low latency, the architectures are able to accelerate the training process without loss
in accuracy.
Copyright Declaration
The copyright of this thesis rests with the author and is made available under a Creative
Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy,
distribute or transmit the thesis on the condition that they attribute it, that they do not use it
for commercial purposes and that they do not alter, transform or build upon it. For any reuse
or redistribution, researchers must make clear to others the licence terms of this work.
Declaration
I herewith certify that the work presented in this thesis is my own work. All material in this
thesis which is not my own work has been properly acknowledged.
Chuan Cheng
Acknowledgements
I would like to express my gratitude to my supervisor Dr. Christos Bouganis for his professional
advice, encouragement and patience.
I also would like to thank my lovely wife, Liting, for her understanding and support during my
study.
Contents
Abstract
Copyright Declaration
Declaration
Acknowledgements
1 Introduction
1.1 Objectives and Contributions
1.2 Thesis Outline
1.3 Publications
2 Background
2.1 Background Knowledge on RF Training Process
2.1.1 Generic Decision Tree Training Algorithm
2.1.2 RF Training Algorithm
2.1.3 Very Fast Decision Tree
2.2 Related Works on Accelerating RF/DT Training Process
2.2.1 GPP platforms
2.2.2 GPU platforms
2.2.3 FPGA platforms
2.3 Conclusion
3 Hardware Framework and Task Parallelism Based Architecture
3.1 Hardware Framework for RF Training Process
3.1.1 Overview of Workflow
3.1.2 Top-level Architecture
3.1.3 Memory Array
3.1.4 Sorter Module
3.1.5 Split Module
3.2 Task Parallelism Based Architecture
3.2.1 Overview of Task Parallelism Scheme in Hardware
3.2.2 Task Parallelism Based Task Allocation Module
3.2.3 Task Parallelism Based Results Collection Module
3.2.4 Connection between Index FIFO and Memory Array
3.3 Evaluation
3.3.1 Training Accuracy of the Framework
3.3.2 Evaluation of Task Parallelism Based Architecture
3.4 Conclusion
4 Enhancement to the Framework and Data Parallelism Based Architecture
4.1 Enhancement to the Framework
4.1.1 Adjustable Parameter mtry
4.1.2 Support for Multiple-class (> 2) Classification
4.1.3 Training Data Compression Scheme
4.1.4 Evaluation of Data Compression Scheme
4.2 Data Parallelism Based Architecture
4.2.1 Overview of Data Parallelism Scheme in Hardware
4.2.2 Data Parallelism Based Task Allocation Module
4.2.3 Data Parallelism Based Results Collection Module
4.2.4 Evaluation of Data Parallelism Based Architecture
4.3 Conclusion
5 Incremental Training Architecture
5.1 Incorporation of the Hoeffding Tree Algorithm in the RF Training
5.2 Hardware Architecture
5.2.1 Overview
5.2.2 Mapping in External Memory and EMA
5.2.3 Hoeffding Bound Measurement
5.2.4 Results Storage Module
5.2.5 Predicting Module
5.3 Evaluation
5.3.1 Scalability
5.3.2 Training Speed
5.4 Conclusion
6 Conclusion
6.1 Summary of Thesis Achievements
6.2 Future Work
Bibliography
List of Tables
2.1 Summary of the previous works on implementing RF/DT training process
2.2 Difference among various FPGA-based implementations
3.1 Property of the training datasets under test
3.2 Comparison of training accuracy (OOB error)
3.3 Hardware utilisation of task parallelism based architecture
3.4 Training time for task parallel based architecture
4.1 Comparison of memory usage (bits per attribute per example)
4.2 Hardware utilisation of data parallelism based architecture
4.3 Training time for data parallelism based architecture
5.1 Summary of architecture configuration
5.2 Hardware utilisation of training module
5.3 Hardware utilisation of results storage
5.4 Hardware utilisation of predicting module
5.5 Properties of covertype dataset
5.6 Comparison results for training speed of incremental architecture
List of Figures
2.1 General architecture of a decision tree
2.2 General architecture of Random Forest
2.3 Illustration of how a decision tree grows
2.4 Breadth-first vs. Depth-first training
2.5 Illustration of VFDT algorithm
3.1 Workflow of DT training process
3.2 Top-level architecture of training framework
3.3 Architecture of memory array
3.4 Architecture of FIFO-based merge sorter
3.5 Two virtual FIFOs in a single RAM block
3.6 Example of buffering new input during a merge
3.7 Architecture of split module
3.8 Task parallelism based workflow of DT training process
3.9 Architecture of task parallelism based task allocation module
3.10 Architecture of task parallelism based results collection module
3.11 Architecture of multiplexing unit
3.12 Architecture of test environment
4.1 Workflow of DT training process with mtry > 1
4.2 Enhanced top-level architecture of training framework
4.3 Improved split quality measurement in hardware
4.4 A training dataset of five examples mapped in feature space
4.5 Architecture of the compression module incorporated into the training framework
4.6 Hardware implementation of the compression scheme
4.7 Mechanism linking compressed data i to real value data d_i
4.8 Vertical data parallelism based workflow of DT training process
4.9 Architecture of data parallelism based task allocation module
4.10 Architecture of data parallelism based results collection module
5.1 Incremental training workflow in hardware
5.2 Top-level architecture of incremental training architecture
5.3 Mapping of pointer groups in external memory
5.4 Hardware implementation of Hoeffding bound measurement
5.5 Illustration of parsing training results
5.6 Hardware implementation of results storage module
5.7 Hardware implementation of predicting module
5.8 External memory access with respect to the number of predicting units
5.9 Typical static power consumption of Stratix IV EP4SE680 FPGA
Chapter 1
Introduction
Random Forest (RF) [6] is an ensemble of decision tree classifiers and is one of the most widely
used supervised learning tools available. RF can be used for both classification and regression.
In various empirical tests, it demonstrates significant improvement in generalisation error
over single decision and regression trees by alleviating the overfitting problem, and it features
fast and robust learning performance, especially for data with noise and missing information.
It is comparable or superior to major classification tools including support vector machine
(SVM) and adaptive boosting (Adaboost) in many classification applications [6, 26, 54, 55, 47].
Its relatively high degree of interpretability also makes it popular in the field of data mining
and it has become one of the standard tools for research in bioinformatics.
The training process of RF, compared to the predicting process, is data intensive and can take
hours or even days to induce a high quality model from a large training dataset when
combined with a tuning process that requires repeating the training many times. In the past,
for many applications the training was a one-off task with a relatively small training dataset, hence
the training time was not of great concern. More recently, with the emergence of large-scale
data mining, even for a one-off training task the sequential training programs implemented in
software can no longer induce an RF model within a reasonable amount of time. For instance,
training the RF-based pose recognition model involved in [8] requires more than 300 CPU-days
on a single GPP workstation. To accomplish the training process within a reasonable time, a
235-machine cluster was built to accelerate the process.
Therefore a high-speed training solution is becoming necessary in order to improve productivity.
This demand is further fuelled by potential applications where dynamic information
is involved, so that the RF model needs to be updated within a short period of time. As
a result different approaches have been explored to accelerate the training process, including
optimised implementations for both CPU and GPU devices. Despite the fact that FPGAs
have been demonstrated to be ideal hardware platforms for accelerating many computationally
intensive and data intensive applications [25, 2, 40], little effort has been made to explore
FPGA architectures that are optimised for the RF training process, or its fundamental equivalent,
the decision tree induction process. FPGAs, with their flexible logic fabric and I/Os as well as
ultra-high on-chip memory bandwidth, have the potential to efficiently process the category
of computing problems that the RF training process belongs to, i.e. computing with a large
number of memory accesses and a large degree of data dependence plus inherent parallelism.
In addition, research shows that FPGA-based implementations are superior to CPU and GPU
based designs in terms of speed-up per unit of energy [52], making FPGAs the ideal platform
for low power applications. The motivation behind this work is to investigate how
well an FPGA-based implementation can perform in accelerating the RF training process by
proposing novel hardware architectures that exploit the features in FPGA devices.
FPGA-based RF training capability would enable a series of new applications. The low power
consumption of FPGAs is particularly suitable for mobile platforms such as robots and portable
personal devices. Having local training capability means that a device can collect local training
examples and update the predicting model without a large amount of data communication with a
remote facility. One potential application is object detection in space missions. The device can
carry an initial predicting model at first; later on, new negative examples (background images)
can be collected locally in order to adapt to a changing environment, while new positive examples
(items to be detected) can be collected by using only images with high confidence or by sending
these images back for manual verification. In this case, it is not necessary to send all training
examples back to Earth and receive a new predicting model from there.
1.1 Objectives and Contributions
Given the potential of FPGAs as a promising hardware platform for accelerating the
RF training process, it is essential to understand what characteristics are available in the RF
training algorithm that can be exploited in order to take advantage of the hardware. Meanwhile
it is also important to understand what hardware architecture will serve the algorithm best and
yield the best performance efficiently, and what limitations exist in FPGAs that may prevent
the architecture from performing well. These three major questions will be investigated in the
following chapters. In this work the focus is put on RF models trained for classification purposes.
As a result the two main contributions of this work are:
1. Identification of the characteristics in the RF training algorithm that can be optimised
for FPGA implementations.
2. Introduction of three FPGA-based architectures which achieve superior or comparable
training speed when compared to CPU and GPU implementations. The architectures are
significantly more efficient in terms of Performance/Power consumption ratio, and therefore
are suitable for embedded/portable applications.
1.2 Thesis Outline
In the first half of Chapter 2 background knowledge about the algorithms involved in the work
is introduced in detail. In the second half, a literature review is given on the related works that
have been done for accelerating the RF training process on various hardware platforms.
In Chapter 3 an FPGA architecture optimised for Task Parallelism is introduced. The architec-
ture exploits some of the inherent parallelism in the training algorithm in which the workload
involved at the same level of a decision tree can be processed independently. In addition to
the hardware design that enables this specific parallelism, a general hardware framework that
implements the RF training process is proposed. That includes a cycled workflow which im-
plements the Breadth First training strategy. Also included is a set of fundamental hardware
components that implement the essential steps in the training process. The training archi-
tectures introduced in other chapters will be based on the above framework as well as the
fundamental components.
In Chapter 4 a second FPGA architecture optimised for Data Parallelism is introduced. Again
the architecture takes advantage of another inherent parallelism in the training algorithm in
which different attributes of a partition of training data at a decision tree node can be assessed
in parallel. Apart from that, three hardware enhancements are added to the general hardware
framework, which bring improvements in terms of functionality.
In Chapter 5 the limitation in the size of on-chip memories in FPGAs in the context of accelerating
the RF training process is addressed. The approach is to turn the previous batch training
strategy into incremental training. A third training architecture is therefore introduced, incorporating
VFDT [20], a decision tree training algorithm that targets streaming data. As a part
of the VFDT algorithm, a novel hardware architecture that implements the RF predicting process
is also proposed.
Finally Chapter 6 concludes the thesis and introduces the future work that targets the problems
yet to be solved.
1.3 Publications
Chuan Cheng and Christos-Savvas Bouganis. Accelerating Random Forest training process
using FPGA. In Field Programmable Logic and Applications (FPL), 2013 23rd International
Conference on, pages 1-7. IEEE, 2013.
Chuan Cheng and Christos-Savvas Bouganis. Memory optimisation for hardware induction
of axis-parallel decision tree. In ReConFigurable Computing and FPGAs (ReConFig), 2014
International Conference on, pages 1-5. IEEE, 2014.
Chapter 2
Background
The chapter is divided into two parts. In the first part the background knowledge of RF training
process and the algorithms directly involved in the work will be introduced in detail. This is
followed by a review on the previous works on the topic.
2.1 Background Knowledge on RF Training Process
Random Forest (RF) is an ensemble of decision tree (DT) classifiers trained on random samples
of a training dataset $\chi_{train} = \{(x_m, y_m), m = 1, 2, \ldots, n\}$. Each pair $(x, y)$ in the dataset is a
training instance containing a vector of attributes $x_{train} = (x_1, x_2, \ldots, x_d)$ and an attached label $y$.
A typical DT is shown in Figure 2.1. To predict the label of an unseen input vector $x_{predict}$,
starting from the root node, the input is passed to one of the leaf nodes through a chain of
non-leaf nodes. Each non-leaf node contains an indicator function that is applied to the input
vector to determine which node to go to next. On reaching a leaf node, the label contained in
the node is then the prediction result for the input. The structure of the tree, the indicator
functions as well as the labels contained in the nodes are the elements that define a unique DT
classifier. These elements are induced during a single DT training process. RF combines the
DTs by using majority voting with one vote per tree over all trees in the ensemble. A general
architecture of RF is shown in Figure 2.2 where, for an input vector $x_{predict}$, each DT in the
ensemble produces a local label and the prediction result $y$ is then determined by the majority.
Intuitively the RF training process can be decomposed into a set of single DT training processes.
Figure 2.1: General architecture of a decision tree (root node, non-leaf nodes, leaf nodes)
Figure 2.2: General architecture of Random Forest (input x passed to Tree #1, Tree #2, ..., Tree #n, followed by voting to produce y)
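The prediction path described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the dictionary-based node layout, the function names `predict_tree` and `predict_forest`, and the choice of a single-attribute threshold test as the indicator function (with the "true" outcome going to the right branch) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical node layout (an assumption for this sketch):
#   leaf node:     {"label": y}
#   non-leaf node: {"attr": a, "thr": tau, "left": node, "right": node}

def predict_tree(node, x):
    """Pass x from the root through non-leaf nodes to a leaf; return its label."""
    while "label" not in node:
        # Assumed indicator function: [x[attr] >= thr] selects the right branch.
        node = node["right"] if x[node["attr"]] >= node["thr"] else node["left"]
    return node["label"]

def predict_forest(trees, x):
    """Majority voting with one vote per tree over all trees in the ensemble."""
    votes = Counter(predict_tree(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three toy trees, built by hand purely for illustration.
tree_a = {"attr": 0, "thr": 0.5, "left": {"label": 0}, "right": {"label": 1}}
tree_b = {"attr": 1, "thr": 2.0, "left": {"label": 0}, "right": {"label": 1}}
tree_c = {"label": 1}
forest = [tree_a, tree_b, tree_c]
```

With these toy trees, the input `[0.7, 1.0]` receives the votes 1, 0 and 1, so the ensemble predicts label 1.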
2.1.1 Generic Decision Tree Training Algorithm
A DT can be induced by many training methods [7, 42] which share the same fundamental
framework, i.e. splitting the training dataset recursively into many partitions in the feature
space. Figure 2.3 illustrates how a DT is grown. The figure uses as an example a two-dimensional
training dataset (i.e. training data with two attributes) with two different labels. The data
instances are mapped to a 2-D feature space. The training begins with a root node that
represents a data partition containing all the training data (a). Then a split criterion in the
feature space is determined to separate the training data into two partitions, which generates
two new nodes in the DT (b). After that the training data contained in each partition is further
split into smaller partitions until a stopping criterion is met (c), for instance when the data remaining
in a partition all have the same label or the minimum number of training instances required
for a further split is reached. The tree nodes that can no longer be split are leaf nodes (nodes
with colour in the figure); each leaf node is assigned a label which is determined by the majority
label in the partition. Those that can be split are non-leaf nodes. The information
that defines a split criterion is stored in each non-leaf node.
Figure 2.3: Illustration of how a decision tree grows (panels (a), (b) and (c))
A split criterion is also referred to as a weak learner. The optimal weak learner for each non-leaf
node is determined by the optimisation of an objective function $I$:
\[ \theta_i = \arg\max_{\theta} I(S_i, \theta) \tag{2.1} \]
where $\theta_i$ is the optimal weak learner selected out of all candidate weak learners $\theta$ at tree node $i$
and $S_i$ is the partition of training data contained in node $i$. A candidate $\theta$ is defined by:
\[ \theta = (\phi(v), \tau), \quad v \in S_i \tag{2.2} \]
where the function $\phi$ and threshold $\tau$ form the indicator function $[\phi(v) \geq \tau]$ that splits the partition
$S_i$ into two branches. Variable $v$ is a set of attribute values extracted from the vector $x_{train}$
in each instance contained in $S_i$. $\phi$ is defined by various types of weak learner. In this work
we target the axes-parallel weak learner which is the most widely adopted type in various RF
tools. The weak learner is defined as:
\[ \phi(v) = x_a \tag{2.3} \]
where $\phi(v)$ is simply equal to the value of attribute $a$ extracted from the vector $x_{train}$. In the
feature space an axes-parallel weak learner acts as a hyperplane (a line in 2-D space, as shown
in Figure 2.3). A candidate weak learner $\theta$ is produced by first selecting one attribute, $a$. Then
a list of values $\{x_{1,a}, x_{2,a}, \ldots, x_{m,a}\}$ can be extracted from the partition $S_i$, in which $x_{m,a}$ refers
to the value of attribute $a$ of the $m$-th vector $x_{train}$ in the partition. A threshold $\tau$ is then
determined by taking the median (i.e. the midpoint) of any two adjacent and non-identical values after sorting the
elements in the list. By selecting different attributes and thresholds, a group of candidate weak
learners can be produced for each partition $S_i$. Other types of weak learner include the oblique
weak learner [17, 36] that is defined as:
\[ \phi(v) = \sum_{i=1}^{d} a_i x_i + a_{d+1} \tag{2.4} \]
where $a_i$ is the coefficient for each attribute value $x_i$ in the vector $x_{train}$, with $a_{d+1}$ being a
constant. The axes-parallel learner can be considered as a special case of the oblique weak
learner. More complex, non-linear alternatives [11, 21] also exist. It is observed that for
certain applications, RF based on oblique or non-linear weak learners can achieve comparable
or higher accuracy when compared with the axes-parallel variant. However, with a significant
increase in processing complexity and a lack of general superiority in accuracy, these two weak
learners have drawn less attention than the simple but robust axes-parallel weak learner.
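Returning to the axes-parallel case, the way candidate weak learners are generated from a partition can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `candidate_thresholds` and `candidate_weak_learners` are hypothetical, and a partition is assumed to be a plain list of attribute vectors.

```python
def candidate_thresholds(values):
    """Midpoints of adjacent, non-identical values after sorting the list."""
    ordered = sorted(values)
    return [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:]) if lo != hi]

def candidate_weak_learners(partition, attributes):
    """Enumerate (attribute index, threshold) pairs for a partition.

    `partition` is assumed to be a list of attribute vectors; `attributes`
    lists the attribute indices that are allowed to form candidates.
    """
    return [(a, tau)
            for a in attributes
            for tau in candidate_thresholds([x[a] for x in partition])]
```

For the values 3.0, 1.0, 1.0 and 2.0 of one attribute, the sorted list is 1.0, 1.0, 2.0, 3.0 and the candidate thresholds are 1.5 and 2.5; the pair of identical values produces no threshold.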
The objective function $I$ used to determine the optimal weak learner is defined as:
\[ I = i(N) - P_L\, i(N_L) - P_R\, i(N_R) \tag{2.5} \]
where $N$ is the partition contained in a non-leaf node. $N_L$ and $N_R$ represent the left and right
partitions derived from $N$. $P_L$ and $P_R$ are the proportions of training instances reaching the
left and right branches respectively. $i(\cdot)$ is the impurity measurement, which is also method-dependent.
In this work we target the Gini impurity-based measurement. The measurement is
defined as:
\[ i(N) = \sum_{i \neq j} P_i P_j = \frac{1}{2}\left[ 1 - \sum_{j} P_j^2 \right] \tag{2.6} \]
where $P_i$ and $P_j$ refer to the proportions of training instances with class $i$ and $j$ in the partition
$N$ respectively, and the first sum counts each unordered pair of distinct classes once. Another
commonly used impurity measurement is the entropy impurity [42],
which is defined as:
\[ i(N) = -\sum_{j} P_j \log_2 P_j \tag{2.7} \]
where $P_j$ is the proportion of training instances with class $j$ in the partition $N$. Both measurements
are widely used and demonstrate comparable performance in accuracy [51]. In this work
Gini impurity is adopted since the same measurement is also used in the RF algorithm.
Figure 2.4: Breadth-first vs. Depth-first training (two trees, one per strategy, with nodes numbered in processing order)
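As an illustration of how Eq. 2.5 and Eq. 2.6 combine when a single attribute is searched for its best threshold, a minimal Python sketch is given below. The function names, the convention that instances with a value below the threshold go to the left branch, and the exhaustive search over midpoints are assumptions made for this example rather than a description of the hardware design.

```python
from collections import Counter

def gini(labels):
    """Gini impurity in the form of Eq. 2.6: 0.5 * (1 - sum_j P_j^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 0.5 * (1.0 - sum((c / n) ** 2 for c in Counter(labels).values()))

def objective(values, labels, tau):
    """Eq. 2.5 for the split [value >= tau] on one attribute.

    Assumption: values below tau form the left partition, the rest the right.
    """
    left = [y for v, y in zip(values, labels) if v < tau]
    right = [y for v, y in zip(values, labels) if v >= tau]
    n = len(labels)
    return (gini(labels)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

def best_threshold(values, labels):
    """Return the midpoint threshold maximising the objective, or None."""
    ordered = sorted(set(values))
    taus = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    if not taus:
        return None
    return max(taus, key=lambda t: objective(values, labels, t))
```

For attribute values 1, 2, 3, 4 with labels 0, 0, 1, 1, the threshold 2.5 separates the two classes perfectly, so both child impurities are zero and the objective equals the parent impurity of 0.25.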
Breadth-first vs Depth-first
The nodes in a DT can be induced in two different orders. As shown in Figure 2.4, in breadth-first
training the algorithm attempts to split all non-leaf nodes at the same level before moving
on to the next level. In depth-first training, by contrast, the child nodes of a parent non-leaf node are
always processed with priority until a child node can no longer be split. The two approaches
have different properties leading to different implementation designs. The breadth-first approach
allows multiple nodes to be processed in parallel since all the nodes at the same level can be
split independently. A typical design is to have a set of parallel workers, each being assigned one
node. However, given a large number of parallel workers, at the early stages of the training there
will not be sufficient nodes available to fill all the workers. As a result utilisation of the
processing power remains at a low level until more tree nodes are produced. In addition, the breadth-first
approach requires enough memory space to store information for all the nodes at the same
level. As the training process reaches deeper levels, the memory space requirement can become
a limitation for memory-limited devices. On the other hand, the depth-first approach produces far
fewer nodes that are ready to be split at any time during the training process. The memory space
needed to store the information of the nodes is significantly lower than in the breadth-first approach.
However, this also means that it is not suitable for parallel processing of different nodes since there
is not enough workload available at a time. Parallel processing is still possible within a single
node given that the workload involved in the search for the optimal weak learner can be split
into independent batches with certain communication overhead. This is especially the case in
the early stages of the training when the size of the training data partition in a tree node is
large and the workload involved is proportional to the partition size. As the training continues,
however, at a certain point the size of the partition in the nodes will be too small to fill all the
parallel workers available.
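The difference between the two orders, and its effect on the number of nodes waiting to be split, can be illustrated with a small Python sketch. The function name `induction_order`, the `children` mapping and the use of a queue versus a stack are assumptions for illustration only; the tree here is fixed in advance, whereas in real training the children are only known after a node has been split.

```python
from collections import deque

def induction_order(root, children, breadth_first=True):
    """Return (processing order, peak number of nodes waiting to be split).

    `children` maps a node id to its (left, right) child ids; nodes that
    do not appear as keys are treated as leaf nodes.
    """
    pending = deque([root])
    order, peak = [], 1
    while pending:
        # FIFO order gives breadth-first, LIFO order gives depth-first.
        node = pending.popleft() if breadth_first else pending.pop()
        order.append(node)
        kids = children.get(node, ())
        # For the stack, push in reverse so the left child is processed first.
        pending.extend(kids if breadth_first else reversed(kids))
        peak = max(peak, len(pending))
    return order, peak

# A complete binary tree with seven nodes, used purely as an example.
tree = {"A": ("B", "C"), "B": ("D", "E"), "C": ("F", "G")}
```

On this small tree the breadth-first order is A, B, C, D, E, F, G with at most four nodes pending, while the depth-first order is A, B, D, E, C, F, G with at most three pending. The gap widens as the tree gets deeper, which mirrors the memory argument made above.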
2.1.2 RF Training Algorithm
The RF training algorithm is built upon the single DT training algorithm with modifications.
Firstly, RF introduces two sources of randomness to the DT training process in order to reduce
the correlation among the DTs in the ensemble. One source is that each DT is trained on
a random subset of the original training dataset (by sampling with replacement, typically
with the same size as the original dataset). The second source is that only a subset of the
attributes available is randomly selected at each node in producing candidate weak learners.
This is different from conventional DT training methods in which all attributes available will be
considered when searching for the optimal weak learner. Secondly, all the DTs in the RF ensemble
are fully grown without pruning. Pruning refers to a group of techniques used to limit the size
of a DT by removing a part of its tree nodes. They are used as a standard practice in training a
single DT classifier to avoid the overfitting problem by slightly increasing the bias of the classifier.
In the RF training process, however, the overfitting problem is suppressed by the introduction of
randomness to the ensemble; individual DTs in the ensemble are required to have minimum
bias in order to achieve low generalisation error [6].
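The two sources of randomness can be sketched as follows. This is a minimal illustration assuming Python's standard `random` module; the function names and the parameter name `subset_size` are hypothetical and chosen only for this example.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n instance indices with replacement (same size as the dataset)."""
    return [rng.randrange(n) for _ in range(n)]

def random_attributes(num_attributes, subset_size, rng):
    """Pick `subset_size` distinct attribute indices for one tree node."""
    return rng.sample(range(num_attributes), subset_size)

rng = random.Random(0)              # fixed seed so the sketch is repeatable
sample = bootstrap_indices(10, rng) # training sample for one DT
attrs = random_attributes(8, 3, rng)  # candidate attributes at one node
```

The bootstrap sample is drawn once per tree, whereas the attribute subset is redrawn at every node, so two nodes of the same tree generally consider different attributes.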
Common Model Parameters
During the training process, the set of parameters that most influence the characteristics of the
RF model are:
• $T$, the number of DTs in the ensemble.
• mtry, the number of candidate attributes randomly selected from all attributes available
at each node.
Theoretical analysis shows that as $T$ increases, the generalisation error of the ensemble converges
to a lower bound [6]. Further increase in $T$ will not improve the accuracy but will reduce the
computational efficiency. A set of empirical experiments has demonstrated the same outcome
[3]. mtry is the most commonly used tuning parameter in practice. It is recommended to
use $\sqrt{m}$ as a default value to start with, where $m$ is the total number of attributes available.
A smaller mtry results in a lower correlation among DTs in the ensemble but, at the same time,
weakens the predicting power of each individual DT. Breiman suggests in [6] that the generalisation
accuracy of RF improves with the predicting power of the DTs and degrades with
the correlation among the trees. A common practice is to use $\sqrt{m}$ as a starting point and
then lower and raise mtry until the best generalisation accuracy is obtained.
Out of Bag Error
Each DT in the ensemble is trained on a bootstrap sample (drawn by sampling with replacement) of
the original training dataset; the bootstrap sample has the same size as the original dataset.
During each sampling stage about 37% of the instances in the original dataset are not
selected, since $\lim_{n\to\infty}(1 - 1/n)^n = e^{-1} \approx 37\%$. These unselected instances
are called out-of-bag (OOB) data. A very useful
byproduct of the RF algorithm is that the OOB data can be used as a test dataset to evaluate the
quality of the RF model, so that it is not necessary to explicitly set aside a part of the original
training dataset for evaluation. It has been demonstrated that the error rate based on
OOB data (known as the OOB error) can accurately estimate the generalisation error [5]. In the
following chapters the accuracy of the models is measured using the OOB error.
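The 37% figure can be checked numerically. The sketch below compares the closed-form expression $(1 - 1/n)^n$ with the fraction of instances left out of one simulated bootstrap sample; the function names, the dataset size and the seed are illustrative assumptions.

```python
import math
import random


def expected_oob_fraction(n):
    """Probability that a given instance is never drawn in n draws."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, rng):
    """Fraction of instances left out of one bootstrap sample of size n."""
    chosen = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(chosen) / n


rng = random.Random(42)
analytic = expected_oob_fraction(10_000)
simulated = simulated_oob_fraction(10_000, rng)
limit = math.exp(-1)  # the limiting value e^-1
```

For a dataset of this size both values should lie close to $e^{-1}$.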
2.1.3 Very Fast Decision Tree
The RF/DT training algorithms introduced so far belong to the class of batch training algorithms.
They assume that all the training data used in the training process is available beforehand.
This is not the case in many applications where the training data arrives in batches or in the
form of streaming data. As a result, the RF/DT must be grown incrementally based on the
training data currently available. Very Fast Decision Tree (VFDT) [20], also known as the Hoeffding Tree, is
such an incremental DT training algorithm targeting streaming training data. The key theory
behind the Hoeffding Tree is that when a non-leaf node is split during the training process, the
optimisation of the objective function I in Eq. 2.5 can be performed with high confidence by
using only a subset of the training data partition $S_i$, as long as certain criteria are met. $S_i$
contains all the training instances that reach node i, and is ever growing in the context of
streaming training data. By allowing the search for the optimal weak learner to proceed based
on only a part of the training instances reaching partition $S_i$, the Hoeffding Tree enables
on-the-fly processing of the training data that is currently available. During the training process,
VFDT does not forget the information from the old training data. Therefore, in the case of
concept drift, old trees may have to be discarded in order to update the model.
The criterion that needs to be met at each node to guarantee a confident split is based on the
Hoeffding bound [18], which states that with probability $1 - \delta$, the true mean of a real-valued
random variable r is at least $\bar{r} - \epsilon$, where $\bar{r}$ is the mean value computed from n independent
observations and $\epsilon$ is defined as:
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \qquad (2.8)$$
Figure 2.5: Illustration of VFDT algorithm (stage A: grow tree on training instances, batch #1; stage B: new training instances, batch #2, traverse the tree to the temporary leaf-nodes, which become new root nodes)
where R is the range of the random variable r (for instance, the range of a probability is 1).
The bound holds regardless of the distribution of
the variable r. Now consider $\bar{r}$ as the difference in the Gini impurity reduction measurement,
$\bar{I}$, between $\theta_a$, the candidate weak learner with the highest score, and $\theta_b$, the candidate with the
second highest score, based on the training instances received so far. Then the true difference
r based on all the training instances reaching the node will be at least $\bar{r} - \epsilon$ with probability
$1 - \delta$. If $\bar{r} - \epsilon > 0$, then $r > 0$ and the best weak learner obtained so far is indeed the true one.
As a result, the criterion to be met for a confident split is:
$$\bar{I}_a - \bar{I}_b > \epsilon \qquad (2.9)$$
where $\bar{I}_a$ and $\bar{I}_b$ are the Gini impurity reduction measurements for $\theta_a$ and $\theta_b$ as defined in
Equation 2.5. The criterion acts as a stopping criterion determining whether the training
instances reaching a node so far can lead to a confident split or not. If the criterion is met, then
the recursive split continues as in the batch training algorithms; otherwise the node should wait
for more training instances to arrive before another split is attempted. The incremental training
process is illustrated in Figure 2.5, in which two stages alternate. In stage
A, a DT is grown to the point where no more leaf-nodes can be split. In stage B, new
training instances traverse the existing tree and reach one of the temporary leaf-nodes;
these leaf-nodes then become new root nodes. The algorithm then switches back to stage A, in
which new splits are attempted.
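The split test of Equations 2.8 and 2.9 can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the scores passed in stand for the Gini impurity reduction measurements $\bar{I}_a$ and $\bar{I}_b$, which are assumed to be computed elsewhere.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Equation 2.8: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def confident_split(score_best, score_second, value_range, delta, n):
    """Equation 2.9: split only if the score gap exceeds epsilon."""
    return (score_best - score_second) > hoeffding_epsilon(value_range, delta, n)


# Illustrative values only: R = 1, delta = 0.05.
eps_100 = hoeffding_epsilon(1.0, 0.05, 100)
eps_400 = hoeffding_epsilon(1.0, 0.05, 400)
wait = confident_split(0.30, 0.25, 1.0, 0.05, 100)    # gap too small so far
split = confident_split(0.30, 0.25, 1.0, 0.05, 1000)  # more instances seen
```

Because $\epsilon$ shrinks with $1/\sqrt{n}$, a node that cannot yet be split confidently may pass the test once more training instances have arrived.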
2.2 Related Works on Accelerating RF/DT Training Process
This section reviews previous works on accelerating not only the RF but also the single DT training
process on various hardware platforms. Intuitively, RF as an ensemble of DTs can incorporate,
with minor modification, the approaches proposed for single DT induction. Here the focus is
put on training with the Axes-parallel weak learner and numerical attributes. Approaches used
in the previous works fall into several categories. The first category targets the time-consuming
repetitive sorting. Strategies used include replacing the repetitive sorting with a one-off pre-sorting
or distributing each sorting task to parallel workers. Some approaches go one step
further by discarding the sorting completely and replacing it with certain data sampling or
approximation methods. The second category focuses on partitioning the workload involved
in the training process and distributing it to parallel workers. There are many layers of
parallelism in the RF/DT training process. The top layer is that the component DTs in the
ensemble can be trained concurrently. For a single DT, there are two different ways to partition
the workload. One is Task parallelism, in which the workload is partitioned with respect to the
nodes in a DT so that a set of workers can process different nodes (tasks) in parallel. The
other is Data parallelism, in which the group of workers together processes one node at a
time. Each worker is assigned a subset of the training dataset that
reaches the node. Depending on the way the dataset is partitioned, data parallelism can be
further divided into two categories. A Vertical partition means that the dataset is partitioned
with respect to the attributes, whereas a Horizontal partition refers to partitioning with
respect to the instances in the dataset. In both cases the workers find the local best weak
learner before a global comparison is performed to determine the global optimal one. In some
works a hybrid scheme is adopted in which different types of parallelism are used in different
training stages. In the following sections the previous works are introduced with respect to
the device platforms on which the implementations are built. The general approaches used by
the various works are summarised in Table 2.1 for comparison.
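As an illustration of vertical data parallelism, the sketch below lets each "worker" search one attribute column for its local best threshold and then performs the global comparison. It is a sequential software model written for clarity, not a reproduction of any cited implementation. The weighted Gini impurity reduction used as the score is the standard formulation and is assumed here to correspond to the objective referred to in the text; all names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_attribute(values, labels):
    """Local search on one attribute: returns (gain, threshold) or None."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    parent = gini(labels)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical neighbours give no candidate threshold
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if best is None or gain > best[0]:
            best = (gain, (lo + hi) / 2.0)
    return best


def vertical_parallel_split(columns, labels):
    """One worker per attribute column, followed by a global comparison."""
    local = []
    for attr, values in enumerate(columns):
        result = best_split_for_attribute(values, labels)  # one worker's job
        if result is not None:
            local.append((result[0], attr, result[1]))
    return max(local) if local else None


# Toy dataset: two attribute columns, four instances.
columns = [[1, 3, 2, 5], [12, 8, 1, 10]]
labels = [1, 1, 0, 1]
best = vertical_parallel_split(columns, labels)  # (gain, attribute, threshold)
```

In a horizontal partition the workers would instead hold disjoint subsets of the instances, so the statistics of each worker have to be combined before a threshold can be scored.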
2.2.1 GPP platforms
In SLIQ [34], Mehta et al. introduced a dedicated data structure for the training data, which
enables a training process that requires only a one-off sorting instead of repetitive sorting,
hence achieving a considerable reduction in training time. However, the scalability of SLIQ is
limited by the requirement to store in memory a part of the data that grows as the size of the
training dataset increases. Shafer et al. later proposed SPRINT [45]. SPRINT modifies the data
structure in SLIQ and removes the memory-bound limitation at the cost of an increase in the size of
the disk-resident data. In addition, a parallel training scheme based on horizontal partition was
also proposed in SPRINT. Each worker collects local statistics of its own data partition before
communicating with the others to aggregate the global sufficient statistics, which are needed for
searching for the optimal weak learner. Amado et al. [1] proposed a similar training algorithm
with the added capability to handle missing data. In this refinement work the algorithm
was also enhanced by the addition of a hybrid parallel training scheme which took advantage of both
task parallelism and data parallelism. Joshi et al. in [24] argued that the sequential hashing
scheme used in SPRINT requires a hash table of size O(n), where n is the number of training
instances. In the case where the hash table cannot fit in memory, repetitive access to the
disk-resident hash table would severely affect the training speed. To tackle the issue, a parallel
hashing scheme was introduced in ScalParC, in which the hash table and the hashing process
were partitioned and distributed to parallel processors with local memory. In PDT [27], a similar
horizontal partition strategy was used to distribute the workload. It differs from SPRINT and
ScalParC in that the sorting of each numeric attribute list of the entire training dataset is
replaced by a set of concurrent local sorts performed in the parallel workers. Due to the lack
of global sorting, the selection of candidate split thresholds $\tau$ is based on locally sorted attribute
values, while in SPRINT and ScalParC the candidates are the midpoints of consecutive values
in a globally sorted attribute list. Gehrke et al. introduced BOAT [12], in which splitting a node
is done in two stages. In the first stage a part of the statistics is collected from a set of bootstrap
samples of the original training data, by which the search space for the optimal attribute and
split threshold is narrowed down. Then in the second stage a fine-grain search is performed in
the refined space of the original training dataset. In addition, the sorting of numerical attribute
values is also removed. The candidate thresholds are determined by constructing an interval
with high confidence. The optimal threshold is found among the values falling into the interval.
The two-stage search avoids a complete scan of the training data partition at any node during
the training process. Together with the removal of the sorting process, this significantly reduces
the total training time. A similar idea is also adopted in CLOUDS [43] and SPIES [23],
both of which use a two-stage search. At the first stage both algorithms divide a numerical
attribute list into several intervals. In CLOUDS the intervals are constructed based on samples
of the training data, while in SPIES the intervals are evenly distributed in an attribute list. Then
different quality estimates based on samples of the training data are applied to the group
of intervals to remove those intervals with a low probability of containing the optimal split
point. At the second stage, a fine-grain search is performed on the refined group of intervals.
Statistics are collected from all the training data that fall into the promising intervals before a
local best weak learner is determined for an attribute list. The process is then repeated on all
attribute lists before a global optimal weak learner is found. SPIES also introduces a parallel
processing scheme based on the DT framework RainForest [13]. CLOUDS is parallelised using a
hybrid parallel processing scheme in pCLOUDS [48], in which the algorithm switches from data
parallelism (which is ideal for large nodes) to task parallelism when the size of the nodes becomes
smaller at deeper levels of a DT.

Table 2.1: Summary of the previous works on implementing RF/DT training process

Project name / Developer(s)      | General approach for acceleration
---------------------------------+------------------------------------------------------------
GPP platform                     |
SLIQ [34]                        | Replacing repetitive sorting with one-off sorting
SPRINT [45], ScalParC [24]       | Replacing repetitive sorting with one-off sorting;
                                 | Horizontal data parallelism
Amado et al. [1]                 | Replacing repetitive sorting with one-off sorting;
                                 | Task parallelism;
                                 | Horizontal data parallelism
PDT [27]                         | Horizontal data parallelism
BOAT [12], CLOUDS [43]           | Replacing sorting and linear optimal weak learner searching
                                 | with approximation based on sampling
pCLOUDS [48]                     | Replacing sorting and linear optimal weak learner searching
                                 | with approximation based on sampling;
                                 | Task parallelism;
                                 | Horizontal data parallelism
SPIES [23]                       | Replacing sorting and linear optimal weak learner searching
                                 | with approximation based on sampling;
                                 | Horizontal data parallelism
GPU platform                     |
Sharp [46]                       | Horizontal data parallelism
CudaRF [16]                      | DT-wise parallelism
Nasridinov et al. [38]           | Task parallelism
CudaTree [29], gpuRF [22]        | Task parallelism;
                                 | Horizontal data parallelism
FPGA platform                    |
Narayanan et al. [37],           | Vertical data parallelism
HC-CART [9]                      |
Rastislav et al. [28]            | Partitioning polynomial calculation;
                                 | DT-wise parallelism
Apart from the works above that are dedicated to improving the training speed, there are a few
GPP-based RF training tools that are self-contained and well developed. Popular tools include
randomForest [30] and Random Jungle [44] in the R project [19] and scikit-learn [41] in Python,
to name a few. These tools are widely used by researchers in the data mining, machine learning,
bioinformatics and other academic communities.
2.2.2 GPU platforms
In [46] Sharp proposed a GPU-based RF trainer written in HLSL. It uses Direct3D
and vertex shaders for the purpose of object recognition. A set of works has been proposed
using the CUDA programming interface. CudaRF [16] implements a coarse-grain parallel
training scheme on GPU where the RF ensemble is trained by a CUDA kernel with multiple threads,
each thread inducing one DT. Nasridinov et al. proposed in [38] another implementation that
takes advantage of task parallelism. Each thread in the CUDA kernel processes one node in the
DT. Liao et al. in [29] argue that the previous two works "seem to under-utilise the available
parallelism of graphics hardware...". They proposed CudaTree, an RF trainer featuring a hybrid
parallel training scheme that switches between data parallelism and task parallelism. The strategy
used is similar to the one in [1]. The switch criterion is determined based on a set of empirical
experiments. gpuRF [22] is another implementation that uses task parallelism to maximise
utilisation of the parallel workers. In gpuRF each block of threads is responsible for a node in
the DT and each thread evaluates one candidate split point.
2.2.3 FPGA platforms
Few works have been proposed for training RF/DT on FPGA platforms. In [37] an FPGA is used
to accelerate the task of searching for the optimal split point (weak learner) while the sorting
process is performed by the host computer. The parallel scheme is based on vertical data
parallelism: a set of parallel workers is instantiated in the FPGA, each targeting one particular
attribute list. This is followed by a comparison step to determine the global optimal weak
learner. Chrysos et al. proposed HC-CART [9], a GPP-FPGA-based heterogeneous system
targeting DT training problems. The system contains an FPGA-based coprocessor dedicated to
accelerating the split of categorical attribute lists (the system also supports numerical attributes,
which are handled in software). The coprocessor is built upon the Convey HC-1 server, which contains
multiple FPGAs and a scalar processor. The architecture takes advantage of the large memory
bandwidth between the FPGAs and the shared memory and scales up the parallel worker
network over all four FPGAs. Its parallel processing scheme is also based on vertical data
parallelism, similar to [37]. The local optimal weak learners determined by
the FPGAs are aggregated in a scalar processor located in the server before a global optimal
weak learner is determined. In [49] Rastislav et al. proposed an FPGA architecture that
accelerates the training process of a DT ensemble based on bagging [4] (see footnote 3). The architecture
implements HereBoy [28], an evolutionary algorithm that induces oblique DTs. The hardware
implementation does not take advantage of either task parallelism or data parallelism; instead,
the oblique weak learners involved in the training algorithm are decomposed and calculated in
parallel by a parallel processing module. In addition, an optional second layer of parallelism
is also in place, in which multiple DTs can be induced at the same time. The differences among
the FPGA-based works introduced above are compared in Table 2.2.

Table 2.2: Difference among various FPGA-based implementations

Project name /          | Part mapped on FPGA        | Attribute type        | Weak learner used
Developer(s)            |                            | supported in hardware |
------------------------+----------------------------+-----------------------+-------------------
Narayanan et al. [37]   | Searching of optimal split | numerical             | axes-parallel
HC-CART [9]             | Searching of optimal split | categorical           | axes-parallel
Rastislav et al. [49]   | Complete training          | numerical             | oblique/non-linear
Proposed                | Complete training          | numerical             | axes-parallel
2.3 Conclusion
In this chapter the algorithms that are directly related to the proposed works are introduced in
detail, including the original RF training algorithm and VFDT, a single DT training algorithm
targeting streaming training data. Also introduced is a series of implementations of the DT/RF
training process that are designed for various devices. For GPP implementations, a diverse range of
approaches has been taken to improve the RF training speed, including not only exploiting the
inherent parallelism in the algorithm but also replacing time-consuming components of the
algorithm with simplified solutions. In contrast, GPU and FPGA implementations tend to
focus on efficient distribution of the workload among parallel processing elements, partly due to
the fact that GPUs and FPGAs can easily allow large-scale parallel processing that multi-core
GPPs cannot match. It is also observed from Table 2.2 that, compared with GPP and GPU
implementations, FPGA-based systems are more often used as coprocessors to accelerate a
part of the training steps. The rest of the training steps are still performed in software. The
advantage of FPGAs in Performance/Power consumption ratio may not be well reflected in such
hybrid systems. Although the work in [49] is a self-contained design based solely on FPGA,
it targets the oblique/non-linear weak learner, which is less commonly used than the axes-parallel
learner. Our proposed architectures aim to address this issue by keeping the complete
training process on the FPGA side and targeting the axes-parallel weak learner.
Footnote 3: RF ensembles are also based on bagging, but with an additional step of random attribute selection.
Chapter 3
Hardware Framework and Task
Parallelism Based Architecture
In this chapter a novel FPGA-based hardware framework is introduced for the RF training
process. The framework is designed based on the CART method and follows the breadth-first
approach. It contains a set of optimised hardware components that implement different steps
in CART and allows a flexible arrangement of these components to exploit different parallel
processing schemes. A training architecture based on one such scheme is also proposed in
this chapter. It adopts the task parallelism approach, which uses multiple processing elements
to process independent workloads located at the same level of a decision tree.
The chapter is structured as follows. Firstly the hardware framework for the training process
is presented, with a detailed description of the workflow in hardware as well as the architectures
of the various components contained in the framework. The task parallelism based training
architecture is then presented, also with details of the hardware design. This is followed by an
evaluation section which assesses the training architecture.
3.1 Hardware Framework for RF Training Process
3.1.1 Overview of Workflow
It is assumed that the training dataset is initially stored in an external memory. Before
the training starts, the entire dataset is moved from the external memory to FPGA embedded
memory. The decision trees that comprise the RF ensemble are produced in a sequential manner.
Each decision tree is trained on a subset of the original training dataset drawn by sampling
with replacement. When the training begins, each decision tree is induced in a cyclic workflow
as shown in Figure 3.1.
In the example shown in the figure, the sampled dataset contains four instances, each with
two attributes; an additional column contains the labels indicating the class that each instance
belongs to. The training process starts by initialising the first training data partition, which
contains all the training instances in the subset. Then a part of the data is extracted from the
partition after randomly selecting one of the attributes available. At this point it is assumed
that the parameter mtry is fixed to one, meaning that only one attribute is selected. In the
example, attribute #2 is selected. The resulting data list contains the index, the
selected attribute value and the label for each training instance in the partition. The data list
is then sent to a sorting module where the entries in the list are sorted with respect to the
selected attribute values (attribute #2 in the example).
The output is then fed into a split module, where the search for the optimal
weak learner is performed. The search aims to find an optimal split point so that the entries
in the data list fall into one of two branches according to the indicator function $[x_a \geq \tau]$, where
$x_a$ is the value of the selected attribute. The optimal threshold $\tau$ is selected from a group of
candidates produced by taking the midpoint of each pair of adjacent, non-identical attribute values in
the data list. In the example, $\tau$ is equal to 9, which is selected from the candidate set {4.5, 9, 11}.
At this point the original training data partition can be split into two new partitions according
to the split point found for the data list. In the example, one new partition contains instances
#2 and #3 and the other contains instances #1 and #4. New partitions are placed in a
queue, waiting for further processing. On loading a new partition, it is checked whether a further
split is necessary. A partition with only one type of label can no longer be split. The iteration
continues until there are no more partitions left in the queue, which indicates the completion
of the training of a single DT. After that the framework proceeds to the next DT if the
required number has not yet been reached.
Figure 3.1: Workflow of DT training process (original training dataset, sampling with replacement, partition queue, random attribute selection and data extraction, sorting, split into two new partitions; the loop ends when the queue is empty)
Figure 3.2: Top-level architecture of training framework (external memory, memory array, task allocation module, processing elements each comprising a sorting module and a split module with their task queues, results collection module)
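The candidate-threshold rule and the split in the worked example can be reproduced with a short Python sketch. The data are taken from the example of Figure 3.1 (attribute #2); the function names are hypothetical, and the threshold 9 is simply the value quoted in the text rather than the result of a search.

```python
def candidate_thresholds(sorted_values):
    """Midpoints of adjacent, non-identical values in a sorted list."""
    return [(a + b) / 2.0
            for a, b in zip(sorted_values, sorted_values[1:])
            if a != b]


def split_partition(data_list, tau):
    """Apply the indicator [x_a >= tau] to sorted (index, value, label) entries."""
    below = [idx for idx, value, _ in data_list if value < tau]
    at_or_above = [idx for idx, value, _ in data_list if value >= tau]
    return below, at_or_above


# Unsorted data list of the example: (index, attribute #2 value, label).
unsorted = [(1, 12, 1), (2, 8, 1), (3, 1, 0), (4, 10, 1)]
data_list = sorted(unsorted, key=lambda entry: entry[1])    # sorting module
candidates = candidate_thresholds([v for _, v, _ in data_list])
part_a, part_b = split_partition(data_list, 9.0)            # tau from the text
```

The two index lists come out in sorted-attribute order, which is also the order in which the indices are later written back for the new partitions.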
The framework is designed based on the breadth-first training approach, i.e. processing with
priority the training data partitions located at the same level in a DT. In the example, partition
#1 is at level 0 (root node), its branches, partitions #2 and #3, are at level 1, and so on.
Since the workloads derived from the partitions at the same level are independent of each
other, they can be processed in parallel. This approach differs from the depth-first approach, in
which consecutive workloads are dependent. The breadth-first approach is considered more suitable
for the proposed framework since it allows both task parallelism and data parallelism, while
the depth-first approach only supports data parallelism.
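A software model of the breadth-first workflow is sketched below. It is a simplified illustration and not the hardware design: mtry is fixed to one as in the example, the split score is the weighted Gini impurity of the two branches (assumed here as the CART criterion), and a partition whose selected attribute is constant is simply left as a leaf, which is an assumption of this sketch. All names are hypothetical.

```python
import random
from collections import Counter, deque


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(entries):
    """entries: (index, value, label) sorted by value. Returns tau or None."""
    n = len(entries)
    best_tau, best_score = None, None
    for i in range(1, n):
        lo, hi = entries[i - 1][1], entries[i][1]
        if lo == hi:
            continue
        left = [lab for _, _, lab in entries[:i]]
        right = [lab for _, _, lab in entries[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best_score is None or score < best_score:
            best_tau, best_score = (lo + hi) / 2.0, score
    return best_tau


def train_tree_breadth_first(attributes, labels, indices, rng):
    """Return (level, attribute, tau, n_instances) for every split node."""
    queue = deque([(0, list(indices))])          # partition #1 at level 0
    splits = []
    while queue:                                 # "queue empty?" check
        level, part = queue.popleft()            # load next partition
        if len({labels[i] for i in part}) <= 1:  # one label type: no split
            continue
        attr = rng.randrange(len(attributes))    # mtry fixed to one
        entries = sorted(((i, attributes[attr][i], labels[i]) for i in part),
                         key=lambda e: e[1])     # sorting module
        tau = best_threshold(entries)            # split module
        if tau is None:                          # assumption: keep as a leaf
            continue
        below = [i for i, v, _ in entries if v < tau]
        above = [i for i, v, _ in entries if v >= tau]
        splits.append((level, attr, tau, len(part)))
        queue.append((level + 1, below))         # two new partitions
        queue.append((level + 1, above))
    return splits


attributes = [[1, 3, 2, 5], [12, 8, 1, 10]]      # zero-based instance indices
labels = [1, 1, 0, 1]
splits = train_tree_breadth_first(attributes, labels, range(4), random.Random(0))
```

Because new partitions are appended to the back of the queue, all partitions of one level are processed before any partition of the next level, which is what makes the workloads at a given level independent of each other.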
3.1.2 Top-level Architecture
The top-level architecture of the training framework is shown in Figure 3.2. It consists of
five major components: the Task allocation module, Memory array, Sorting module, Split
module and Results collection module. The five components together implement the workflow
shown in Figure 3.1. The memory array stores the original training dataset. The sorting module sorts
the data list extracted from the training data partition with respect to the selected attribute.
The partition is split and new partitions are produced in the split module. The workload in the
framework is managed by the task allocation module. Finally, the training results are collected
in the results collection module.
The workload involved in the training process is managed as Tasks. A task is defined as the
workload derived from a training data partition that appears in the cyclic workflow. In the
architecture a task queue is attached to each of the task allocation module, the sorting module and
the split module. The task allocation module loads a task from its queue and assigns it to
the sorting module. When the job is completed in the sorting module, the same task is passed
to the split module, while the sorting module becomes ready for the next task. Hence all three
modules work as a pipeline. When the split is done, the split results (i.e. training results) are
collected in the results collection module. Based on the results, two new tasks are generated
representing the newly produced partitions. These new tasks are placed in the queue of the
task allocation module, waiting for the next round of processing.
The sorting module and the split module are combined as a processing element (PE). The
training framework is flexible enough to adopt parallel processing based on both task parallelism and
data parallelism. In either case, multiple PEs can be instantiated to handle independent
workloads. Depending on the parallel scheme used, the task allocation module and the results
collection module will have different hardware designs for distributing the tasks and collecting
results.
Details on the memory array, sorting module and split module will be given in the following
sections. The architectures of the task allocation module and results collection module vary for
different parallel schemes, hence their details will be given in dedicated sections introducing the
training architectures.
3.1.3 Memory Array
Memory array is where the training dataset is stored after being imported from the external
memory. A dataset contains a group of training instances, each instance consisting of d attribute
values plus one label value. During the RF training process, each DT in the RF ensemble is
trained on a random subset of the dataset. During the training process for each DT, data
relating to a partition is loaded from the memory array and is sent to the processing element.
The architecture of the memory array is shown in Figure 3.3. It contains a pair of memory units,
RAM unit A and B, each storing a different part of the training instances. RAM unit A stores
the attribute values and RAM unit B stores the label values. The read/write addresses
for both RAM units are generated by the same pair of address generation components.
RAM unit A is further divided into d blocks, where d is the number of attributes in the training
instances. As shown in the figure, each block stores the values of all the training instances with
respect to one particular attribute. Consider an example where a training dataset contains 512
instances, each with 256 attributes. When it is imported to the memory array, the data arrives
in 257 batches, each batch having 512 entries. The first 256 batches containing attribute values
are sent to their corresponding blocks in RAM unit A and the last batch containing label values
is sent to RAM unit B.
A write address generated in the memory array comprises two parts: an MSB part containing the d
most significant bits of the address and an LSB part containing the n least significant bits (d and n
here denote the widths of the two fields in bits). The total
length of the address is therefore d + n bits. For RAM unit A, the MSB part determines which
block is being accessed and the LSB part determines the specific location in the block that the data
is written to. For RAM unit B, since it is not separated into multiple blocks and stores just
one batch of data, only the LSB part is used as the write address for the unit. The MSB and
LSB parts are driven by separate counters which increment accordingly when new data is sent
to the memory array from the external memory.
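The address format can be modelled in software as a simple bit concatenation. The sketch below assumes that each field is just wide enough to index all blocks or all instances, and that the LSB counter runs through one batch before the MSB counter advances; both points are assumptions made for illustration, as are the function names.

```python
def address_widths(num_attributes, num_instances):
    """Field widths in bits needed to index all blocks and all instances."""
    msb_bits = max(1, (num_attributes - 1).bit_length())
    lsb_bits = max(1, (num_instances - 1).bit_length())
    return msb_bits, lsb_bits


def write_addresses(num_attributes, num_instances):
    """Yield (block, offset, address) in arrival order for RAM unit A."""
    _, lsb_bits = address_widths(num_attributes, num_instances)
    for block in range(num_attributes):        # MSB counter: one batch each
        for offset in range(num_instances):    # LSB counter: entries in batch
            yield block, offset, (block << lsb_bits) | offset


# Example from the text: 512 instances with 256 attributes each.
msb_bits, lsb_bits = address_widths(256, 512)
last_address = (255 << lsb_bits) | 511         # last entry of the last block
small = [addr for _, _, addr in write_addresses(2, 4)]
```

For RAM unit B only the offset part would be used, since the labels form a single batch.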
Figure 3.3: Architecture of memory array (RAM unit A holding the attribute values in blocks #1 to #d, RAM unit B holding the label values, write address generator with MSB and LSB counters, read address generator driven by the random attribute selection and the index FIFO)
The read addresses for the RAM units have the same format as the write address but are
driven in a different way. According to the workflow shown in Figure 3.1, the data required for
the sorting step and the split step is the unsorted data list which, as shown in the figure, is
extracted from the training data partition that is being processed. The data extracted contains the
values of the selected attribute, the label values and the indices of the training instances contained
in the partition. The MSB part of the read address is determined by the randomly selected
attribute. The LSB part, which indicates which training instance is to be read, is stored in a
different memory space. The idea is similar to a pointer in a computer system. As
shown in Figure 3.3, there is a dedicated memory component, the index FIFO (which is located in
the split module), storing the indices of the training instances contained in a partition. When
the PE extracts data relating to a partition, it first reads the indices from the index FIFO. Each
index read forms the LSB part and is concatenated with the MSB part to form the complete
read address.
The indices in the index FIFO are initialised at the beginning of each DT training process.
Since each DT in the RF ensemble is trained on random samples (with replacement) of
the original training dataset, the sampling is performed by generating a set of indices with a
random integer generator and storing them in the index FIFO. These indices comprise the first
partition in the cycled workflow. After the split is done, each index in the partition falls
into one of the two new partitions. The indices are pushed back to the index FIFO but
in a different order, i.e. all the indices that go to the first new partition are pushed into the
index FIFO before the indices that go to the second new partition. When a task
is assigned to a PE, it informs the PE how many instances are contained in the partition
so that the PE loads the correct number of indices from the index FIFO. Take the
example in Figure. 3.1 for clarification: partition #1 before the split contains training instances
#1-#4; after the split, instances #2 and #3 go to partition #2 and the rest go to partition #3.
The indices of the instances are then pushed back to the index FIFO in the order 3, 2, 4,
1. In the next iteration, when partition #2 is sent in for a split, i.e. as a new task, the PE is
informed that there are two instances in the partition. The PE then reads two consecutive
entries from the index FIFO, which are the correct indices for the instances in partition #2.
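The life cycle of the index FIFO can be modelled in software as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the bootstrap draws as many indices as there are instances, and the order inside each child group is simply the order in which indices are popped (in the hardware it depends on the sorted order, as in the 3, 2, 4, 1 example).

```python
import random
from collections import deque


def bootstrap_indices(num_instances, rng):
    """Sample instance indices with replacement to form the first partition."""
    return deque(rng.randrange(num_instances) for _ in range(num_instances))


def push_back_after_split(fifo, size, goes_first):
    """Pop `size` indices of one partition and push them back grouped by child.

    Returns the sizes of the two children, which travel with the new tasks.
    """
    popped = [fifo.popleft() for _ in range(size)]
    first = [i for i in popped if goes_first(i)]
    second = [i for i in popped if not goes_first(i)]
    fifo.extend(first + second)
    return len(first), len(second)


# Partition #1 holds instances 1..4; instances 2 and 3 go to the first child.
fifo = deque([1, 2, 3, 4])
sizes = push_back_after_split(fifo, 4, lambda i: i in (2, 3))
```

After the call the first two entries of `fifo` belong to the first child, so a PE that is told the size of that child can read exactly its indices.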
Figure 3.4: Architecture of FIFO-based merge sorter
3.1.4 Sorter Module
The data read from the memory array are sorted in the sorting module with respect to the
values of the attribute selected. The sorting module contains a cascaded FIFO-based merge
sorter. FIFO-based merge sorter features fast output throughput and low hardware resource
utilisation. It is considered one of the most suitable sorter types for FPGA [33]. In this section
a novel FIFO-based merge sorter is proposed to perform the sorting step. The sorter features a
50% reduction in the memory resource utilisation when compared to a straightforward design.
The architecture of the sorter is illustrated in Figure. 3.4. The sorter contains a cascaded
structure comprising several stages, each being able to merge two 2^n-entry sorted sequences
into one 2^(n+1)-entry sorted sequence, where n = 0, 1, 2, ... refers to the stage number. When
the stages are combined, an n-stage sorter is able to sort a sequence of at most 2^n entries. All
the stages work in a pipelined fashion, meaning that the sorter is able to sort multiple
sequences at the same time.
The workflow for the merge at a stage is explained as follows. Consider a scenario where two
sorted sequences, each having k entries, are merged. The first sequence that arrives at the stage
is referred to as sequence A and the other one is referred to as sequence B. The RAM block is
first filled by sequence A. The merge starts on the arrival of the first entry of sequence
B. Starting from the first entry in both sequences, a comparison is performed between the two:
the smaller one is exported to the next stage, while the bigger one proceeds to be compared
to the next entry from the counterpart sequence. The comparison continues until one of the
sequences is completely exported to the next stage; the remaining entries in the other
sequence are then exported directly without further comparison.
Figure 3.5: Two virtual FIFOs in a single RAM block
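The merge step and the cascade of stages can be expressed as a small software model. It is a behavioural sketch only: it does not model the pipelining or the RAM/FIFO storage, and exporting the entry from sequence A on a tie is an assumption.

```python
def merge(seq_a, seq_b):
    """Merge two sorted sequences as one stage does: export the smaller head each step."""
    out, i, j = [], 0, 0
    while i < len(seq_a) and j < len(seq_b):
        if seq_a[i] <= seq_b[j]:
            out.append(seq_a[i])
            i += 1
        else:
            out.append(seq_b[j])
            j += 1
    # One sequence is exhausted: export the rest without further comparison.
    out.extend(seq_a[i:])
    out.extend(seq_b[j:])
    return out


def cascaded_sort(values):
    """Software model of the cascade: stage n merges runs of length 2**n."""
    runs = [[v] for v in values]
    while len(runs) > 1:
        merged = [merge(runs[k], runs[k + 1]) for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:  # an unpaired run passes through unchanged
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []
```

Each pass of the `while` loop corresponds to one stage, so a sequence of 2^n entries passes through n stages.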
The novelty of the sorter lies in the way the two input sequences are stored at each
stage. A straightforward design would use two FIFO blocks with depth k to hold the two input
sequences, as shown in the lower part of Figure. 3.5; a third FIFO block with depth k (not
shown in the figure) is needed to buffer the output sequence from the previous stage in order to
avoid interruption to the data flow. In the proposed design a RAM block with k words along with
a FIFO with depth k/2 are used to store the same input sequences. The idea is to use a single
RAM block to accommodate two virtual FIFOs, as illustrated in the upper part of Figure. 3.5. It
is based on the observation that the sizes of the two virtual FIFOs change dynamically during the
merge but the total size never exceeds k if the input throughput and the output throughput
at each stage are the same. During the merge, entries from either sequence A or B could
be exported. Whenever an entry is exported, its memory location becomes
available for a newly arrived entry from sequence B, hence the virtual FIFOs have changing
depth. However, such an arrangement does not allow a straightforward mapping into hardware.
Due to the unpredictability of the comparison results, the location to which a new arrival is to
be written is not known a priori. As a result the read addresses for the entries in sequence B
would be completely random. The solution in the design is to use a buffer to hold a new arrival
until a predictable memory location becomes available, and then to move the arrival to that
location. In this case a predictable memory location refers to a consecutive location in the
virtual FIFO B.
Figure 3.6: Example of buffering new input during a merge
To better elaborate the design, consider the example in Figure. 3.6 in which a 12-word RAM
block is shown. Virtual FIFO A and B store the two input sequences A and B respectively. At
this point va1-va6 from sequence A and vb1-vb4 from sequence B have been exported to the next
stage and their locations have been re-allocated to the following entries from sequence B. Now
the next comparison will be performed between va7 and vb5. Assume that va7 will be exported,
making its location the next available location for the new arrival vb11. To keep the location
predictable, the desired location for vb11 is the one next to vb10, but it is still occupied by
vb5. As a result the new arrival is stored in the buffer until the desired location becomes
available, and vb11 is then moved to it. The maximum depth needed in
the buffer is equal to half the size of the RAM block, therefore the total memory space needed
for the input sequences is lowered by half when compared to a straightforward design.
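The 50% figure follows directly from the word counts given above for a stage that merges two k-entry sequences. The symbols M_straightforward and M_proposed are introduced here only for this illustration:

```latex
\begin{align*}
M_{\text{straightforward}} &= \underbrace{k + k}_{\text{FIFO A, FIFO B}}
                            + \underbrace{k}_{\text{input buffer FIFO}} = 3k, \\
M_{\text{proposed}} &= \underbrace{k}_{\text{RAM block}}
                     + \underbrace{k/2}_{\text{buffer FIFO}} = \tfrac{3}{2}k, \\
\frac{M_{\text{proposed}}}{M_{\text{straightforward}}} &= \frac{1}{2}.
\end{align*}
```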
Note that the input to the sorting module is the Unsorted datalist which is illustrated in
Figure. 3.1. Each entry in the list contains the index, the value of the attribute selected and
the label of a training instance in the training data partition. During the merge at each sorting
stage, only the attribute values are used for comparisons, the rest are simply attached to the
attribute values. After being sorted with respect to the attribute values, the Sorted datalist
enters the split module in which the search for an optimal split is performed.
Figure 3.7: Architecture of split module
3.1.5 Split Module
The splitting step shown in Figure. 3.1 is performed in the split module, in which the training
data partition under current process is split into two new partitions. The module takes as
input a data list which is previously sorted with respect to the values of the attribute selected.
Each entry in the list represents a training instance in the training data partition. A search is
performed to find an optimal split point among the entries in the data list, so that the resulting
statistic maximises the objective function defined in Equation. 2.5. The output of the module
is a set of data relating to the optimal split point as well as the newly produced partitions.
These data are collected by the results collection module.
In Figure. 3.7 the architecture of the split module is shown. A split is done according to the
indicator function [x_a ≥ τ], where x_a is the value of the attribute selected and the threshold τ
is selected from a group of candidates. The candidates are produced by taking the median of
two adjacent, non-identical attribute values contained in the sorted data list. Therefore each
arrival of an entry whose attribute value differs from that of the previous entry produces a
candidate split point (except for the first entry). As a result the search starts on arrival of the
second entry of the sorted data list. The new partitions to be generated are referred to as left
branch and right branch respectively. The entries that have arrived are considered as members
of the right branch while the entries yet to come comprise the left branch. When an entry
in the data list arrives, the index is directly stored in Index FIFO. The content in the index
FIFO is used in the generation of the read address in the memory array. The use of the FIFO
is described in detail in Section. 3.1.3. The attribute value is used to produce the candidate
threshold in the Threshold generation component.
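Candidate generation as described above can be sketched as a generator over the sorted attribute values. The function name and the use of floating-point values are assumptions made for readability; the hardware works on fixed-point words.

```python
def candidate_thresholds(sorted_values):
    """Yield (position, threshold) for each pair of adjacent, non-identical values.

    `position` is the number of entries that had arrived before the entry that
    triggered the candidate.
    """
    for pos in range(1, len(sorted_values)):
        prev, cur = sorted_values[pos - 1], sorted_values[pos]
        if cur != prev:
            yield pos, (prev + cur) / 2


candidates = list(candidate_thresholds([1.0, 1.0, 2.0, 4.0]))
```

The repeated value 1.0 produces no candidate, so only two thresholds are generated for the four entries.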
The label value is sent to the Split quality measurement component where the quality of current
candidate split point is assessed. The quality of the split (or the weak learner) is measured
by the objective function Equation. 2.5 which is based on Gini impurity in the partitions
(Equation. 2.6) in this work. In order to calculate the measurement more efficiently in hardware,
instead of implementing the two equations directly, an optimised measurement s(·) that is
proposed in [37] is used in the architecture with the assumption that there are no more than
two different labels:
\[
s(\cdot) = \frac{n_{L,i} \cdot n_{L,j}}{n_{L,i} + n_{L,j}}
         + \frac{n_{R,i} \cdot n_{R,j}}{n_{R,i} + n_{R,j}}
\tag{3.1}
\]
where each n refers to the number of entries in a branch (left L or right R) with a particular
label (label i or label j). These variables are counted in a pair of label histogram counters,
Label histogram left/right. The histograms are initialised before arrival of the first entry.
Since at the beginning all the entries are considered to be in the left branch, n_{R,i} and
n_{R,j} are initialised to zero while n_{L,i} and n_{L,j} are initialised by taking values from
Label histograms storage. The component stores the label histograms for all the training data
partitions in the queue. To avoid division in the hardware, all possible results of
1/(n_{L,i} + n_{L,j}) and 1/(n_{R,i} + n_{R,j}) are
pre-calculated and stored in a ROM block, 1/n ROM. The size of the ROM is equal to the
number of instances in the training dataset since the maximum possible size of a training data
partition is equal to that of the training dataset.
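The histogram update and the division-free evaluation of Equation 3.1 can be sketched as below. The function names are hypothetical, floating-point reciprocals stand in for the fixed-point ROM contents, and for brevity the measurement is evaluated after every arrival rather than only at candidate split points.

```python
def build_reciprocal_rom(max_instances):
    """Table of 1/n for n = 0..max_instances; entry 0 is a placeholder set to 0."""
    return [0.0] + [1.0 / n for n in range(1, max_instances + 1)]


def split_quality(left, right, rom):
    """Evaluate Equation 3.1 from two-label histograms, using lookups instead of division."""
    (li, lj), (ri, rj) = left, right
    return li * lj * rom[li + lj] + ri * rj * rom[ri + rj]


def scan_labels(labels, rom):
    """Move entries one by one from the left to the right histogram, recording s each time."""
    left = [labels.count(0), labels.count(1)]  # initially every entry is on the left
    right = [0, 0]
    scores = []
    for label in labels[:-1]:  # moving the last entry would empty the left branch
        left[label] -= 1
        right[label] += 1
        scores.append(split_quality(left, right, rom))
    return scores


rom = build_reciprocal_rom(4)
scores = scan_labels([0, 0, 1, 1], rom)
```

With this expression a split that leaves both branches pure evaluates to zero, which is what happens at the second position in the example.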
The temporary best threshold and label histograms are stored in Best threshold and Best
histogram left/right respectively. The values stored get updated whenever a better split quality
measurement is obtained. At the end of the search, when the last entry in the input data list
arrives, the values remaining in the two components represent the optimal split point and are
output to the results collection module. An additional output, the Leaf/Non-leaf flag, is also
sent; it indicates whether the newly generated partitions need to be further split or not. A
partition with only one type of label is not split further and is considered a leaf. Meanwhile
the best label histograms are now also the initial label histograms for the two newly generated
partitions. The histograms are written back to the label histograms storage, waiting to be
loaded in the following iterations.
The variables involved in the calculation are in the format of custom fixed point. The fraction
part length is determined so as to have enough precision to preserve the minimum gap between
the sorted attribute values in the training data and the integer length is set based on the
absolute maximum value in the training data.
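One way to derive such a format from a dataset is sketched below. This is an assumption about how the rule could be applied, not the procedure used in this work: the function names are hypothetical, the sign bit is not counted, and truncation is used for quantisation.

```python
import math


def fixed_point_format(values):
    """Return (integer_bits, fraction_bits) for an unsigned-magnitude format.

    integer_bits covers the largest absolute value; fraction_bits is the smallest
    count whose resolution 2**-f does not exceed the minimum gap between
    distinct values.
    """
    max_abs = max(abs(v) for v in values)
    integer_bits = math.floor(math.log2(max_abs)) + 1 if max_abs >= 1 else 1
    distinct = sorted(set(values))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    min_gap = min(gaps) if gaps else 1.0
    fraction_bits = max(0, math.ceil(-math.log2(min_gap)))
    return integer_bits, fraction_bits


def quantise(value, fraction_bits):
    """Truncate a value to an integer number of 2**-fraction_bits steps."""
    return math.floor(value * (1 << fraction_bits))


data = [0.5, 0.75, 3.0]
int_bits, frac_bits = fixed_point_format(data)
codes = [quantise(v, frac_bits) for v in data]
```

Because the resolution is no coarser than the smallest gap, distinct input values map to distinct codes, which is the property the sorter relies on.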
3.2 Task Parallelism Based Architecture
In the previous section, the hardware framework for the Random Forest (RF) training process was
introduced. Some of the major components in the framework, including the memory array and the
processing element (PE), were also described in detail. The other components, the task allocation
module and the results collection module, were not introduced in detail since their architectures
vary depending on the parallel processing scheme adopted for the training architecture that is
built upon the framework.
In this section an RF training architecture based on a task parallelism scheme is introduced.
The workload involved in the decision tree (DT) training process is separated with respect to
the tree nodes (i.e. training data partitions). These independent workloads are processed by
different PEs in parallel. The task allocation module and the results collection module are
specially designed to support this parallel scheme. In the following sections details of the two
modules will be given.
Figure 3.8: Task parallelism based workflow of DT training process
3.2.1 Overview of Task Parallelism Scheme in Hardware
An updated workflow that adopts task parallelism is illustrated in Figure. 3.8 (note that
Sorting, Sorted data list and Split appearing in the original workflow are now replaced with
PE). The difference from the workflow in Figure. 3.1 is that instead of processing the partitions,
i.e. tasks, in a sequential manner, the partitions in the Queue are now loaded continuously as
long as the queue is not empty. After randomly selecting the attribute to be assessed, the unsorted
data list extracted from the partition is allocated to one of the Local queues, each of which is
connected to an independent PE. When the PE completes its work, the new partitions generated
after the split are aggregated with the partitions from other PEs. These partitions are then sent
back to the queue in the same order in which their parent partitions were loaded.
Which PE the input data list is allocated to is dependent on which attribute has been selected
for the data list. Each PE in the architecture is designed to be able to process a data list within
a specific range of attributes. For instance, if there are two PEs in place and the training data
contains 10 attributes in total, then PE #1 will be responsible for data list with attribute
#1-#5 and PE #2 will be responsible for data list with attribute #6-#10.
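The attribute-to-PE mapping in this example can be written as a one-line rule. The function name is hypothetical, and the use of equal contiguous ranges with ceiling division for attribute counts that do not divide evenly is an assumption; the text only gives the evenly divisible case.

```python
def pe_for_attribute(attribute, num_attributes, num_pes):
    """Map a 1-based attribute number to a 1-based PE number using contiguous ranges."""
    per_pe = -(-num_attributes // num_pes)  # ceiling division
    return (attribute - 1) // per_pe + 1


# Two PEs, ten attributes: attributes 1-5 map to PE 1 and 6-10 to PE 2.
mapping = [pe_for_attribute(a, 10, 2) for a in range(1, 11)]
```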
The task allocation is implemented in the task allocation module and the data aggregation
is implemented in the results collection module. Details of the modules will be given in the
following sections.
3.2.2 Task Parallelism Based Task Allocation Module
Figure. 3.9 depicts the architecture of the task allocation module. The main component in the
task allocator is a FIFO structure that is used for queuing the tasks. In hardware each entry in
the task queue is a package of data that informs the PE of the specification of the training data
partition. On receiving the package, the PE knows what data should be collected from the
memory array and other components in order to perform the sorting and split. Each entry in
the task queue comprises three parts of information. The Size indicates the number of instances
contained in the partition, the Label histogram is used to initialise the label histograms in the
split quality measurement component in the split module, and the Parent PE indicates which PE
processed the parent partition that produced the current partition. The reason for having this
information is that some data stored in the parent PE must be loaded while processing the
current partition.
Figure 3.9: Architecture of task parallelism based task allocation module
The first task in the queue is initialised as follows. Since the first partition in the cycled workflow
is the entire subset sampled from the original training dataset, the first task is generated by an
Initialisation component that collects data from the first partition including the size and the
label histograms. The parent PE for the first task is always set to be zero i.e. the first PE in
the group. When a task is ready for allocation, an attribute is randomly selected for it, based
on which the task is pushed into the local queue of one of the PEs. For the rest of the tasks,
the data is collected from the results collection module.
Figure 3.10: Architecture of task parallelism based results collection module
3.2.3 Task Parallelism Based Results Collection Module
The architecture of the results collection module is shown in Figure. 3.10. All the results from
different PEs are aggregated at the collector; the collector then pushes the new task information
from each PE into the queue in turn and at the same time exports the training results. The
PEs are collected in the order in which the tasks have been allocated. When the attribute
selector in the task allocation module selects a PE, the selected PE number is immediately
pushed into a PE selection order FIFO. The results collection module then reads the FIFO and
waits to collect the results from the specified PE before proceeding to collect from the next one.
The number also serves as a part of the new task data (i.e. Parent PE number), indicating the
location of the indices for the new partitions.
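The in-order collection enforced by the PE selection order FIFO can be modelled as follows. The class and method names are hypothetical and the model is purely behavioural: results that finish early are held until the PE at the head of the FIFO has delivered.

```python
from collections import deque


class ResultsCollector:
    """Collect PE results in the order in which tasks were allocated."""

    def __init__(self):
        self.selection_order = deque()  # PE numbers, pushed at allocation time
        self.pending = {}               # PE number -> deque of finished results

    def on_allocate(self, pe):
        self.selection_order.append(pe)

    def on_result(self, pe, result):
        self.pending.setdefault(pe, deque()).append(result)

    def collect(self):
        """Return (parent_pe, result) pairs that can be released without reordering."""
        released = []
        while self.selection_order:
            pe = self.selection_order[0]
            if not self.pending.get(pe):
                break  # wait for the PE at the head of the FIFO
            self.selection_order.popleft()
            released.append((pe, self.pending[pe].popleft()))
        return released
```

For example, if tasks are allocated to PE 2 and then PE 1, a result from PE 1 that arrives first is not released until PE 2 has delivered its result.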
3.2.4 Connection between Index FIFO and Memory Array
As described in Section. 3.1.3, the read address to the memory array is partly generated by the
index loaded from a dedicated memory component, Index FIFO which is located in the split
module. When multiple PEs are instantiated, different blocks dedicated to different PEs in the
memory array must be able to connect to any index FIFO located in other PEs’ split module
so that any PE will be able to process a training data partition that is previously split in a
different PE.
Figure 3.11: Architecture of multiplexing unit
As a result a multiplexing scheme is in place to manage these connections. Figure. 3.11 depicts
the architecture that implements the scheme. The upper part of the figure shows the high-level
view of the scheme, in which the split module in each PE is able to receive read requests from
the memory array block dedicated to any PE. At the same time, each split module can also
send index data to the read address generator dedicated to any PE. At any point in time, one
split module is paired with one PE. Any other PE that wants access to the same split
module has to wait until its turn comes.
The lower part of the figure shows how it is implemented in detail. To better elaborate the
scheme, the PE whose memory array module sends the read request is referred to as master-PE,
while the PE whose split module receives the request is referred to as slave-PE. Master
and slave can be the same PE, as happens when a single PE is instantiated. When the task
allocator assigns the tasks, in addition to sending task data to a master-PE as explained earlier,
a second set of task data is sent to the corresponding slave-PE. The information contains two
parts: Size is the same as in Figure. 3.9 and PE number is the ID of the master-PE. Once
received on the split module side, the PE number is used as the selecting signal to control
the multiplexer which takes as input a set of read requests from different PEs. Meanwhile Size
is used as a target number for counting the read requests; when the target number
is reached, the split module side switches to the next PE number. On the memory array side
in the master-PE, the right index FIFO output is selected by the Parent PE number which is
packed in the task data as illustrated in Figure. 3.9. Note that this number also selects which
PE should be the slave-PE. As mentioned earlier, multiple PEs may require access to the same
index FIFO. This is because two data partitions that are being processed by two PEs may come
from the same data partition before being split. In such a case access to the index FIFO
is granted in the same order in which the tasks are assigned. On the split module side, along with
the output of the index FIFO, an extra signal, PE number, is sent to the rest of the PEs (at this
time more than one PE may be tuned to the slave-PE). It shows the PE number the slave-PE is
currently expecting; only the master-PE with the same PE number proceeds to work on the
task, while the others have to hold until the right PE number is shown.
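The slave-side behaviour (select one master by PE number, count Size read requests, then switch to the next PE number) can be captured in a small behavioural model. The class and method names are hypothetical, and the model assumes every task has a Size of at least one.

```python
from collections import deque


class SlaveSideArbiter:
    """Grant one master-PE at a time access to an index FIFO, in task-assignment order."""

    def __init__(self):
        self.grants = deque()  # (master PE number, Size) per assigned task
        self.served = 0        # read requests served for the current grant

    def assign(self, master_pe, size):
        self.grants.append((master_pe, size))

    def expected_pe(self):
        """PE number currently shown to the masters, or None when idle."""
        return self.grants[0][0] if self.grants else None

    def read_request(self, master_pe):
        """Serve a request only from the expected master; switch after Size requests."""
        if master_pe != self.expected_pe():
            return False  # this master has to hold
        self.served += 1
        if self.served == self.grants[0][1]:
            self.grants.popleft()
            self.served = 0
        return True
```

A master whose number does not match `expected_pe()` simply retries later, which mirrors the hold behaviour described above.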
3.3 Evaluation
3.3.1 Training Accuracy of the Framework
In this section the accuracy of the RF model trained by the proposed training framework is
assessed. The procedure in the training process follows the original RF training algorithm
described in [6]. In the hardware architecture, a potential impact on accuracy¹ can be caused
by the transformation from the typical single/double-precision format of the values in the
training dataset in software to the custom fixed-point format in the hardware. These values are
used for sorting the data partition and for calculating the threshold τ in the split module. For
the sorting part, the conversion to custom fixed-point does not have an impact on the accuracy
since the lengths of the integer and fraction parts of the fixed-point format are selected so as
to preserve the distinction among different entries. On the other hand, the threshold τ in the
split module is calculated by taking the median of two consecutive values in the training data.
In the hardware this is implemented by an addition followed by a one-bit shift. It is expected
that a reduced precision in the training data stored in the memory array leads to lower precision
in the thresholds.
¹ The impact on accuracy in this context means a gap between the accuracy of the models trained
by the hardware and software implementations under the same conditions.
The potential loss in accuracy is evaluated in software by comparing the OOB error before and
after using the custom fixed-point format. A MATLAB-based implementation of the RF training
process is constructed, which mimics the training operation in hardware. Five different training
problems are selected from the UCI repository [31]. The properties of the training datasets are
given in Table. 3.1. These datasets are chosen because they were also used in the original
RF paper [6]. For each dataset two sets of training are performed. Each set contains 100
repetitions of the RF training process, each with 100 decision trees. The first set is done with
double precision; the second set is done after converting from double precision to custom
fixed-point. The lengths of the integer and fraction parts of the fixed-point format are
configured to be just enough to maintain the distinction among the values in the training
datasets. The average OOB error with standard deviation (sd) obtained from each training set is
given in Table. 3.2. For the selected training datasets the loss in accuracy after the conversion
is negligible.
Table 3.1: Property of the training datasets under test
Dataset          Number of instances   Number of attributes   Number of labels
Sonar            208                   60                     2
Breast cancer    569                   32                     2
Ionosphere       351                   34                     2
Vowel            528                   10                     11
Image segment    2310                  19                     7
Table 3.2: Comparison of training accuracy (OOB error)
Dataset          Double-precision (sd)   Custom fixed-point (sd)
Sonar            0.1881 (0.0201)         0.1877 (0.0202)
Breast cancer    0.0403 (0.0045)         0.0408 (0.0037)
Ionosphere       0.0782 (0.0092)         0.0762 (0.0048)
Vowel            0.0547 (0.0066)         0.0556 (0.0065)
Image segment    0.0472 (0.0049)         0.0468 (0.0049)
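The threshold calculation by an addition followed by a one-bit shift can be illustrated on integer codes. The function names and the choice of four fraction bits are assumptions for the example only.

```python
def to_fixed(value, fraction_bits):
    """Quantise a non-negative value to an integer code by truncation."""
    return int(value * (1 << fraction_bits))


def threshold_fixed(a, b):
    """Midpoint of two fixed-point codes: one addition followed by a one-bit right shift."""
    return (a + b) >> 1


FRACTION_BITS = 4
a = to_fixed(1.25, FRACTION_BITS)
b = to_fixed(1.5, FRACTION_BITS)
tau = threshold_fixed(a, b)
tau_value = tau / (1 << FRACTION_BITS)
```

When the sum of the two codes is odd, the shift discards half a least significant bit, which is one place where the threshold loses precision relative to a floating-point midpoint.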
Table 3.3: Hardware utilisation of task parallelism based architecture
Max number of training instances   LUTs       Registers   Logic   Memory bits    Memory M9Ks   Memory M144Ks
256                                4463 (1)   6473 (2)    (2)     57600 (<1)     10 (<1)       -
512                                4977 (1)   6884 (2)    (2)     139392 (<1)    21 (2)        -
8192                               8754 (2)   11691 (3)   (3)     3404288 (16)   298 (23)      8 (13)
3.3.2 Evaluation of Task Parallelism Based Architecture
Hardware Utilisation
The hardware resource utilisation of the architecture is application-dependent and is determined
by several factors including the number of entries and attributes in the training dataset, the
number of different labels, the word-length of the custom fixed-point format and the number
of processing elements that can be instantiated on a device.
Table. 3.3 shows the hardware utilisation of the architecture with respect to the maximum
number of training instances allowed to be trained. The utilisation is based on a single PE.
The number of attributes is fixed to eight and the word length of each entry is fixed to eight
bits. The figures in the brackets are the hardware utilisation as a percentage of total capacity
based on the Altera EP4SGX530HF35C2 FPGA. When the size of the training dataset increases,
embedded memory usage outpaces the logic fabric when a single PE is instantiated. Since the
embedded memory resource is mainly consumed in the memory array, the imbalance in the
hardware utilisation can be offset by using more PEs in the architecture.
Training Speed
The training speed of the proposed architecture is compared with two widely used GPP-based
RF implementations, randomForest v4.6.10 [30] in R project [19] and scikit-learn v0.15 [41] in
Python. Both software implementations run on a computer with a 2.6GHz CPU and 8GB
DDR3 memory. The proposed architecture is scaled on an Altera Stratix IV EP4SGX530KH40C2
FPGA.
Figure 3.12: Architecture of test environment
A simulation-based test bench is set up in order to perform the evaluation. The test bench is
built on a Qsys² project as shown in Figure. 3.12, in which the proposed architecture
is added as a custom hardware module. The training dataset is originally stored in the external
memory. When the training begins the training data is transferred to the memory array in the
proposed architecture through a DMA module. Throughout the training process, the training
results produced are sent back to the external memory through a second DMA module. The test
process is controlled by the NIOS II processor. The proposed architecture is set to work at 100MHz
since the maximum working frequency allowed for the circuit under test is slightly over 100MHz.
This frequency is obtained without particular timing optimisation. Therefore there is room
for improvement in the working frequency, which would lead to higher training speed. The training
time for the Sonar dataset based on 1000 trees (mtry set to 1) is given in Table. 3.4. The original
Sonar dataset contains 208 instances. In order to avoid inaccuracy due to the overhead in the
software training programs, the content of the dataset is duplicated 10 times, therefore the actual
dataset under test contains 2080 instances. For the scikit-learn implementation, training times
for both single-core and multi-core configurations are given since the implementation supports
multi-core processing. The results show that for the selected dataset the proposed architecture
outperforms the R implementation by around 160× and the scikit-learn implementation by up to
37×.
² Qsys is an FPGA system integration tool included in the Altera Quartus suite.
Table 3.4: Training time for task parallel based architecture
R        scikit-learn (single core / dual core)   Proposed
4.72 s   1.09 s / 0.72 s                           29.6 ms
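The quoted speed-up factors can be reproduced from the entries of Table 3.4. The symbol t_FPGA is introduced here only as shorthand for the training time of the proposed architecture:

```latex
\begin{align*}
\frac{t_{\mathrm{R}}}{t_{\mathrm{FPGA}}}
  &= \frac{4.72\ \mathrm{s}}{29.6\ \mathrm{ms}} \approx 159.5, \\
\frac{t_{\text{scikit-learn, single core}}}{t_{\mathrm{FPGA}}}
  &= \frac{1.09\ \mathrm{s}}{29.6\ \mathrm{ms}} \approx 36.8, \\
\frac{t_{\text{scikit-learn, dual core}}}{t_{\mathrm{FPGA}}}
  &= \frac{0.72\ \mathrm{s}}{29.6\ \mathrm{ms}} \approx 24.3.
\end{align*}
```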
3.4 Conclusion
In this chapter a hardware framework is proposed to accelerate the RF training process on
FPGA devices. The framework is the foundation for three different hardware training
architectures proposed in this work. To achieve acceleration, two different approaches are taken
at the same time in the design of the framework. First, the training process is pipelined into
several stages in hardware at a high level. Second, the training data stored in the FPGA are
mapped in a way that allows a task-parallelism or data-parallelism scheme to be implemented.
Later in the chapter, the task parallelism based architecture is introduced. The architecture is
able to outperform popular GPP implementations on the test dataset, which demonstrates the
effectiveness of task parallelism and pipelining techniques for the RF training process on FPGA.
Chapter 4
Enhancement to the Framework and
Data Parallelism Based Architecture
The training framework described in the previous chapter has limitations in
three aspects, which together restrict its range of application. Firstly, the
hardware design is optimised for the case in which the parameter mtry is fixed to one, i.e.
only one attribute is randomly selected and assessed during the split process for each data
partition. In the original paper [6] where RF was introduced, it was claimed that the average
difference between the error rates based on assessing a single attribute and multiple attributes
was less than 1%; however, empirical results from later research demonstrate that
mtry is one of the few parameters to which the training result is sensitive
[14, 32, 15].
This observation makes mtry a major tuning parameter in the RF training algorithm. Therefore
having a fixed mtry can affect the training accuracy in certain cases when compared to other
implementations in which the parameter is adjustable. The second limitation is that
multiple-class (with n labels, n > 2) training is not supported. Multiple-class training can
generally be reduced to multiple binary trainings by using the one-vs-all (OvA) or one-vs-one
(OvO) method; however, these methods result in significantly longer training time, which
is against the fundamental motivation of the work. The third limitation lies in the relatively
small embedded memory capacity in FPGA devices, which can quickly run out for large training
datasets. In the framework the training data is stored in custom fixed-point format, which is
not efficient in the case of very sparsely (large dynamic range) or very densely (high precision)
distributed data. This inefficiency not only limits the size of training problems that can be
processed by the architecture, but also reduces the effective embedded memory bandwidth,
which is crucial for accelerating the training process in the architecture.
In this chapter several enhancements are introduced to the hardware framework in order to
address the above limitations. New hardware is designed and optimised for an adjustable mtry
parameter and multiple-class training. In addition a data compression scheme is introduced
to compress the training data before storing it in the FPGA, which leads to better usage
of memory space for any training dataset. These enhancements improve the performance and
flexibility of the training architectures built upon the framework, and therefore extend their
range of applications. Also introduced in this chapter is a data parallelism based training
architecture which is built upon the enhanced hardware framework. A data parallelism scheme
separates the workload within a training data partition and uses multiple workers to process
the workload in parallel.
The chapter is organised as follows. In the first part of the chapter, details of the three
enhancements to the training framework are given and the performance of the training data
compression scheme is evaluated. In the second half of the chapter, details of the data
parallelism based architecture are given. The proposed architecture is evaluated at the end of
the chapter.
4.1 Enhancement to the Framework
4.1.1 Adjustable Parameter mtry
According to the workflow in Figure 3.1, the input to the sorting step and split step, the
Unsorted data list, is extracted from the training data partition after selecting one attribute.
When mtry is set to d (d > 1), d attributes must be selected. Each selected attribute produces
its own data list and each data list produces a local optimal split point. A global optimal
split point is then determined by comparison. This process is depicted in the revised workflow
in Figure 4.1. In the workflow an inner loop is set up, so that the steps from Data extraction
to Compare and update are repeated until mtry repeats are reached. The Compare and update
step is in place to determine the global optimal split from the local ones.
Figure 4.1: Workflow of DT training process with mtry > 1
Figure 4.2: Enhanced top-level architecture of training framework
In hardware the new inner loop in the workflow is implemented by executing the same task
mtry times. An updated top-level architecture of the training framework is given in Figure 4.2.
The architecture has two additional components. A Temporary storage holds the data
relating to the task that is being processed. Until mtry repeats have been reached, the task
stored in the temporary storage is always loaded by the PE with priority. A second component,
Local comparison, holds the training results output from the split module (see Figure 3.7).
Within the inner loop, whenever a new local optimal split point is produced, its split quality
measurement is compared to the current best one. The training results relating to the current
best split point are updated whenever a new best is obtained. After mtry repeats are done, the
training result left is the global optimal one.
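
The inner loop and the Compare and update step can be summarised in software as follows. This
is a minimal Python sketch and not the hardware design: the function names are hypothetical,
`best_split_for_attribute` merely stands in for the sorting and split steps of a PE, and the
sketch assumes that the mtry attributes are drawn without repetition, which the text does not
specify.

```python
import random

def gini_score(left_labels, right_labels):
    """Split quality: sum of squared label counts over branch size (larger is better)."""
    score = 0.0
    for branch in (left_labels, right_labels):
        counts = {}
        for y in branch:
            counts[y] = counts.get(y, 0) + 1
        score += sum(c * c for c in counts.values()) / len(branch)
    return score

def best_split_for_attribute(rows, labels, attr):
    """Hypothetical stand-in for the sorting and split steps: local optimal split."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][attr])
    best = None
    for cut in range(1, len(order)):
        lo, hi = rows[order[cut - 1]][attr], rows[order[cut]][attr]
        if lo == hi:
            continue  # no threshold separates two equal values
        left = [labels[i] for i in order[:cut]]
        right = [labels[i] for i in order[cut:]]
        cand = (gini_score(left, right), attr, (lo + hi) / 2)
        if best is None or cand[0] > best[0]:
            best = cand
    return best  # (quality, attribute, threshold), or None if no split exists

def split_partition(rows, labels, mtry, rng):
    """Inner loop: process the same partition mtry times and keep the best result."""
    n_attrs = len(rows[0])
    best = None  # plays the role of the Local comparison storage
    for attr in rng.sample(range(n_attrs), mtry):  # assumption: distinct attributes
        local = best_split_for_attribute(rows, labels, attr)
        if local is not None and (best is None or local[0] > best[0]):
            best = local  # compare and update
    return best

rows = [[5, 1.0], [3, 2.0], [4, 3.0], [1, 4.0]]
labels = [0, 0, 1, 1]
print(split_partition(rows, labels, mtry=2, rng=random.Random(0)))
```

With mtry equal to the number of attributes, as in the usage lines, every attribute is assessed
and the result is the global optimum over the whole partition.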
4.1.2 Support for Multiple-class (> 2) Classification
In Section 3.1.5 the calculation of the Gini impurity based objective function is optimised so
as to reduce the hardware resource utilisation of its implementation while maintaining a high
output throughput. The optimisation originally proposed in [37] targets only binary
classification, therefore the hardware framework introduced in the previous chapter does not
support multiple-class (> 2) classification. In this section a new optimisation scheme and the
corresponding hardware implementation are introduced. The proposed hardware design maintains
the same output throughput as the one in [37] but supports any number of labels in the training
data.
New Optimisation Scheme
Recall that the objective function is defined as:
\[ I = i(N) - P_L\, i(N_L) - P_R\, i(N_R) \tag{4.1} \]
where $N$ is the training data partition contained in a non-leaf node, $N_L$ and $N_R$ represent
the left and right partitions derived from $N$, and $P_L$ and $P_R$ are the proportions of training
instances reaching the left and right branches respectively. $i(\cdot)$ is the Gini impurity
measurement defined as:
\[ i(N) = \sum_{i \neq j} P_i P_j = \frac{1}{2}\left[ 1 - \sum_{j} P_j^2 \right] \tag{4.2} \]
where $P_i$ and $P_j$ refer to the proportions of training instances with class $i$ and $j$ in the
partition $N$ respectively. The split module aims to find the split point that produces the
largest reduction in the impurity, i.e. the split point that produces the minimum Gini impurity
measure after the split. Therefore only the weighted sum $P_L\, i(N_L) + P_R\, i(N_R)$ in the
objective function needs to be calculated for comparison. Substituting $i(\cdot)$ in Equation 4.1
with Equation 4.2, dropping the constant factor 1/2 (which does not change the ranking of the
candidates) and rearranging, the objective function is redefined as the split quality measurement:
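\begin{align}
s'(\cdot) &= P_L \left\{ 1 - \left[ P^2(\omega_1) + P^2(\omega_2) + \dots + P^2(\omega_m) \right] \right\}
           + P_R \left\{ 1 - \left[ P^2(\omega'_1) + P^2(\omega'_2) + \dots + P^2(\omega'_m) \right] \right\} \tag{4.3a} \\
          &= 1 - P_L \left[ P^2(\omega_1) + P^2(\omega_2) + \dots + P^2(\omega_m) \right]
               - P_R \left[ P^2(\omega'_1) + P^2(\omega'_2) + \dots + P^2(\omega'_m) \right] \tag{4.3b} \\
          &= 1 - \frac{1}{n} \left[ \frac{n_{(L,1)}^2 + n_{(L,2)}^2 + \dots + n_{(L,m)}^2}{n_L}
               + \frac{n_{(R,1)}^2 + n_{(R,2)}^2 + \dots + n_{(R,m)}^2}{n_R} \right] \tag{4.3c}
\end{align}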
$P(\omega_m)$ and $P(\omega'_m)$ are the proportions of the instances with label $m$ in the left and
right branch respectively. Equation 4.3b follows from $P_L + P_R = 1$. Further substituting
$P_L = n_L/n$, $P_R = n_R/n$, $P(\omega_m) = n_{(L,m)}/n_L$ and $P(\omega'_m) = n_{(R,m)}/n_R$ in
Equation 4.3b and rearranging, Equation 4.3c is obtained, where $n$ refers to the size of the data
partition before the split, $n_L$ and $n_R$ are the sizes of the left and right branch after the
split, and $n_{(L,m)}$ and $n_{(R,m)}$ are the numbers of instances with label $m$ in the left and
right branch respectively. Since $n$ is constant for all candidate split points of the same data
partition, $s'(\cdot)$ can be further simplified as:
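\[ s(\cdot) = \frac{1}{n_L} \underbrace{\left[ n_{(L,1)}^2 + n_{(L,2)}^2 + \dots + n_{(L,m)}^2 \right]}_{\text{sum of squares A}}
            + \frac{1}{n_R} \underbrace{\left[ n_{(R,1)}^2 + n_{(R,2)}^2 + \dots + n_{(R,m)}^2 \right]}_{\text{sum of squares B}} \tag{4.4} \]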
The split point that yields the maximum value of $s(\cdot)$ is the best one. A straightforward
implementation of Equation 4.4 can be computationally intensive. However, recall from the
workflow of the split module that the entries of a data sequence arrive at the split module one
by one, and each arrival produces a new candidate split point. That means that for the sum of
squares A only one element $n_{(L,j)}$, $j \in \{1, 2, \dots, m\}$, changes at a time, and each time
$n_{(L,j)}$ decrements by one. Since the entries leaving the left branch move to the right branch,
similarly for the sum of squares B only one element $n_{(R,j)}$ increments by one at a time. The
other squared terms in the equation remain unchanged.
\[ s(\cdot) = \frac{1}{n_L} \underbrace{\left( S_A - 2n'_{(L,j)} + 1 \right)}_{\text{sum of squares A}}
            + \frac{1}{n_R} \underbrace{\left( S_B + 2n'_{(R,j)} + 1 \right)}_{\text{sum of squares B}} \tag{4.5} \]
Figure 4.3: Improved split quality measurement in hardware
Based on this observation, Equation 4.4 can be transformed to Equation 4.5, where $S_A$ and
$S_B$ are equal to the sums of squares A and B from the previous split point, and $n'_{(L,j)}$ and
$n'_{(R,j)}$ are the counts, before the update, of the particular label $j$ that is to be updated in
the left and right branches. As a result the implementation can be simplified by temporarily
storing past values and calculating the sums of squares progressively.
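
To illustrate Equations 4.4 and 4.5, the following Python sketch evaluates the split quality of
every candidate split point of a sorted label sequence in two ways: directly from the label
histograms (Equation 4.4) and progressively from the stored sums of squares (Equation 4.5). It
is a software model only; the function and variable names are illustrative, and it follows the
convention in the text that all entries start in the left branch and move to the right branch
one at a time.

```python
def quality_direct(left_hist, right_hist):
    """Equation 4.4: s = (sum_k n_Lk^2) / n_L + (sum_k n_Rk^2) / n_R."""
    n_l, n_r = sum(left_hist), sum(right_hist)
    return (sum(c * c for c in left_hist) / n_l
            + sum(c * c for c in right_hist) / n_r)

def qualities_incremental(sorted_labels, n_classes):
    """Equation 4.5: update S_A and S_B when one entry moves from left to right."""
    left_hist = [0] * n_classes
    for y in sorted_labels:
        left_hist[y] += 1
    right_hist = [0] * n_classes
    s_a = sum(c * c for c in left_hist)  # initial S_A: whole partition on the left
    s_b = 0                              # initial S_B: right branch is empty
    n_l, n_r = len(sorted_labels), 0
    out = []
    for y in sorted_labels[:-1]:         # the last move would empty the left branch
        s_a += -2 * left_hist[y] + 1     # (n' - 1)^2 = n'^2 - 2n' + 1
        s_b += 2 * right_hist[y] + 1     # (n' + 1)^2 = n'^2 + 2n' + 1
        left_hist[y] -= 1
        right_hist[y] += 1
        n_l, n_r = n_l - 1, n_r + 1
        out.append((s_a / n_l + s_b / n_r, list(left_hist), list(right_hist)))
    return out

# Both evaluations agree for every candidate split point of a three-label sequence.
for s, lh, rh in qualities_incremental([0, 2, 1, 1, 0, 2, 2], n_classes=3):
    assert abs(s - quality_direct(lh, rh)) < 1e-12
```

Each update costs one multiplication by two and two additions per branch regardless of the
number of labels, whereas the direct form recomputes m squares per branch.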
Hardware Implementation
The Split quality measurement component in the split module is modified in order to implement
Equation 4.5. The architecture of the modified component is shown in Figure 4.3.
Compared with the original design in Figure 3.7, extra storage components, SA/SB, are added
in order to store the corresponding variables in Equation 4.5. Both SA and SB are initialised
at the beginning of the search process. SB is initialised to zero, whereas SA is
initialised from a dedicated SA storage which is located in the task allocation module. When a
global optimal split point is determined, the values of SA and SB relating to the split point
become the initial values for the newly generated partitions. These values are written back to
the SA storage and will be loaded in the following iterations.
Figure 4.4: A training dataset of five examples mapped in feature space
4.1.3 Training Data Compression Scheme
The data compression scheme is proposed for reducing the memory space required by the
training data used at the training stage. Under the scheme, the training data that is sent to
the FPGA device is compressed before being stored in the embedded memory blocks, without
compromising the accuracy of the generated predicting model. The compression method takes
advantage of the internal workings of the training algorithm for the induction of decision trees,
resulting in a lossless compression. The freed memory space can be used to hold more training
data, or to allow extra parallel processing power to reduce the training time of the existing
problem.
Method
The compression method exploits the fact that for an axis-parallel weak classifier,
the real values of the training examples are not directly involved in the training process, i.e.
in the optimisation of the objective function I, but are only needed at a later stage to
calculate the threshold τ for the optimal weak learner φ that is found through the search in
the split step. An example is given in Figure 4.4 to help elaborate the idea.
In the figure, a training dataset containing five training instances (with three attributes) is
mapped into a three-dimensional feature space. Two colours indicate the classes to which the
instances belong. $d_1, \dots, d_5$ are the positions of the examples projected onto axis three,
i.e. the values of attribute $a_3$. However, instead of the real values $d_1, \dots, d_5$, it is the
relative positions (ranking) of the examples on axis three that are involved in the optimisation.
Once the optimal weak learner is found, the threshold τ is then calculated by taking the midpoint
of two adjacent real values based on the result of the optimisation.
Figure 4.5: Architecture of the compression module incorporated into the training framework
The compression method works as follows. Given a training dataset (of size n × d, where
n is the number of training instances and d is the number of attributes), consider each column
as a number list and replace the real values in the list with their corresponding sorted indices.
The same index is used for data with the same value. For the example in Figure 4.4, the real
values $d_1, d_2, \dots, d_5$ would be replaced by the integers 1, 2, ..., 5. The real-valued entries
in the data list are thereby transformed into integers with the gap between consecutive entries
fixed to one. From a hardware perspective, the transformation reduces the hardware resources
needed to represent the training data.
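
The transformation can be sketched in a few lines of Python. This is an illustrative software
model with hypothetical function names; it assumes dense ranking (equal values share one rank
and no rank is skipped), which is one reading of the rule that the same index is used for data
with the same value.

```python
def compress_column(values):
    """Replace each value by its 1-based rank; equal values share one rank."""
    distinct = sorted(set(values))
    rank_of = {v: r for r, v in enumerate(distinct, start=1)}
    return [rank_of[v] for v in values]

def compress_dataset(rows):
    """Apply the rank transform to every attribute (column) of an n x d dataset."""
    columns = [compress_column(col) for col in zip(*rows)]
    return [list(r) for r in zip(*columns)]

column = [0.7, -3.2, 0.7, 15.0, 2.1]
ranks = compress_column(column)
# The ordering of any two entries is preserved, so every candidate split of the
# sorted list separates the same training instances before and after compression.
for a in range(len(column)):
    for b in range(len(column)):
        assert (column[a] < column[b]) == (ranks[a] < ranks[b])
```

Because only the ordering matters to the split search, the compression is lossless with respect
to the induced tree structure; the real values are needed again only to compute τ.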
Hardware Implementation
Figure 4.5 illustrates the way the compression module is integrated into the existing training
framework. The Compression module is placed before the Memory array so that the training data
imported from external memory is compressed before being stored. The compression
module and the training module share one sorter in the architecture in order to reduce resource
utilisation. The detailed hardware architecture of the compression module is depicted in
Figure 4.6. The training data are separated into several lists which enter the compression
module sequentially. Each list contains the values belonging to one attribute. For each list,
counter #1 attaches an index $i_n$ to each entry, recording its original position in the list. The
list is then sent to a sorter where the entries are repositioned with respect to their values. A
second counter, #2, assigns another index $r_n$ to each repositioned entry, indicating
its new position in the list (rank). $r_n$ is then used as the compressed data and is written
to the memory array, with $i_n$ being the write address. By doing so the internal layout of the
training dataset before and after the compression remains the same.
Figure 4.6: Hardware implementation of the compression scheme
Due to the compression, the optimal threshold τ corresponding to the optimal weak learner
φ is produced not in the form of a real value but as a pair of ranks. The real-valued threshold
is calculated by $\tau = (d_i + d_j)/2$, where $d_i$ and $d_j$ are the real values corresponding to
the ranks $i$ and $j$. A mechanism is designed to link $i$ to $d_i$ as illustrated in Figure 4.7. The
data list listvalues on the right side stores the uncompressed training data $d(i)$; the data list
listpointers on the left stores the sorted indices $i_n$ that were previously used as write
addresses for the memory block in the compression module (the indices that are exported in
Figure 4.6). The entries in listpointers serve as pointers storing the locations of the real
values; therefore, given a rank $i$, the $i$th element in listpointers contains the location of
$d_i$ stored in listvalues.
Figure 4.7: Mechanism linking compressed data $i$ to real value data $d_i$
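
The data flow through the two counters, the sorter and the pointer list can be modelled as
follows. This is a hedged Python sketch rather than a description of the actual circuit: ranks
are 0-based for convenience, all values in the list are assumed to be distinct, and the names
`compress_list` and `threshold_from_ranks` are hypothetical.

```python
def compress_list(values):
    """Software model of the compression of one attribute list."""
    indexed = list(enumerate(values))               # counter #1 attaches i_n
    ordered = sorted(indexed, key=lambda e: e[1])   # sorter reorders by value
    memory = [0] * len(values)
    list_pointers = []
    for r_n, (i_n, _) in enumerate(ordered):        # counter #2 assigns r_n
        memory[i_n] = r_n                           # write r_n at address i_n
        list_pointers.append(i_n)                   # exported sorted index
    return memory, list_pointers

def threshold_from_ranks(rank_i, rank_j, list_pointers, list_values):
    """Recover the real-valued threshold tau = (d_i + d_j) / 2 from two ranks."""
    d_i = list_values[list_pointers[rank_i]]
    d_j = list_values[list_pointers[rank_j]]
    return (d_i + d_j) / 2

list_values = [9.0, 1.0, 5.0, 3.0]                  # uncompressed attribute values
memory, list_pointers = compress_list(list_values)
print(memory)                                       # compressed data, original layout
print(threshold_from_ranks(1, 2, list_pointers, list_values))
```

In the usage lines the split falls between ranks 1 and 2, whose real values are 3.0 and 5.0, so
the recovered threshold is their midpoint.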
The architecture in Figure 4.5 assumes that the compression module is incorporated into the
training framework and that they are synthesised together. In this case the number of distinct
values in the attribute number list is not known when determining the word-length for the
compressed training data. Therefore it is assumed that all the entries in the number list are
unique. The word-length is set to be long enough to represent n ranks, where n is equal to
the number of entries in the number list. The word-length could therefore be longer than
necessary when duplicated values appear in the list. If, in a different scenario, the compression
process is performed before the synthesis of the training framework, the word-length can be
configured to cover only the number of distinct values in the list, which guarantees the
minimum possible word-length. To distinguish the compression process in these two scenarios,
the former is named the long method while the latter is referred to as the short method. The
short method requires prior knowledge of the training dataset before synthesising the training
framework, which may not be available at design time.
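
The difference between the two methods amounts to which count is used when sizing the word. A
minimal Python sketch, assuming that a word only needs to distinguish the ranks from each other
(the helper names are hypothetical):

```python
def bits_for(count):
    """Minimum number of bits needed to encode `count` distinct ranks."""
    return max(1, (count - 1).bit_length())

def word_lengths(column):
    """Return (short, long) word-lengths in bits for one attribute list."""
    long_bits = bits_for(len(column))        # long method: assume all entries unique
    short_bits = bits_for(len(set(column)))  # short method: count distinct values
    return short_bits, long_bits

# A list of 5000 entries that contains only 1000 distinct values.
column = [i % 1000 for i in range(5000)]
print(word_lengths(column))
```

The two results coincide whenever the number of entries and the number of distinct values round
up to the same power of two, so the short method only pays off for lists with many duplicates.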
Table 4.1: Comparison of memory usage (bits per attribute per example)
Dataset          Custom float   Custom fixed   Proposed-short   Proposed-long
Sonar            14             15             8                8
Breast cancer    19             33             10               10
Ionosphere       19             19             9                9
Vowel            18             15             10               10
Segment          36             32             11               12
Waveform         14             12             10               13
4.1.4 Evaluation of Data Compression Scheme
The data compression scheme is evaluated by comparing it against other potential compression
methods. The reference designs are based on custom precision number representations aiming at
reducing the required memory resource. The evaluation uses the same training datasets
that are used for the baseline architecture. By preprocessing the datasets, the required
word-lengths for custom floating-point precision and fixed-point precision were deduced along
with the word-length required under the proposed scheme. Table 4.1 lists the word-length
required for representing a data value corresponding to one attribute of the dataset using the
three different formats. The results for the proposed method include both the short and long
methods.
The word-lengths needed for the first two formats are calculated as follows. For the
floating-point format, the length of the mantissa is the minimum length that meets the
sufficient and necessary condition for maintaining distinction among elements in the dataset
[35], and the length of the exponent is chosen to accommodate the dynamic range of all the
values in the dataset. For the fixed-point format, the fraction length is chosen to give enough
precision to represent the minimum distance between the sorted values in the dataset, and the
integer length is determined by the maximum absolute value in the dataset.
Note that the word-length for both the floating-point and fixed-point formats depends on the
range or the precision of the values in the dataset, whereas for the compression scheme it
depends only on the number of (distinct) elements in the dataset. The compression is more
effective when the distribution of the values is either extremely sparse or extremely dense. From
the table, the reduction for the first four datasets ranges from 33% to 53%¹. For the following
two datasets of relatively larger size, "Segment" has both high precision and large range, so
compression reduces the word-length by 66%. For "Waveform", however, due to the larger size
of the dataset and the denser distribution of the elements, the long method is no longer
efficient when compared with the fixed-point format. The short method on the other hand is able
to reduce the word-length by 17% and guarantees the minimum word-length in all circumstances.
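
The quoted percentages can be reproduced from Table 4.1. The short Python sketch below assumes,
as stated in the footnote, that the reduction is measured between min{float, fixed} and
min{short, long}; the dictionary simply restates the table and the function name is illustrative.

```python
TABLE_4_1 = {
    # dataset: (custom float, custom fixed, proposed-short, proposed-long)
    "Sonar": (14, 15, 8, 8),
    "Breast cancer": (19, 33, 10, 10),
    "Ionosphere": (19, 19, 9, 9),
    "Vowel": (18, 15, 10, 10),
    "Segment": (36, 32, 11, 12),
    "Waveform": (14, 12, 10, 13),
}

def reduction(row):
    """Word-length reduction in percent, best reference versus best proposed."""
    flt, fixed, short, long_ = row
    reference = min(flt, fixed)
    proposed = min(short, long_)
    return 100.0 * (reference - proposed) / reference

for name, row in TABLE_4_1.items():
    print(f"{name}: {reduction(row):.0f}%")
```

For "Waveform" the long method alone needs 13 bits, one more than the 12-bit fixed-point
reference, which is why only the short method yields a reduction for that dataset.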
4.2 Data Parallelism Based Architecture
In the previous section, three enhancements to the training framework are introduced. One of
the enhancements is allowing an adjustable parameter mtry. When mtry is set to be greater
than one, multiple attributes must be randomly selected and assessed during the search for
the optimal split point. Due to this enhancement, however, the task parallelism based training
architecture introduced in the previous chapter is no longer suitable for the improved
framework. This is because, in the task parallelism based design, the split of one training data
partition is completed within one PE, while each PE has access to the values of only a subset
of the attributes. As a result the design would only work when only one PE is instantiated in
the architecture, or when the randomly selected attributes never fall outside the accessible
range of the PE, which is not the case for the uniformly distributed random integer generator
used in the attribute selector.
To address this problem, a different parallel processing scheme, i.e. a data parallelism scheme,
is explored for the enhanced training framework. The difference between data parallelism and
task parallelism is that, in task parallelism, the workload is separated with respect to the
training data partitions located on the same level of the decision tree, i.e. each parallel
worker is responsible for one partition, whereas in data parallelism the workload within a
single partition is separated and processed by parallel workers. A new pair of task allocation
module and results collection module is designed to support this parallel scheme.
¹ The results are based on a comparison between min{float, fixed} and min{short, long}.
4.2.1 Overview of Data Parallelism Scheme in Hardware
An updated workflow that adopts the vertical data parallelism is depicted in Figure 4.8². The
workflow is built on the fundamental workflow for the enhanced framework shown in Figure 4.1.
Each training data partition loaded from the Queue produces n (n = mtry) different Unsorted
data lists, which are then allocated to the PEs. Since the number of PEs
can be less than mtry, each PE may receive more than one data list. In the figure, the number
of data lists that a PE needs to process is referred to as Local-mtry. Each PE then
executes its own inner loop and determines its local optimal split point. After the PEs
complete their work, the local training results are aggregated to produce a global optimal
split point. The new partitions generated from the optimal split point are then sent back to
the queue to wait for further splitting.
Similar to the task parallelism based design, the PE to which an input data list is allocated
depends on which attribute has been selected for the data list. Each PE in the architecture
is designed to process data lists within a specific range of attributes. For instance,
if there are two PEs in place and the training data contains 10 attributes in total, then PE #1
will be responsible for data lists with attributes #1-#5 and PE #2 will be responsible for data
lists with attributes #6-#10.
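
The range-based allocation and the resulting Local-mtry can be sketched as follows. This Python
fragment is an illustration only: it assumes that the attributes are divided into equal
contiguous ranges, as in the two-PE example above, and the function names are hypothetical.

```python
def pe_for_attribute(attr, n_attrs, n_pes):
    """Map a 1-based attribute number to a 1-based PE number by contiguous ranges."""
    per_pe = -(-n_attrs // n_pes)      # ceiling division: attributes per PE
    return (attr - 1) // per_pe + 1

def allocate(selected_attrs, n_attrs, n_pes):
    """Return the attribute queue of each PE; its length is that PE's Local-mtry."""
    queues = {pe: [] for pe in range(1, n_pes + 1)}
    for attr in selected_attrs:
        queues[pe_for_attribute(attr, n_attrs, n_pes)].append(attr)
    return queues

# mtry = 3 attributes selected out of 10, distributed over two PEs.
queues = allocate([2, 7, 9], n_attrs=10, n_pes=2)
local_mtry = {pe: len(q) for pe, q in queues.items()}
print(queues, local_mtry)
```

A PE whose range contains none of the selected attributes ends up with a Local-mtry of zero and
contributes no local result to the aggregation.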
In the following sections, details of task allocation module and results collection module will
be given.
4.2.2 Data Parallelism Based Task Allocation Module
The task allocation module is responsible for assigning the workload within a single training
partition to different PEs. The workload is separated with respect to the attributes
selected. The allocation module determines which PE the workload with a specific attribute
should be sent to. The architecture of the task allocation module is shown in Figure 4.9. Note
that among the components of the task allocation module, a small part is actually located in
the PE. The main component in the task allocator is a FIFO structure that is used for queuing
the tasks. Each entry in the task FIFO structure is a package of two data items: Size indicates
the number of training instances contained in the training data partition, and Label histogram
is used to initialise the label histograms in the split quality measurement component in the
split module.
² The architecture introduced in Section 4.2 does not include the data compression scheme
proposed in Section 4.1.3.
Figure 4.8: Vertical data parallelism based workflow of DT training process
Figure 4.9: Architecture of data parallelism based task allocation module
The first task in the queue is initialised as follows. Since the first training data partition in
the workflow is the entire subset sampled from the original training dataset, the first task is
generated by an Initialisation component that collects data from the first partition, including
the size and the label histograms. Each task loaded from the queue produces n (n = mtry)
input data lists (n different workloads). Based on the attribute selected, the PE selection
component determines which PE the task data should be sent to. Meanwhile, the attribute value
itself (i.e. the number of the selected attribute) is also sent to the selected PE. In the PE, an
Attribute queue is in place to receive the attribute values. Once the assignment of the workload
is completed, each PE counts the total number of attribute values it has received in order to
determine Local-mtry, i.e. how many different input data lists it has to process. At this point,
the job of the task allocation module is done.
Figure 4.10: Architecture of data parallelism based results collection module
4.2.3 Data Parallelism Based Results Collection Module
The local optimal split points and related training results from all the PEs are aggregated in
the results collection module. In the module, the local optimal results are compared in order
to determine the global optimal one. After that, the newly generated
partitions based on the optimal training results are sent back to the task queue.
The architecture of the results collection module is shown in Figure 4.10. When all the PEs
have completed their jobs, the results collection module imports the training results from
each PE in turn, starting from the first PE; a multiplexer controlled by a counter selects the
PE. Whenever a new result arrives, it is compared in the Comparison component with the
previous results (if any). The current best result is temporarily stored in the Best results
component. After all the PEs have been accessed, the result left in the Best results component
is the global optimal result, based on which new tasks are generated and written back to the
task queue; meanwhile the training results relating to the optimal split point are exported. A
FIFO component, the Global index FIFO, is in place to receive the indices from the local index
FIFO located in the PE that produces the best split point. The indices in the Global index FIFO
are used to generate the read addresses for the memory array.
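
The sequential aggregation can be summarised by the Python sketch below. It is a behavioural
model under stated assumptions, not the module itself: each local result is represented as a
hypothetical (quality, attribute, split information) tuple, a PE that received no data list
reports None, and ties are resolved in favour of the PE visited first.

```python
def collect_results(local_results):
    """Visit the PEs in order and keep the best local result seen so far.

    local_results[k] is the result of PE k as a (quality, attribute, split_info)
    tuple, or None if that PE received no data list (Local-mtry = 0).
    """
    best = None                                   # Best results storage
    best_pe = None
    for pe, result in enumerate(local_results):   # counter-driven multiplexer
        if result is None:
            continue
        if best is None or result[0] > best[0]:   # Comparison component
            best, best_pe = result, pe
    return best_pe, best

local = [(2.5, 3, "split a"), None, (3.1, 7, "split b"), (3.1, 9, "split c")]
print(collect_results(local))
```

The index of the winning PE is returned alongside the result because, as described above, the
Global index FIFO has to be filled from the local index FIFO of that particular PE.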
Table 4.2: Hardware utilisation of data parallelism based architecture
Max number of        LUTs       Registers   Logic   Memory     Memory    Memory
training instances                                  bits       M9Ks      M144Ks
256                  4403 (1)   5519 (1)    (2)     41608      14 (2)    -
512                  5030 (1)   6308 (1)    (2)     89744      20 (2)    -
1024                 5732 (1)   7309 (2)    (2)     195736     25 (2)    1 (2)
2048                 6437 (2)   8147 (2)    (3)     428960     42 (4)    2 (3)
4096                 7168 (2)   9104 (2)    (3)     925864     64 (5)    5 (8)
8192                 7536 (2)   7928 (2)    (3)     1464496    56 (5)    10 (6)
4.2.4 Evaluation of Data Parallelism Based Architecture
The evaluated architecture does not include the data compression scheme previously introduced.
Scalability
The scalability of the architecture has similar characteristics to the baseline design. The
resource utilisation scales with the number of entries, attributes and labels in the training
data as well as the number of PEs instantiated.
In Table 4.2 the hardware utilisation with respect to the maximum number of training instances
allowed in the training dataset is given, with the numbers of attributes and labels fixed
to eight and two respectively. The word-length of each training entry is eight bits. The
utilisation as a percentage of the total capacity of an Altera EP4SGX530HF35C2 FPGA is
given in brackets. The column Logic refers to the total utilisation of the logic fabric. The
number of attributes is fixed to a relatively small number since a training dataset with more
attributes can be processed by multiple PEs, each taking responsibility for a subset of the
attributes.
As the size of the training data grows, the embedded memory utilisation outpaces that of the
logic fabric (LUTs and registers) in terms of the percentage of total capacity in the device,
making embedded memory the bottleneck of the scalability. This imbalance in resource
utilisation can be offset by instantiating multiple PEs. When configuring the architecture, a proper number of
PEs should be chosen to reach a balance between the utilisation of logic fabric and embedded
memory. Then the entire architecture as a component can be duplicated so as to train multiple
decision trees in parallel.
Table 4.3: Training time for data parallelism based architecture
R         scikit-learn (single core / dual cores)   Proposed
4.51 s    3.57 s / 1.77 s                           43.7 ms
Training Speed
The testing environment for the training speed is similar to the one used for the task
parallelism based architecture in Chapter 3. The training speed on the Sonar dataset (with the
content of the dataset duplicated 10 times) is compared to the randomForest v4.6.10 package in
R and the scikit-learn v0.15 package in Python. The parameter mtry is set to $\sqrt{d}$, where d
is the number of attributes. The working frequency is set to 100MHz as a safe choice in order to
avoid timing violations; the actual maximum working frequency achieved is slightly over 100MHz.
The average training time obtained is given in Table 4.3. Note that for the randomForest
package in R, the training time for mtry = $\sqrt{d}$ is shorter than for mtry = 1. The proposed
architecture achieves a speed-up of around 103× and up to 82× over the R implementation and the
scikit-learn implementation respectively.
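
As a check on these figures, the speed-ups follow directly from the training times in
Table 4.3, with the proposed time expressed in seconds. The third ratio, for scikit-learn on
two cores, is not quoted in the text and is derived here from the same table:

```latex
\begin{align*}
\frac{t_{\mathrm{R}}}{t_{\mathrm{proposed}}} &= \frac{4.51\,\mathrm{s}}{0.0437\,\mathrm{s}} \approx 103.2 \\
\frac{t_{\mathrm{scikit,\,1\,core}}}{t_{\mathrm{proposed}}} &= \frac{3.57\,\mathrm{s}}{0.0437\,\mathrm{s}} \approx 81.7 \\
\frac{t_{\mathrm{scikit,\,2\,cores}}}{t_{\mathrm{proposed}}} &= \frac{1.77\,\mathrm{s}}{0.0437\,\mathrm{s}} \approx 40.5
\end{align*}
```

The "up to 82×" figure therefore corresponds to the single-core scikit-learn run.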
4.3 Conclusion
In this chapter the hardware training framework is enhanced by adding support for an adjustable
mtry parameter and the capability of handling more than two class labels. In addition
a data compression scheme is proposed to optimise the storage of training data in the FPGA. The
flexible mtry naturally enables vertical data parallelism in the RF training process,
therefore a second training architecture exploiting this feature is proposed in the chapter.
Similar to the task parallelism based architecture, the data parallelism based architecture
achieves a speed-up of up to two orders of magnitude when compared with GPP implementations.
The enhancements to the training framework extend the range of training problems to which the
proposed work can be applied, while the new architecture effectively takes advantage of the new
layer of parallelism and achieves acceleration of the training process.
Chapter 5
Incremental Training Architecture
The training architectures introduced in the previous chapters are built upon the conventional
Random Forest (RF) training algorithm, which requires access to the entire training dataset
throughout the training process. The maximum embedded memory space in mainstream FPGAs
stays around several tens of megabytes¹. The training architectures introduced in the previous
chapters require that the entire training dataset be stored in the FPGA. However, training
problems can often be too large for the embedded memory capacity of a modern FPGA, which
significantly limits the range of applications that the training architectures can be used for.
Hence it is desirable to enable the architecture to load and process the training dataset in
batches. To this end, in this chapter a training architecture is introduced that allows
incremental training of an RF model in an FPGA. The proposed architecture incorporates the
Hoeffding tree algorithm introduced in VFDT [20], which was originally designed for decision
tree (DT) training on streaming training data. In the new architecture, the idea behind the
Hoeffding tree is used to handle the training data arriving at the FPGA in batches. Note
that although, with the incremental approach, the training data does not have to be moved
to the FPGA in one go, the resulting RF model still reflects the statistics of the entire training
dataset. That means the training process does not forget any old data.
¹ 8MB in Altera Stratix V GX series and 54MB in Xilinx Virtex UltraScale+ series
Figure 5.1: Incremental training workflow in hardware
5.1 Incorporation of the Hoeffding Tree Algorithm in the RF Training
The Hoeffding tree algorithm makes it possible to train a decision tree without having access
to the entire training dataset. The same algorithm can be used to train the decision trees that
comprise the RF ensemble once the required randomness is added. Figure. 5.1 depicts the
DT training workflow incorporating the Hoeffding tree algorithm. Each decision tree is
trained on a bootstrap sample (sampling with replacement) of the original dataset.
The training process starts by sending the first batch of training data to the Hoeffding tree
training process, bypassing the predicting process since at this point no tree has been induced
yet. The size of the batch is decided based on the scaling of the system. A Hoeffding tree
is trained by following the standard DT training procedure with the sole exception of the
stopping criterion. The stopping criterion follows the Hoeffding bound measurement defined
in Section. 2.1.3 instead of full growth, in which splitting stops only when either a single
instance is left in the data partition or all entries left in the data partition have the same label.
Once the split is stopped, the partially grown tree (referred to as sub-tree in the figure) is
sent to the predicting process, where the rest of the training instances located in the root node
are traversed through the sub-tree to its leaf nodes.
[Figure: an FPGA containing the Training module, Predicting module, Results storage and External memory access module, connected to an SDRAM that holds the training dataset, the pointers to the training dataset and the head addresses of the pointers.]
Figure 5.2: Top-level architecture of incremental training architecture
Now consider each of the leaf nodes as a new root node, each holding a subset of the original
training dataset (datasets #1-4). For each new root node that holds training data (a node
that holds none is not split further), a new batch is taken from its subset and fed to the
Hoeffding tree training process, producing a new sub-tree. This is followed by a new round of
the predicting process, and so on. The cycled workflow continues until no leaf node can be
split further according to the Hoeffding bound measurement. The final decision tree is
constructed by linking the sub-trees together (this step is not done in the FPGA).
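The cycled workflow above can be mimicked in software. The following is a minimal sketch, not the hardware design: train_stump is a hypothetical stand-in for the Hoeffding tree training process (a single threshold on attribute 0), and treating a subset whose batch yields no split as a final leaf is an assumption made for brevity. The names incremental_train, fifo and subtrees are illustrative only.

```python
from collections import deque


def train_stump(batch):
    """Hypothetical stand-in for Hoeffding tree training: a one-node sub-tree
    that thresholds attribute 0 at the batch mean. Returns None when the batch
    offers no usable split (a single label, or identical attribute values)."""
    values = [x[0] for x, _ in batch]
    if len({y for _, y in batch}) < 2 or min(values) == max(values):
        return None
    return sum(values) / len(values)


def incremental_train(dataset, batch_size, train=train_stump):
    """Software model of the cycled workflow: batch -> train -> sort rest -> queue."""
    fifo = deque([(0, dataset)])   # (task id, subset waiting at a root node)
    subtrees = {}                  # task id -> (threshold, child task ids) or None
    next_id = 1
    while fifo:
        task, subset = fifo.popleft()
        # Only one batch is used for training; the rest is sorted afterwards.
        batch, rest = subset[:batch_size], subset[batch_size:]
        threshold = train(batch)
        if threshold is None:      # assumption: no split means a final leaf
            subtrees[task] = None
            continue
        # Predicting process: sort the remaining instances to the leaf nodes.
        parts = ([s for s in rest if s[0][0] < threshold],
                 [s for s in rest if s[0][0] >= threshold])
        children = []
        for part in parts:
            if part:               # empty leaf nodes are not split further
                fifo.append((next_id, part))
                children.append(next_id)
                next_id += 1
            else:
                children.append(None)
        subtrees[task] = (threshold, children)
    return subtrees


# 16 instances with one attribute each; the label is True when the value is >= 8.
data = [(((i * 7) % 16,), ((i * 7) % 16) >= 8) for i in range(16)]
print(incremental_train(data, batch_size=4))
```

Because every round consumes one batch from the subset it processes, the loop always terminates. Linking the recorded sub-trees through their child task ids corresponds to the final linking step, which the text performs outside the FPGA.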
5.2 Hardware Architecture
5.2.1 Overview
Following the workflow introduced in the previous section, the top-level hardware architecture is
shown in Figure. 5.2. The architecture consists of four main components. The training module
and the predicting module implement the corresponding Hoeffding tree training and predicting
processes in the workflow. An external memory access (EMA) module manages the data
transmission between the training and predicting modules and the external memory. The
training results from the training module are parsed and stored in the results storage module,
in which the links (paths) between different nodes are created in the form of a linked list.
Three parts of data are stored in the external memory. The first part is the original
training dataset. The second part is the pointers. Each pointer represents one training instance
in the training dataset and stores the starting address of that instance, since an instance may
occupy multiple memory words. Pointers are organised in groups. The training starts with
a single group of pointers representing the entire training dataset. As the training goes on, the
original group is divided into many smaller ones, each representing a subset of the original
training dataset that reaches one of the leaf nodes of a sub-tree. To mark the boundaries of
each group, its starting and ending addresses are explicitly recorded as the third part of the
data.
Throughout the incremental training process, the cycled workflow involves a number of rounds
of data transmission from and to the external memory. Each round contains five steps. Firstly,
the EMA module locates a group of pointers by loading its head address. Secondly, a specified
number of pointers, equal to the size of a batch, are read. Based on the addresses contained
in the pointers, the corresponding training data are read into the memory array in the
training module, in which a sub-tree is trained. After the training is completed, the third
step is performed, in which the pointers left in the group (if any) are read into the
predicting module in batches. For each batch the module sorts the pointers through the
sub-tree while reading the values of the specified attributes in the training dataset. Once
this is completed, the members of the group have reached different leaf nodes in the sub-tree,
so the group is divided into multiple smaller groups. In the fourth step, the pointer groups
that can be split further are written back to the external memory, waiting to be processed in
a following round. In the fifth and final step, the head addresses of the new pointer groups
are also written back to the external memory.
The training module is built upon the data parallelism based architecture introduced in Chapter 4,
with the Hoeffding bound based stopping criterion. A Hoeffding bound measurement is
implemented to determine whether the growth of a decision tree should be stopped. This
additional part is introduced in detail in the following sections, along with the results
storage module and the predicting module.
[Figure: the pointer memory divided into equally sized blocks (address ranges labelled Addr = 0 to n, n+1 to 2n, 2n+1 to 3n and 3n+1 to 4n); each block holds the pointers of one group (group 1, 2 or 3) together with either link info. or an End mark.]
Figure 5.3: Mapping of pointer groups in external memory
5.2.2 Mapping in External Memory and EMA
Both the training module and the predicting module access the external memory through the
external memory access module. Apart from being a driver that issues read and write requests to
the external memory controller, the EMA module also manages the read and write requests
from different modules throughout the training process according to the workflow depicted in
Figure. 5.1. As mentioned in Section. 5.2.1, three parts of data are stored in the external
memory, i.e. the training data, the pointer groups and the head address of each pointer group.
The training data are stored in the memory sequentially. Since an instance includes many
attributes, the mapping follows the pattern v_{1,1}, v_{1,2}, ..., v_{1,n}, v_{2,1}, v_{2,2}, ..., v_{m,n},
where m and n refer to the number of instances and the number of attributes respectively.
Similarly, the starting/ending address pairs of different groups are stored at consecutive
addresses in the memory. The mapping of the pointer groups, on the other hand, follows the
pattern illustrated in Figure. 5.3. The part of memory that stores the pointers is separated
into many blocks. Each block contains the same number of entries, equal to the number contained
in one batch of training data (as it appears in Figure. 5.1). The pointers stored in a block
always belong to the same group.
[Figure: the global best s(a) and global second best s(b) from the global comparator are subtracted, squared and shifted (<<); the value from ROM 1/R is multiplied by c; the two results are compared (cf.) to produce the Stopping flag.]
Figure 5.4: Hardware implementation of Hoeffding bound measurement
A group of pointers is stored in a number of blocks that are connected in the form of a linked
list (blocks with the same colour in Figure. 5.3). The last word of each block, except for the
last block, is reserved for link information, i.e. the starting address of the successor block.
In the last block of each group, an extra End mark is added after the last pointer. The mark is
used in the predicting module to recognise the end of a group. The mapping of the pointers
is designed to work with the way the predicting module sends pointers back to the external
memory. The pointers are processed in the module in batches. The pointers in each batch,
after being traversed through the sub-tree, fall into different groups. The final size of each
group is unknown until the entire predicting process is complete. A straightforward way to
map the pointers is to preallocate the maximum possible size for each group (i.e. the size of
the entire training dataset), which inevitably leaves a large part of the memory space unused.
On the other hand, grouping individual pointers entirely through a linked list avoids the waste
of memory space due to preallocation. However, since the destination of each pointer after each
batch is random, the potentially sparse mapping requires a large amount of link information to
be stored in the memory as well. In addition, when these pointers are read in the following
process, the mapping can cause frequent accesses to non-consecutive addresses in the memory,
which is undesirable. In comparison, the proposed mapping scheme can balance these two
arrangements by adjusting the size of the blocks.
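The block-based mapping can be illustrated with a small software model. This is a sketch under stated assumptions rather than the EMA implementation: the class name PointerStore, the End mark value -1 and the allocation of new blocks at the end of a flat word array are all hypothetical. What it follows from the text is that the last word of a full block holds the address of the successor block and that an End mark follows the last pointer of a group.

```python
END = -1  # hypothetical End mark; real pointers are non-negative addresses


class PointerStore:
    """Toy model of pointer groups stored as linked lists of fixed-size blocks."""

    def __init__(self, block_words):
        self.n = block_words   # words per block, the last one reserved for the link
        self.mem = []          # flat, word-addressed memory model
        self.head = {}         # group -> address of its first block
        self.tail = {}         # group -> next free word in its last block

    def _new_block(self):
        base = len(self.mem)   # blocks always start at a multiple of n
        self.mem.extend([None] * self.n)
        return base

    def append(self, group, pointer):
        if group not in self.head:
            self.head[group] = self.tail[group] = self._new_block()
        addr = self.tail[group]
        if addr % self.n == self.n - 1:   # only the link word is left
            new = self._new_block()
            self.mem[addr] = new          # link information: successor block
            addr = new
        self.mem[addr] = pointer
        self.tail[group] = addr + 1

    def close(self, group):
        self.mem[self.tail[group]] = END  # End mark after the last pointer

    def read(self, group):
        out, addr = [], self.head[group]
        while self.mem[addr] != END:
            if addr % self.n == self.n - 1:
                addr = self.mem[addr]     # follow the link to the next block
                continue
            out.append(self.mem[addr])
            addr += 1
        return out


store = PointerStore(block_words=4)
for group, pointer in [("a", 10), ("b", 20), ("a", 11), ("a", 12),
                       ("a", 13), ("b", 21), ("a", 14)]:
    store.append(group, pointer)
store.close("a")
store.close("b")
print(store.read("a"), store.read("b"))
```

In this model a larger block_words value means fewer link words and longer runs of consecutive addresses, at the cost of more unused words in the last block of each group, which is the trade-off the text describes.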
5.2.3 Hoeffding Bound Measurement
[Figure: a sub-tree spanning Level n and Level n+1, marking Node α and the left successor of Node α.]
Figure 5.5: Illustration of parsing training results
Recall that in Figure. 4.1, whether a new data partition can be split is determined by a stopping
criterion. In the Hoeffding tree training process, the stopping criterion is defined by the
Hoeffding bound measurement shown in Equation. 2.9. Its straightforward hardware
implementation is depicted in Figure. 5.4 and replaces the conventional stopping criterion
implementation in the split module. It takes as inputs the best and second best split
scheme measurements from the global comparator (depicted in Figure. 4.10). The
difference between the two measurements is squared and then shifted by one bit, which produces
the left hand side of the equation. On the other side of the equation, the squared R and ln(1/δ)
are pre-set constants; their product is referred to as input c in the figure. The value of
1/n is taken directly from the ROM 1/R (located in the split module), in which all
possible values of 1/n are pre-calculated and stored. The result of the comparison is a flag signal
informing the system whether the current data partition can be split. A data partition with
the flag unset is not pushed back to the cycled workflow, and the sub-tree stops growing when
none of the data partitions can be split any further.
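The comparison performed by this circuit can be sketched in software. Equation 2.9 is not reproduced in this chapter, so the sketch assumes it is the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) with a split allowed when the difference between the two measurements exceeds ε. Squaring both sides and multiplying by two gives 2(s(a) − s(b))² > c · (1/n) with c = R² ln(1/δ), which matches the square, one-bit shift and ROM lookup described above. The function names are hypothetical, and floating point is used where the hardware would use fixed-point arithmetic.

```python
import math


def make_split_check(value_range, delta, max_n):
    """Build a checker that mirrors the datapath of Figure 5.4 (assumed form)."""
    c = value_range ** 2 * math.log(1.0 / delta)   # pre-set constant c = R^2 ln(1/delta)
    # ROM holding every possible 1/n; index 0 is unused padding.
    inv_n_rom = [0.0] + [1.0 / n for n in range(1, max_n + 1)]

    def can_split(best, second_best, n):
        lhs = 2.0 * (best - second_best) ** 2      # square, then one-bit left shift
        rhs = c * inv_n_rom[n]                     # c multiplied by the ROM output
        return lhs > rhs

    return can_split


can_split = make_split_check(value_range=1.0, delta=1e-7, max_n=4096)
print(can_split(0.6, 0.5, 1000))   # gap of 0.1 observed over 1000 entries
print(can_split(0.6, 0.5, 100))    # same gap, but only 100 entries
```

Avoiding the square root and the division is presumably what makes this form attractive in hardware: only a subtractor, two multipliers, a shift and a comparator are required.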
5.2.4 Results Storage Module
The results storage module stores the sequence of training results from the training module.
Each result entry corresponds to a node in the sub-tree and contains a set of three
values: attribute, threshold and node type. The results do not contain information
regarding the links between different nodes in the sub-tree, so they cannot be used
directly by the predicting module. A parsing process is performed on the training results in
the module to create the necessary links in terms of memory addresses.
A sub-decision tree is shown in Figure. 5.5 to help explain the parsing process. The entries in
the results sequence arrive in breadth-first order, meaning that the results for the nodes on level
n start to arrive after all the nodes on level n − 1 have been received. On each level the results
arrive one after another, starting from the left-most node. To create the links, a fourth value,
offset, is added to each result entry during the parsing process, so that the memory location of
the left successor node of the current node is equal to current memory address + offset. The
right successor node, if it exists, is always stored immediately after the left one. An offset
value is determined by num_crntLv − num_rcv + num_nxtLv, where num_crntLv refers to the total
number of nodes expected on the current level (level n), num_rcv refers to the number of nodes
on the current level that have been received, and num_nxtLv refers to the number of nodes that
have been scheduled so far for the next level.
num_rcv increments whenever an entry arrives. Depending on the type of the arriving node,
num_nxtLv is updated in one of three ways. As shown in the figure, there are three
types of nodes in a sub-tree. A round-shaped node is a non-leaf node; it always splits into two
successor nodes on the next level, so num_nxtLv increments by two. A square-shaped
node is a local leaf node that produces no successor nodes, so num_nxtLv keeps its value.
A triangle-shaped node is a dummy node, created in the rare cases where no split scheme
can be found based on the selected attributes (the data points have the same value on the
selected attributes). A dummy node always produces one successor node, so num_nxtLv increments
by one. When num_rcv reaches num_crntLv, a whole level of results has been
received. The value of num_nxtLv is then copied to num_crntLv, after which num_nxtLv and
num_rcv are both reset to zero. For the example in the figure, when node α
is sent to the storage, num_crntLv, num_rcv and num_nxtLv are 5, 3 and 3 respectively, so its left
successor node can be reached with an offset value of 5.
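The counter updates can be traced with a short software model. This is a sketch only: the node type codes "N" (non-leaf), "L" (local leaf) and "D" (dummy) are hypothetical, and it assumes the offset is computed from the counter values before they are updated for the arriving entry, which is the reading that reproduces the worked example for node α.

```python
def parse_offsets(node_types):
    """Return the offset attached to each entry of a breadth-first result sequence.

    node_types is a sequence of 'N' (non-leaf), 'L' (local leaf) or 'D' (dummy).
    """
    successors = {"N": 2, "L": 0, "D": 1}
    num_crnt_lv, num_rcv, num_nxt_lv = 1, 0, 0   # level 0 holds only the root
    offsets = []
    for kind in node_types:
        # Offset uses the counters as they stand before this entry is counted.
        offsets.append(num_crnt_lv - num_rcv + num_nxt_lv)
        num_rcv += 1
        num_nxt_lv += successors[kind]
        if num_rcv == num_crnt_lv:               # a whole level has been received
            num_crnt_lv, num_rcv, num_nxt_lv = num_nxt_lv, 0, 0
    return offsets


# Levels 0 to 4 of an example sub-tree; the fourth node on level 3 plays node α.
levels = ["N", "NN", "NNDL", "NDLNL", "LLLLL"]
offsets = parse_offsets("".join(levels))
print(offsets[10])   # entry 10 is the fourth node on level 3
```

In this example the entry at address 10 receives offset 5, so its left successor is stored at address 15 and its right successor at address 16. Offsets are also produced for leaf entries here, but they are never followed.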
The implementation of the module is shown in Figure. 5.6. It consists of the counters
corresponding to the three variables introduced above. The flag signals Leaf/Non-leaf check and
Dummy node check are the inputs based on which the counter num_nxtLv increments. The offset value
produced is concatenated with the entry of the training results before being written to a cluster
of memory blocks. The blocks hold duplicated content and are accessed by the
predicting module in parallel. An extra data bus is used to export the training results to the
external memory.
[Figure: counters num_crntLv, num_nxtLv and num_rcv, driven by the Leaf/Non-leaf check and Dummy node check signals from the training module, feed an adder that produces Offset; the offset is combined with the entry of training results and written to RAM #1, RAM #2 and RAM #3, whose contents go to the predicting module and to the external memory.]
Figure 5.6: Hardware implementation of results storage module
5.2.5 Predicting Module
The predicting module sorts a partition of the training data through the sub-decision tree. It
takes three sets of inputs. Firstly, a group of pointers is imported from the external memory.
Each pointer indicates the head address of an entry in the external memory (i.e. the memory
location of the value corresponding to the first attribute of the instance) and acts
as an entry ID, informing the module which instance in the training dataset is being processed.
The second set of inputs is the training results, i.e. the attribute/threshold pairs that define
the sub-tree. Finally, as an instance moves down the tree through different nodes, the values
of the specified attributes are imported from the external memory and compared with the
thresholds to determine which node to visit next. Once all the entries are sorted, they have
reached different local leaf nodes in the sub-tree and form new partitions. Their corresponding
pointers, now separated into different partitions, are in turn sent back to the external memory,
waiting to be loaded by the training module in the following rounds.
The hardware implementation of the predicting module processes a batch of entries in parallel,
as depicted in Figure. 5.7. The architecture consists of a group of units, each able to
move a single instance down a sub-tree. The process for each batch starts by importing a
group of pointers through the Pointers import controller. The pointers are distributed to the
Pointer registers in the units through a shift register.
[Figure: parallel units (Unit #1 to Unit #n), each containing a Pointer reg., Results address generator, Availability check, Memory block, comparator (cf.) and Leaf ID reg.; a shared Pointers import controller, Values import controller and Pointers export controller connect the units to the external memory; each unit reads training results (attribute/threshold pairs) from its own RAM (RAM #1 to RAM #n) located in the results storage module; pointers and local leaf node IDs are passed between neighbouring units.]
Figure 5.7: Hardware implementation of predicting module
Then, starting from the root node, each unit
reads the first attribute/threshold pair from the results storage module by producing the read
address in the Results address generator. The attribute received is used as an offset to the head
address stored in the pointer register, together producing a Values read address for the Values
import controller. The controller in turn fetches the right value from the external memory
and writes it to the Memory block in the unit. The value of the specified attribute is then loaded
and compared with the threshold received. The result determines the next read address
produced by the results address generator. After a series of comparisons the instance reaches
its local leaf node, and the ID of the leaf node (in a leaf node the attribute/threshold pair is
replaced by an ID value) is written to a Leaf ID register. The leaf ID registers in different
units are connected together to form a shift register. Once all the entries in the batch have
reached their local leaf nodes, their pointers are ready to be exported back to the external
memory. The two shift registers containing pointers and leaf IDs send one pair of values at a
time to the Pointers export controller. Each pointer, depending on its leaf ID, is written to a
specified location in the external memory. The details of the mapping of the pointers in the
memory are given in Section. 5.2.2.
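What a single unit does for one instance can be summarised in a few lines of software. The sketch below makes several assumptions that the text does not settle: each stored entry is modelled as a hypothetical tuple (kind, attribute, threshold or leaf ID, offset), a value less than or equal to the threshold goes to the left successor, and a dummy node simply forwards the instance to its single successor at address + offset.

```python
def sort_instance(pointer, nodes, values):
    """Walk one training instance down a sub-tree and return its local leaf ID.

    pointer -- head address of the instance in `values`
    nodes   -- parsed results; nodes[i] = (kind, attribute, threshold_or_id, offset)
               with kind 'N' (non-leaf), 'L' (local leaf) or 'D' (dummy)
    values  -- flat model of the training data in external memory
    """
    addr = 0                                    # start from the root node
    while True:
        kind, attribute, payload, offset = nodes[addr]
        if kind == "L":
            return payload                      # leaf ID replaces the pair
        left = addr + offset                    # left successor address
        if kind == "D":
            addr = left                         # dummy node: single successor
        else:
            value = values[pointer + attribute] # head address + attribute offset
            addr = left if value <= payload else left + 1


# Sub-tree: root tests attribute 1 against 5; its right successor tests
# attribute 0 against 3. Offsets follow the breadth-first parsing rule.
nodes = [("N", 1, 5, 1), ("L", None, 0, 0), ("N", 0, 3, 1),
         ("L", None, 1, 0), ("L", None, 2, 0)]
values = [9, 4,  2, 7,  8, 6]                   # three instances, two attributes each
print([sort_instance(p, nodes, values) for p in (0, 2, 4)])
```

The returned leaf ID plays the role of the value written to the Leaf ID register; grouping the pointers by this ID gives the new partitions that are written back to the external memory.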
In each unit an Availability check component checks whether the required value
of a specified attribute has already been imported from the external memory. This is possible
because the data bus on the external memory side is wider than that on the FPGA local
side, so a single read can import the values of multiple attributes. The availability of
the required value is checked before a read request is issued to the values import controller; if
the value is already in the memory block, the unit proceeds to the comparison directly. Doing so
can potentially save part of the processing time spent on external memory access.
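The effect of the availability check can be illustrated with a small model that counts external reads. This is a sketch under assumptions: the class name ValueFetcher is hypothetical, the external memory is modelled as a list of wide words that each hold several attribute values, and the memory block is assumed to be cleared when a unit starts a new instance.

```python
class ValueFetcher:
    """Toy model of a unit's memory block with an availability check."""

    def __init__(self, external_words, values_per_word):
        self.ext = external_words     # list of wide words, each a list of values
        self.k = values_per_word      # attribute values carried by one wide word
        self.block = {}               # memory block: word address -> fetched word
        self.reads = 0                # number of external memory reads issued

    def start_instance(self):
        self.block.clear()            # assumption: block is reset per instance

    def value(self, address):
        word_addr, lane = divmod(address, self.k)
        if word_addr not in self.block:           # availability check fails
            self.block[word_addr] = self.ext[word_addr]
            self.reads += 1                       # one wide external read
        return self.block[word_addr][lane]


fetcher = ValueFetcher(external_words=[[0, 1, 2, 3], [4, 5, 6, 7]], values_per_word=4)
fetcher.start_instance()
print(fetcher.value(1), fetcher.value(3), fetcher.reads)   # second value needs no new read
print(fetcher.value(6), fetcher.reads)                     # different word, one more read
```

With 8-bit values and 256-bit external words, as in Table 5.1 and Figure 5.8, one read could carry up to 32 attribute values, so an instance with few attributes may need only a single external access per traversal.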
Table 5.1: Summary of architecture configuration
Module             Parameter                               Value
Overall            Data-width of training instances        8 bits
Overall            Number of attributes                    8
Overall            Number of labels                        2
Training module    Number of processing elements           1
Results storage    Max number of nodes allowed             2 × Number of training instances allowed
Results storage    Number of training results duplicates   1
Predicting module  Number of predicting units              1
Table 5.2: Hardware utilisation of training module
Max number of training instances  LUTs      Registers  Logic  Memory bits   Memory M9Ks  Memory M144Ks
256                               5022 (1)  5119 (1)   (2)    40192 (<1)    11 (<1)      -
512                               5677 (1)  5841 (1)   (2)    69376 (<1)    13 (1)       -
1024                              6313 (1)  6419 (2)   (2)    141312 (<1)   23 (2)       -
2048                              6967 (2)  7048 (2)   (2)    285440 (1)    25 (2)       1 (2)
4096                              7641 (2)  7706 (2)   (3)    574976 (3)    45 (4)       2 (3)
8192                              8416 (2)  8403 (2)   (3)    1157376 (5)   36 (3)       7 (11)
5.3 Evaluation
5.3.1 Scalability
The hardware resource utilisation reported in this section is based on the configuration
summarised in Table. 5.1. The resource utilisation of the training module with respect to the
maximum number of training instances allowed is given in Table. 5.2. The figures in
brackets are the utilisation as a percentage of the total capacity of the Altera EP4SGX530HF35C2
FPGA. The column Logic refers to the percentage of logic fabric utilised in the device. It
can be observed that, for the device under experiment, as the maximum number of training
instances increases, the total memory consumption (reaching 5%) grows faster than that of the
logic fabric (reaching 3%).
Table 5.3: Hardware utilisation of results storage
Max number of training instances  LUTs      Registers  Logic  Memory bits   Memory M9Ks  Memory M144Ks
256                               82 (<1)   45 (<1)    (<1)   16384 (<1)    2 (<1)       -
512                               91 (<1)   50 (<1)    (<1)   32768 (<1)    4 (<1)       -
1024                              98 (<1)   55 (<1)    (<1)   65536 (<1)    8 (<1)       -
2048                              105 (<1)  60 (<1)    (<1)   131072 (<1)   -            1 (2)
4096                              113 (<1)  65 (<1)    (<1)   262144 (1)    -            2 (3)
8192                              120 (<1)  70 (<1)    (<1)   524288 (2)    -            4 (6)
Table 5.4: Hardware utilisation of predicting module
                 LUTs      Registers  Logic  Memory bits  Memory ALUTs
Predicting unit  112 (<1)  1232 (<1)  (<1)   1024 (<1)    260 (3)
Total            48 (<1)   854 (<1)   (<1)   1024 (<1)    256 (3)
Table. 5.3 shows the resource utilisation of the results storage module with respect to the
maximum number of training instances allowed by the training module. The resource consumption
is dominated by the embedded memory resource. The memory space needed is proportional to
several factors, including the word-length of each result entry, the maximum number of nodes
(result entries) allowed and the number of predicting units instantiated in the predicting
module (since each predicting unit is connected to its own duplicate of the entire training
results). The word-length of each result entry in turn depends on the data-width of the training
instances, the number of attributes and the maximum number of nodes allowed.
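The proportionality can be checked against Table 5.3 with a short calculation. The 32-bit entry word-length used below is not stated in the text; it is inferred here by dividing the 16384 memory bits in the first row by the 2 × 256 nodes allowed under Table 5.1, so it should be read as an assumption. The function name is hypothetical.

```python
def results_storage_bits(max_instances, entry_bits=32, duplicates=1):
    """Memory bits needed by the results storage under the stated assumptions."""
    max_nodes = 2 * max_instances          # Table 5.1: 2 x training instances allowed
    return max_nodes * entry_bits * duplicates


for instances in (256, 512, 1024, 2048, 4096, 8192):
    print(instances, results_storage_bits(instances))
```

Under this assumption the computed figures match the Memory bits column of Table 5.3 for every row. Instantiating more predicting units would multiply them by the number of duplicates.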
In Table. 5.4 the resource utilisation of the predicting module and of a predicting unit is given.
Scaling of the predicting units is bounded by the external memory bandwidth. The number of
units is chosen so that the external memory bandwidth is saturated during the predicting
process, as illustrated in Figure. 5.8. The saturation is identified by counting, during a period
of 240ms, the number of reads performed by the Values import controller. The figures in
Figure. 5.8 are based on the assumption that during the predicting process each non-leaf node
passed by an input creates a value import request. Because a memory block storing pre-fetched
training values exists in each predicting unit, the actual number of units that leads
to saturation will be higher than the figures shown in Figure. 5.8. In practice, the extent to
which the actual number exceeds the experimental results is affected by several factors, including
the number of attributes in the training data, the number of attributes that can be stored in a
single external memory word and the structure of the decision tree classifier, and is therefore
highly case dependent.
[Figure: plot of the number of 256-bit words read (y-axis, 1000 to 11000) against the number of predicting units (x-axis, 1 to 10).]
Figure 5.8: External memory access with respect to the number of predicting units
The modules involved in the training architecture are scaled using the following rules. The
number of predicting units instantiated in the predicting module is determined first, so as to
saturate the external memory bandwidth during the predicting process. The same number of memory
blocks storing the duplicates of the training results is then needed. The rest of the hardware
resources are used for the training module. The size of a batch of training data allowed in
the module should be set as large as possible in order to increase the chance of a successful
split (since the Hoeffding bound threshold ε decreases monotonically as the number of entries
observed increases). Finally, the PEs in the training module are scaled up to consume the
remaining hardware resources.
Table 5.5: Properties of covertype dataset
Number of instances  Number of attributes  Number of labels
581k                 57                    7
Table 5.6: Comparison results for training speed of incremental architecture
scikit-learn (Python, CPU)  GPU (Titan)  GPU (C2075)  Proposed
463 s                       67 s         125 s        118 s
5.3.2 Training Speed
The training speed of the incremental training architecture is obtained from estimation using
the formula:
\[ t_{total} = \sum \underbrace{\Big( t_{trans}(\cdot) + t_{train}(\cdot) + \sum \underbrace{t_{predict}(\cdot)}_{\text{one batch}} + t_{w.ptr}(\cdot) \Big)}_{\text{one task}} \tag{5.1} \]
The total training time is the sum of the processing time of each task defined in
Section. 5.2.1. Each task contains several elements. t_trans(·) is the time spent on moving a
subset of the training dataset from the external memory to the embedded memory. t_train(·) is the
time consumed for growing a decision tree out of the subset.
When the decision tree is induced, it takes t_predict(·) to predict and assign the remaining
training instances in the task to one of the pointer groups, and t_w.ptr(·) to send the
assignment results back to the external memory. These two operations are pipelined, so most of
the external memory access (i.e. t_w.ptr) is hidden behind the predicting process.
The time consumed by these two operations therefore becomes Σ t_predict(·) + t_w.ptr(·), in
which only the t_w.ptr of the last batch is added. For a similar reason some elements are not
included in Equation. 5.1, such as the external memory access for reading pointers and for
reading/writing the task information: the time consumed by these operations is hidden, under the
assumption of using two external memory channels (as equipped on the FPGA board used in the
experiment).
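Equation 5.1 amounts to a nested sum, which the following sketch spells out. The dictionary keys and the function name are hypothetical, and the per-task timings would in practice come from cycle counts of the hardware modules, which are not reproduced here.

```python
def total_training_time(tasks):
    """Evaluate Equation 5.1 for a list of tasks.

    Each task is a dict with (hypothetical) keys:
      t_trans   -- time to move one batch from external to embedded memory
      t_train   -- time to grow the sub-tree on that batch
      t_predict -- list of per-batch predicting times for the remaining pointers
      t_w_ptr   -- pointer write-back time of the last batch only, since earlier
                   write-backs are hidden behind the pipelined predicting process
    """
    total = 0.0
    for task in tasks:
        total += (task["t_trans"] + task["t_train"]
                  + sum(task["t_predict"]) + task["t_w_ptr"])
    return total


tasks = [
    {"t_trans": 0.5, "t_train": 1.0, "t_predict": [0.25, 0.25], "t_w_ptr": 0.125},
    {"t_trans": 0.5, "t_train": 2.0, "t_predict": [], "t_w_ptr": 0.0},
]
print(total_training_time(tasks))
```

The second task in the example has no pointers left after its batch is taken, so it contributes only transfer and training time.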
The training speed is estimated using Equation. 5.1 for an Altera Stratix IV EP4SGX530HF35C2
FPGA at a working frequency of 200MHz, applied to the covertype dataset whose properties
are listed in Table. 5.5. The working frequency is achieved after optimising the circuit to
remove timing violations. The estimate is compared with the results reported
in [29], which were obtained from several implementations running on various CPUs
and GPUs. The training configuration in the proposed architecture follows the one described
in [29]. The comparison results are listed in Table. 5.6. The proposed architecture achieves
a training speed that is comparable to a modest 448-core GPU implementation (118 s against
125 s) and compares favourably with a Python-based implementation running on a 6-core 2.3GHz
CPU (463 s), but it takes about 76% longer than the state of the art 2688-core GPU (Nvidia
Titan, 67 s). It is worth noting that the FPGA under experiment is a relatively old product
(released in 2008) when compared with the Nvidia C2075 GPU (released in 2011) and the Nvidia
Titan series (released in 2013).
Figure 5.9: Typical static power consumption of Stratix IV EP4SE680 FPGA
In terms of the performance/power consumption ratio, the FPGA implementation has a great
advantage due to its relatively low power consumption. The total power consumption of the FPGA
is the sum of static and dynamic power consumption. Figure. 5.9 is
quoted from [53] and shows the typical static power consumption of an FPGA that is similar
to the one under test; the power consumption in the figure is for reference only. The FPGA
under test contains 531K logic elements. Assuming that the logic elements in the FPGA are
fully utilised and that the dynamic power consumption is twice the static power consumption, the
total power consumption is estimated at around 10W. By contrast, the Nvidia Titan
GPU has a power consumption of 250W and requires a minimum system power of 600W.
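As a rough illustration of the performance/power argument, the figures quoted above can be combined into an energy-per-training-run estimate. This is a back-of-the-envelope sketch, not a measurement: it multiplies the training times in Table 5.6 by the estimated 10W for the FPGA and the quoted 250W for the Titan GPU, and it ignores the host system and the external memory.

```python
def energy_joules(train_seconds, power_watts):
    """Energy of one training run, assuming constant power draw throughout."""
    return train_seconds * power_watts


fpga_energy = energy_joules(118, 10)     # proposed architecture, estimated power
titan_energy = energy_joules(67, 250)    # GPU (Titan), quoted board power
ratio = titan_energy / fpga_energy
print(fpga_energy, titan_energy, ratio)
```

Under these assumptions the FPGA run uses 1180 J against 16750 J for the GPU, roughly a fourteen-fold difference, even though the GPU finishes sooner.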
The following factors limit the training speed of the proposed architecture and cause it to be
outperformed by the GPU (Titan) implementation. First of all, the working frequency of the
FPGA under test is 200MHz, compared with the 836MHz at which the GPU (Titan) runs.
The 200MHz figure is achieved after a certain degree of hardware optimisation. It is possible to
improve the timing further by optimising the hardware at gate level; however, once the working
frequency exceeds 600MHz the training performance can no longer be improved. This
is because the on-chip memory in the FPGA under test cannot work at a frequency beyond
600MHz [10]. Since the training process is data intensive and highly data dependent, even if
the part not involving embedded memory access worked at a higher frequency in a separate
clock domain, the embedded memory access would become a system bottleneck, meaning that
the total training performance would not be effectively improved. Secondly, the FPGA architecture
is paired with DDR2 memory running at 400MHz, which offers a maximum theoretical
bandwidth of 102Gbps [50], while the Titan is equipped with GDDR5 memory running at 1502MHz,
which offers a maximum theoretical bandwidth of 2304Gbps [39]. Higher external memory
bandwidth would reduce t_trans as well as t_predict in Equation. 5.1. Finally, the training speed of
the FPGA architecture can be improved by allowing a larger batch of training data in the FPGA.
Increasing the size of each batch reduces the number of tasks in Equation. 5.1,
leading to a shorter training time. However, doing so requires more embedded memory resources
in the FPGA.
5.4 Conclusion
In this chapter the Hoeffding tree is incorporated into the data parallelism based training
architecture in order to handle training datasets that are too large to be stored in the FPGA
on-chip memory. For the test dataset, the resulting architecture outperforms the GPP
implementation and has comparable performance to a modest GPU implementation; however, its
training speed is not as fast as that of the implementation based on a state of the art GPU.
The performance of the proposed architecture is mainly limited by the dated external memory and
FPGA used for evaluation, as well as by a lower working frequency. It is therefore believed that
with state of the art FPGA and memory devices, the performance gap between the FPGA
implementation and the state of the art GPU implementation can be narrowed.
Chapter 6
Conclusion
6.1 Summary of Thesis Achievements
This work investigates the hardware acceleration of the Random Forest training process on an
FPGA platform and proposes a set of FPGA-based architectures that target different parallel
processing schemes inherent in the training algorithm. The architectures attempt to speed up
training from different angles. Firstly, the architectures take advantage of different
types of parallelism inherent in the RF training algorithm. The workload involved is
separated in various ways and distributed to parallel processing elements. Secondly, the parallel
workers as well as the top-level architecture are carefully designed to exploit the flexibility of
FPGA devices, achieving low latency and high throughput. When combined, the proposed
hardware optimisations lead to good performance in terms of training speed, and to comparable
accuracy, when compared with implementations based on other platforms. With the inherently low
power consumption of FPGA devices, the proposed architectures are particularly suitable for
embedded/portable applications.
All of the training architectures proposed in this work are built on an FPGA-based hardware
framework introduced in Chapter 3. The framework implements a workflow based on
the breadth-first training strategy and includes a set of fundamental hardware components
that are used throughout the proposed architectures. Also introduced in
Chapter 3 is a training architecture that adopts task parallelism as the strategy for splitting
and distributing the workload. The architecture demonstrates a speedup of up to two orders
of magnitude over a set of training tools running on a single-core 2.6 GHz CPU.
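The breadth-first training strategy mentioned above can be illustrated in software. The following is a minimal sketch only, not the hardware workflow of Chapter 3: it assumes an exhaustive Gini-based split search, omits bootstrap sampling and random attribute selection, and all function names are hypothetical. Its purpose is to show that every node at one depth is processed before any node at the next depth.

```python
from collections import Counter, deque


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, attribute, threshold) minimising weighted Gini, or None."""
    best = None
    for a in range(len(rows[0])):
        # Skip the largest distinct value so both sides are non-empty.
        for t in sorted({r[a] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def train_breadth_first(rows, labels, max_depth):
    """Grow one tree level by level; also return the depth of each visited node."""
    root = {"idx": list(range(len(rows))), "depth": 0}
    frontier = deque([root])  # FIFO queue gives breadth-first order
    visit_depths = []
    while frontier:
        node = frontier.popleft()
        visit_depths.append(node["depth"])
        xs = [rows[i] for i in node["idx"]]
        ys = [labels[i] for i in node["idx"]]
        split = None
        if node["depth"] < max_depth and len(set(ys)) > 1:
            split = best_split(xs, ys)
        if split is None:
            # Leaf: predict the majority class of the instances that reached it.
            node["label"] = Counter(ys).most_common(1)[0][0]
            continue
        _, a, t = split
        node["attr"], node["thr"] = a, t
        node["left"] = {"idx": [i for i in node["idx"] if rows[i][a] <= t],
                        "depth": node["depth"] + 1}
        node["right"] = {"idx": [i for i in node["idx"] if rows[i][a] > t],
                         "depth": node["depth"] + 1}
        frontier.extend([node["left"], node["right"]])
    return root, visit_depths


# Toy XOR-style dataset (hypothetical), two binary attributes.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 0]
tree, visit_depths = train_breadth_first(rows, labels, max_depth=3)
```

Because the frontier is a first-in first-out queue, the recorded depths never decrease, which is the property that lets all nodes of one level be handled in a single pass over the training data.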
The task-parallel design has the drawback that it cannot be applied to training
problems in which the parameter mtry is greater than one, where mtry is the number of attributes
randomly selected at each non-leaf node during training. Chapter 4 resolves this
limitation by introducing a second training architecture that adopts a vertical data parallelism
strategy. This architecture achieves the same level of speedup while allowing mtry to vary. In
addition, two enhancements are made to the architecture: support for more than two
class labels in the training data, and an optimisation scheme that improves the
efficiency of embedded memory utilisation.
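As a reminder of what the mtry parameter controls, the sketch below draws an independent random subset of mtry attribute indices for each non-leaf node. It is a software illustration under simple assumptions (uniform sampling without replacement, a hypothetical function name), not a description of the hardware random number generation used in the proposed architectures.

```python
import random


def sample_attributes(num_attributes, mtry, rng):
    """Return mtry distinct attribute indices chosen uniformly at random."""
    if not 1 <= mtry <= num_attributes:
        raise ValueError("mtry must be between 1 and num_attributes")
    return sorted(rng.sample(range(num_attributes), mtry))


rng = random.Random(0)  # fixed seed so the illustration is repeatable

# One independent draw per non-leaf node (four hypothetical nodes, ten attributes).
candidates_mtry3 = [sample_attributes(10, 3, rng) for _ in range(4)]

# With mtry = 1 each node evaluates a single attribute, which is the only
# setting the task-parallel design supports.
candidates_mtry1 = [sample_attributes(10, 1, rng) for _ in range(4)]
```

Only the attributes in a node's candidate list are evaluated when searching for that node's split.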
In Chapter 5 a third training architecture is proposed for training problems whose
datasets are too large to fit entirely in the embedded memory of the FPGA,
opening the way for the proposed system to tackle big data problems. The architecture treats
a large-scale training dataset as a source of streaming data and incorporates an incremental
training algorithm targeting streaming data to work around the limitation imposed by the small
embedded memory. The resulting architecture, implemented on a modest FPGA, is estimated to
achieve a training speed comparable to that of a modest 448-core GPU implementation
and compares favourably with a Python-based implementation running on a 6-core 2.3 GHz
CPU. In all cases the proposed architecture features a superior performance-to-power
ratio owing to the inherently low power consumption of the FPGA.
Returning to the three questions raised in Section 1.1, for the proposed training framework
it has become clear that all three layers of parallelism in the RF training process (DT-wise
parallelism, task parallelism and data parallelism) can be exploited in FPGA
implementations and can effectively speed up the training process. Of these, DT-wise
parallelism and data parallelism are more practical than task parallelism, because
the latter does not support an adjustable mtry parameter, which is commonly used in practice.
The three architectures proposed in this work try to answer the second question regarding the
optimal hardware architecture for the RF training process on an FPGA. In all cases, the proposed
architectures aim to distribute the workload to parallel workers by utilising the high on-chip
memory bandwidth. Once the workload can be distributed efficiently, the total training time
can be reduced effectively, as demonstrated by all three proposed training architectures. As
for the third question regarding the limiting factors in FPGA, it is believed that low working
frequency and limited hardware resource capacity are the major limiting factors. These two issues are
expected to be alleviated in future FPGA devices.
At present, designing FPGA-based implementations requires expertise in hardware circuit design,
which most people working in the data mining and machine learning communities do not have.
Given that good acceleration is achievable with GPU implementations, and that GPU
software programming is more flexible and requires less effort,
GPU implementations have drawn more attention than their FPGA counterparts. However,
FPGA implementations hold an advantage in power consumption, using around 60 times
less power than GPU implementations (taking the system power requirement of the GPU into account).
FPGA-based training architectures are therefore more suitable for embedded applications in
which high-speed RF training is needed.
6.2 Future Work
Future work on the FPGA-based RF training architecture covers three aspects: the
capability to handle extra-wide training datasets, the application of an embedded memory
optimisation scheme to the incremental training architecture, and the enhancement of the
functionality of the training architectures.
When instantiated on a modest FPGA, the proposed hardware training framework is capable
of handling datasets with several hundred attributes, which is sufficient for some of the
training problems found in various applications. However, it is common for some applications to
involve up to hundreds of thousands of attributes. One such application is in bioinformatics,
in which tens of thousands of different single nucleotide polymorphisms (SNPs) are used as
potential attributes in the training data. Accommodating so many attributes in the proposed
framework requires a significant reduction in the number of training instances allowed in the
dataset. It is therefore important to investigate potential solutions that enable the architecture
to process a large number of attributes efficiently.
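The trade-off between the number of attributes and the number of training instances can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, a dense layout in which every attribute value occupies a fixed number of bits and the whole dataset must fit in a fixed embedded memory budget. The function name and all numbers are hypothetical and are not taken from the evaluated hardware.

```python
def max_instances(memory_bits, num_attributes, bits_per_value):
    """Largest number of training instances that fits in the memory budget.

    Assumes a dense layout of num_attributes values per instance, each
    stored in bits_per_value bits, and ignores labels and bookkeeping.
    """
    if min(memory_bits, num_attributes, bits_per_value) <= 0:
        raise ValueError("all arguments must be positive")
    return memory_bits // (num_attributes * bits_per_value)


# Hypothetical budget: 16 Mbit of embedded memory, 16-bit attribute values.
BUDGET_BITS = 16 * 2**20

narrow = max_instances(BUDGET_BITS, 300, 16)     # several hundred attributes
wide = max_instances(BUDGET_BITS, 50_000, 16)    # SNP-scale attribute count
```

Under these assumed figures, moving from hundreds to tens of thousands of attributes shrinks the number of instances that fit on chip by more than two orders of magnitude, which is why a dedicated solution for extra-wide datasets is needed.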
The memory optimisation scheme proposed in Chapter 4 cannot be applied to the incremental
training architecture introduced in Chapter 5 under the current hardware design, since the real
values in the training data are replaced by their relative ranking numbers during the optimisation,
whereas the real values are needed by the predicting process during incremental training.
If the information needed by the predicting process could be extracted from the actual values
of the training data, the memory optimisation scheme would further alleviate the limitation of
the embedded memory in the FPGA.
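The conflict between rank-based storage and prediction can be seen in a small software example. The sketch below is an assumption-laden illustration, not the Chapter 4 scheme itself: it assumes equal values share one rank and that a per-attribute table of distinct values is retained, and the names are hypothetical. It shows that a split expressed in rank space partitions the training data exactly as the corresponding split in value space does, but that classifying a raw value that has never been ranked requires mapping the rank threshold back to a real value.

```python
def rank_transform(values):
    """Replace each value by its 0-based rank among the distinct sorted values.

    Returns the list of ranks and the sorted table of distinct values.
    """
    distinct = sorted(set(values))
    rank_of = {v: r for r, v in enumerate(distinct)}
    return [rank_of[v] for v in values], distinct


raw = [3.7, 0.2, 5.1, 0.2, 9.8]          # hypothetical values of one attribute
ranks, lookup = rank_transform(raw)       # ranks fit in far fewer bits than floats

# A hypothetical split learned on the ranks: "rank <= 1" goes left.
rank_threshold = 1
left_by_rank = [r <= rank_threshold for r in ranks]

# The same split expressed on real values needs the retained table.
value_threshold = lookup[rank_threshold]
left_by_value = [v <= value_threshold for v in raw]

# An incoming raw value has no rank, so it can only be routed
# through the tree by comparing it with the real-valued threshold.
unseen = 4.0
goes_left = unseen <= value_threshold
```

Retaining such a table is only one conceivable way of recovering the information the predicting process needs; whether it can be done within the embedded memory budget is the open question raised above.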
Finally, this work focuses on the RF training process for classification, which
is one of the core functions of the RF tool. RF has been developed into a powerful tool offering many
functionalities, such as regression and variable importance estimation. The training architectures
proposed in this work will be further enhanced to include these functionalities, making
them more competitive in terms of completeness when compared with software implementations.
Bibliography
[1] Nuno Amado, Joao Gama, and Fernando Silva. Exploiting parallelism in decision tree
induction. In Proceedings from the ECML/PKDD Workshop on Parallel and Distributed
computing for Machine Learning, pages 13–22, 2003.
[2] Davide Anguita, Andrea Boni, and Sandro Ridella. A digital architecture for support
vector machines: theory, algorithm, and FPGA implementation. Neural Networks, IEEE
Transactions on, 14(5):993–1009, 2003.
[3] Henrik Bostro¨m. Concurrent learning of large-scale random forests. In SCAI, volume 227,
pages 20–29, 2011.
[4] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[5] Leo Breiman. Out-of-bag estimation. Technical report, Citeseer, 1996.
[6] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[7] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification
and regression trees. CRC press, 1984.
[8] Mihai Budiu, Jamie Shotton, Derek G Murray, and Mark Finocchio. Parallelizing the
training of the kinect body parts labeling algorithm. Big Learning: Algorithms, Systems
and Tools for Learning at Scale, pages 1–6, 2011.
[9] Grigorios Chrysos, Panagiotis Dagritzikos, Ioannis Papaefstathiou, and Apostolos Dollas.
HC-CART: a parallel system implementation of data mining classification and regression
87
88 BIBLIOGRAPHY
tree (cart) algorithm on a multi-fpga system. ACM Transactions on Architecture and Code
Optimization (TACO), 9(4):47, 2013.
[10] Altera Corporation. Trimatrix embedded memory blocks in stratix iv de-
vices. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/
literature/hb/stratix-iv/stx4_siv51003.pdf. Accessed: 01-Jul-2015.
[11] A Criminisi, J Shotton, and E Konukoglu. Decision forests for classification, regression,
density estimation, manifold learning and semi-supervised learning. Microsoft Research
Cambridge, Tech. Rep. MSRTR-2011-114, 5(6):12, 2011.
[12] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT op-
timistic decision tree construction. In ACM SIGMOD Record, volume 28, pages 169–180.
ACM, 1999.
[13] Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RainForest a framework
for fast decision tree construction of large datasets. In VLDB, volume 98, pages 416–427,
1998.
[14] Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Variable selection using
random forests. Pattern Recognition Letters, 31(14):2225–2236, 2010.
[15] Benjamin A Goldstein, Alan E Hubbard, Adele Cutler, and Lisa F Barcellos. An applica-
tion of random forests to a genome-wide association dataset: methodological considerations
& new findings. BMC genetics, 11(1):49, 2010.
[16] H˚akan Grahn, Niklas Lavesson, Mikael Hellborg Lapajne, and Daniel Slat. Cudarf: A
CUDA-based implementation of random forests. In Computer Systems and Applications
(AICCSA), 2011 9th IEEE/ACS International Conference on, pages 95–101. IEEE, 2011.
[17] David Heath, Simon Kasif, and Steven Salzberg. Induction of oblique decision trees. In
IJCAI, pages 1002–1007. Citeseer, 1993.
[18] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal
of the American statistical association, 58(301):13–30, 1963.
BIBLIOGRAPHY 89
[19] Kurt Hornik. R FAQ. https://CRAN.R-project.org/doc/FAQ/R-FAQ.html, 2015. Ac-
cessed: 01-Jul-2015.
[20] Geoff Hulten, Pedro Domingos, and Laurie Spencer. Mining massive data streams. The
Journal of Machine Learning Research, 2005.
[21] Andreas Ittner and Michael Schlosser. Non-linear decision trees-NDT. In ICML, pages
252–257. Citeseer, 1996.
[22] Karl Jansson, Hakan Sundell, and Henrik Bostrom. gpuRF and gpuERT: Efficient and
scalable gpu algorithms for decision tree ensembles. In Parallel & Distributed Processing
Symposium Workshops (IPDPSW), 2014 IEEE International, pages 1612–1621. IEEE,
2014.
[23] Ruoming Jin and Gagan Agrawal. Communication and memory efficient parallel decision
tree construction. In SDM, pages 119–129. SIAM, 2003.
[24] Mahesh V Joshi, George Karypis, and Vipin Kumar. ScalParC: A new scalable and efficient
parallel classification algorithm for mining large datasets. In Parallel processing symposium,
1998. IPPS/SPDP 1998. proceedings of the first merged international... and symposium on
parallel and distributed processing 1998, pages 573–579. IEEE, 1998.
[25] Dirk Koch and Jim Torresen. Fpgasort: A high performance sorting architecture exploiting
run-time reconfiguration on FPGAs for large problem sorting. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate arrays, pages 45–54.
ACM, 2011.
[26] Abbas Z Kouzani. Subcellular localisation of proteins in fluorescent microscope images
using a random forest. In Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on
Computational Intelligence). IEEE International Joint Conference on, pages 3926–3932.
IEEE, 2008.
[27] Richard Kufrin. Decision trees on parallel processors. Machine Intelligence and Pattern
Recognition, 20:279–306, 1997.
90 BIBLIOGRAPHY
[28] Delon Levi. Hereboy: A fast evolutionary algorithm. In Evolvable Hardware, 2000. Pro-
ceedings. The Second NASA/DoD Workshop on, pages 17–24. IEEE, 2000.
[29] Yisheng Liao, Alex Rubinsteyn, Russell Power, and Jinyang Li. Learning random forests
on the GPU. Department of Computer Science, New York University, 2013.
[30] Andy Liaw and Matthew Wiener. Package ’randomForest’, 2015.
[31] M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.
University of California, Irvine, School of Information and Computer Sciences.
[32] Kathryn L Lunetta, L Brooke Hayward, Jonathan Segal, and Paul Van Eerdewegh. Screen-
ing large-scale association study data: exploiting interactions using random forests. BMC
genetics, 5(1):32, 2004.
[33] Rui Marcelino, Hora´cio Neto, and Joao MP Cardoso. Sorting units for FPGA-based
embedded systems. In Distributed Embedded Systems: Design, Middleware and Resources,
pages 11–22. Springer, 2008.
[34] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable classifier for
data mining. In Advances in Database TechnologyEDBT’96, pages 18–32, 1996.
[35] Jean-Michel Muller, Nicolas Brisebarre, Florent De Dinechin, Claude-Pierre Jeannerod,
Vincent Lefevre, Guillaume Melquiond, Nathalie Revol, Damien Stehle´, and Serge Torres.
Handbook of floating-point arithmetic. Springer Science & Business Media, 2009.
[36] Sreerama K. Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique
decision trees. Journal of artificial intelligence research, 1994.
[37] Ramanathan Narayanan, Daniel Honbo, Gokhan Memik, Alok Choudhary, and Joseph
Zambreno. An fpga implementation of decision tree classification. In Design, Automation
& Test in Europe Conference & Exhibition, 2007. DATE’07, pages 1–6. IEEE, 2007.
[38] Aziz Nasridinov, Yangsun Lee, and Young-Ho Park. Decision tree construction on GPU:
ubiquitous parallel computing approach. Computing, 96(5):403–413, 2014.
BIBLIOGRAPHY 91
[39] Nvidia. Geforce gtx titan specifications. http://www.geforce.co.uk/hardware/
desktop-gpus/geforce-gtx-titan/specifications. Accessed: 01-Jul-2015.
[40] Amos R Omondi and Jagath Chandana Rajapakse. FPGA implementations of neural
networks, volume 365. Springer, 2006.
[41] Fabian Pedregosa, Gae¨l Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg,
et al. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research,
12:2825–2830, 2011.
[42] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
[43] Sanjay Ranka and V Singh. CLOUDS: A decision tree classifier for large datasets. Knowl-
edge discovery and data mining, pages 2–8, 1998.
[44] Daniel F Schwarz, Inke R Ko¨nig, and Andreas Ziegler. On safari to random jungle: a fast
implementation of random forests for high-dimensional data. Bioinformatics, 26(14):1752–
1758, 2010.
[45] John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier
for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544–555, 1996.
[46] Toby Sharp. Implementing decision trees and forests on a GPU. In Computer Vision–
ECCV 2008, pages 595–608. Springer, 2008.
[47] Bram Slabbinck, Bernard De Baets, Peter Dawyndt, and Paul De Vos. Towards large-scale
FAME-based bacterial species identification using machine learning techniques. Systematic
and Applied Microbiology, 32(3):163–176, 2009.
[48] Mahesh K Sreenivas, Khaled Alsabti, and Sanjay Ranka. Parallel out-of-core divide-and-
conquer techniques with application to classification trees. In Parallel Processing, 1999.
13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999
IPPS/SPDP. Proceedings, pages 555–562. IEEE, 1999.
92 BIBLIOGRAPHY
[49] Rastislav JR Struharik and Ladislav A Novak. Evolving decision trees in hardware. Journal
of Circuits, Systems, and Computers, 18(06):1033–1060, 2009.
[50] Terasic Technologies. DE4 FPGA Development Board User Manual, v1.2 edition, 2012.
[51] Ah Chung Tsoi and RA Pearson. Comparison of three classification techniques: CART,
C4. 5 and multi-layer perceptrons. In Advances in neural information processing systems,
pages 963–969, 1991.
[52] Brian Van Essen, Chris Macaraeg, Maya Gokhale, and Ryan Prenger. Accelerating a
random forest classifier: Multi-core, GP-GPU, or FPGA? In Field-Programmable Cus-
tom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on,
pages 232–239. IEEE, 2012.
[53] Seyi Verma. 40-nm fpga power management and advantages. Technical report, Altera
Corporation, 2008.
[54] Yaya Xie, Xiu Li, EWT Ngai, and Weiyun Ying. Customer churn prediction using improved
balanced random forests. Expert Systems with Applications, 36(3):5445–5449, 2009.
[55] Weiyun Ying, Xiu Li, Yaya Xie, and Ellis Johnson. Preventing customer churn by using
random forests modeling. In Information Reuse and Integration, 2008. IRI 2008. IEEE
International Conference on, pages 429–434. IEEE, 2008.
