Speculative Parallel Evaluation Of Classification Trees On GPGPU Compute
  Engines by Spencer, Jason
Jason Spencer ∗
August 6, 2018
Abstract
We examine the problem of optimizing classification tree evaluation for on-line and real-time appli-
cations by using GPUs. Looking at trees with continuous attributes often used in image segmentation,
we first put the existing algorithms for serial and data-parallel evaluation on solid footings. We then
introduce a speculative parallel algorithm designed for single instruction, multiple data (SIMD) architec-
tures commonly found in GPUs. A theoretical analysis shows how the run times of data and speculative
decompositions compare assuming independent processors. To compare the algorithms in the SIMD
environment, we implement both on a CUDA 2.0 architecture machine and compare timings to a serial
CPU implementation. Various optimizations and their effects are discussed, and results are given for all
algorithms. Our specific tests show a speculative algorithm improves run time by 25% compared to a
data decomposition.
keywords: Classification Trees, Decision Tree Evaluation, Parallel Algorithms, GPU Computing,
Speculative Decomposition, Optimization, Image Segmentation.
1. Introduction
Classification trees are used to solve problems in ar-
eas as diverse as target marketing, fraud detection,
pattern recognition, computer vision, and medical
diagnosis. In many applications, classification trees
are carefully designed once but then applied to many
data sets to provide automated classifications. This
approach is used to create validated classifiers for tis-
sue classification in mammography [12] and intravas-
cular ultrasound [11] diagnostic procedures. While
training the classifier is done offline, tree evaluation of
each patient’s data in these applications is an on-line
algorithm where a user waits for a classification to be
performed on many, many samples. Time spent wait-
ing for this evaluation consumes valuable procedure
room equipment and personnel. Performance require-
ments only increase when single images are replaced
by moving video for computer vision applications, as
in robotic navigation [1]. In this environment, many
classifications are needed in real time to compute and effect a timely response. Thus the need for
high-performance on-line evaluation of classification trees ranges from beneficial to absolutely necessary.
The assignment of a class to a given sample from
a dataset requires that the sample be evaluated at
each decision point along its path from the root of
the tree to its eventual terminal leaf. While it may
seem that each decision must be made in series for
that sample, we note that each sample’s classifica-
tion is independent of all other samples. This allows
us to decompose the problem of classifying all sam-
ples in a dataset into the independent problems of
classifying each sample, which can be done in par-
allel. This decomposition according to sample data
(a data decomposition approach) makes a growing
number of parallel computing architectures available
to speed up tree evaluation.
There is a good deal of literature on parallelization of training algorithms used to create
classification trees [2, 6, 7, 10, 14, 18] in a traditional parallel processing setting. Research on
the tree evaluation problem, however, seems to focus on Graphics Processing Units (GPUs) as the
implementation platform. GPUs are typically designed specifically for data parallel applications. As
inexpensive, commodity hardware found on every standard PC, GPUs match the cost, size, and power
requirements of the on-line tree evaluation problem setting more closely than traditional
supercomputers. Such application of graphics hardware to generic problems has become known as
General Purpose GPU (GPGPU) computing.
∗ School of Computing and Digital Media, DePaul University, Chicago, IL, USA; email: jspenc14@cdm.depaul.edu
arXiv:1111.1373v1 [cs.DC] 6 Nov 2011
An early expedition into GPGPU techniques for
machine learning can be found in [16], but application
to tree evaluation was first proposed by Sharp in [15].
His framework stores the tree as an array of nodes
containing the decision criteria of that node and an
index used to locate the next node. Subsequent node
indices are computed without conditional branches
to avoid their heavy performance penalties on most
GPUs. The tree definition is passed to the GPU as
a texture map used by a custom pixel shader. The
shader consumes input feature data and combines it
with the texture to produce a final value, the assigned
class, for each pixel in parallel. Sharp extends this
to evaluate random forests by concatenating multi-
ple tree structures in the texture data and iterating
over all trees. Results show a speedup of roughly two
orders of magnitude over host-based algorithms.
In [1], Baumstarck also uses a data parallel ap-
proach on GPUs for a computer vision application,
available in [5]. The implementation is done di-
rectly on the Compute Unified Device Architecture
(CUDA) platform offered by NVIDIA Corporation [3]
without using graphics libraries. Though condition-
als are used in the tree traversal, Baumstarck reports
a fifty-fold speedup of forest evaluation.
In this paper, we investigate a speculative ap-
proach to tree evaluation on massively parallel GPU
architectures, namely CUDA. Rather than treating
the full evaluation of one sample as the atomic paral-
lel task, we parallelize the evaluation of each node in
the tree for a single sample then reduce the resulting
path through the tree in parallel. This approach has
some performance benefits on architectures where ex-
ecution of parallel processors is not independent, as
in SIMD machines. We compare this approach to the
data decomposition used in previous work and to the
best-known serial host algorithm, both of which we
restate here so that all approaches are put on a solid
footing. In the specific environment we examine, re-
sults for speculative decomposition show a 25% per-
formance improvement over data decomposition. We
also see that host memory bandwidth and data distribution are important measurement considerations
that can dominate the nuances of GPU performance gains in typical PC systems, and must be accounted
for in any statement of speedup results.
2. Preliminaries
2.1 Classification Trees
In evaluating a classification tree, we are given a set
of records, called the dataset, and a full binary deci-
sion tree, called the classifier. Each record in the
dataset contains several fields, called attributes or
features. One of the attributes, the classifying at-
tribute, indicates to which class the record belongs
and is unknown. In the general case, attributes can
be continuous, having (real) numerical values from
an ordered domain, or categorical, representing values
from an unordered set. The classifier is a predictive
model created through a process known as training.
In training, observations on a training set of records,
each having a known classifying attribute, are used to
build a tree such that each interior, or decision, node
uses a single attribute value test to partition the set
of records recursively until the subset of records at a given node has a uniform class. Such nodes
are encoded in the tree as leaf nodes. The evaluation of a
dataset is complete when the trained classifier is used
to determine to which leaf, and thereby which class,
each record belongs.
There are several training algorithms for exam-
ining attributes and generating trees. The particu-
lar algorithm used will not concern us here, so long
as the resulting tree has the above properties. We
examine trees where all attributes are continuous, a
common occurrence in image segmentation. While
we will look at real-valued attributes (approximated
with floating point numbers), ordered discrete val-
ues would behave very much the same. Categorical
attributes, though, would likely require some modifi-
cations to our approach. We will further assume that
class values can be enumerated and put into one-to-
one correspondence with the natural numbers. Evalu-
ation will operate only on numbers, and any mapping
to another representation for class values (e.g. to de-
scriptive strings or pixel values) will be done outside
the evaluation process.
2.2 CUDA GPUs
GPGPU computing has grown in popularity in recent
years as a technique for improving performance for
massively parallel applications, especially where visu-
alization and images are concerned. Initially, generic
parallel computing was achieved on GPUs by cleverly
mapping the processing into the graphics domain us-
ing libraries such as OpenGL to perform primitive
tasks. As demand for customized graphics process-
ing grew, vendors began supporting domain-specific
programming languages like GL Shading Language
(GLSL), making the GPU’s floating point units more
available.
In recent years, GPGPU computing frameworks
have made great strides in removing assumptions
about the domain and providing a generic capabil-
ity to be used in any application needing massive
parallelization. Perhaps the leading such framework,
NVIDIA’s CUDA architecture, can add tens or hun-
dreds of GigaFLOPs to a system’s capability on a
single adapter card.
This power can be brought to bear on generic
problems with great ease of use. The program-
ming environments for these devices, whether vendor-
specific or the industry standard OpenCL, can be
used with no reference to the graphics domain. These
environments subset the C/C++ programming language (see footnote 1) and provide a set of keyword extensions to
manage the generation of both device-specific code
and host code from the same source file set. In this
way, code written to run on the GPU, called a kernel,
is invoked with something that feels very akin to a C
function call.
2.2.1 CUDA Programming Model
The CUDA runtime executes kernels across many
threads, or individual streams of instructions (usu-
ally for a single atomic parallel task), and manages
the mechanics of scheduling in hardware. Threads are
grouped into blocks as 1, 2, or 3 dimensional arrays
with each thread having a unique identifying index
in each dimension of the block. Further, blocks are
grouped into a 1 or 2 dimensional grid, with each
block again having an identifying index in the grid
dimensions. Each kernel invocation is done over a
single grid and gives the grid and block dimensions
to use when launched. Threads within a block are
allowed to synchronize and share memory, but no
communication between blocks is allowed. Threads
are scheduled and executed in 32-thread units called warps, with some operations happening on a
half-warp, or 16 threads. All threads have access to their
local memory (registers and stack), the shared mem-
ory of their block, and a global memory common to
the entire device. The host can read from and write
data to global memory but not local or shared mem-
ory. The host is required to copy kernel input and
output data to and from device global memory out-
side of the kernel execution.
A simple example helps to illustrate a typical ker-
nel invocation. First, the host CPU copies the in-
put data to the GPU device’s global memory. Since
the host and device address spaces are separate, the
CUDA runtime provides the host with APIs to allo-
cate storage in device space, copy memory between
spaces, look up device space symbol addresses, etc.
The host must also allocate device global memory to
store the results of the computation. The host can
then invoke the CUDA runtime to launch the kernel
with certain grid and block dimensions. Arguments
such as the input and output buffers in device space
are passed in the invocation. The device allocates
execution resources to the kernel grid and schedules
threads to execute in warps. Each thread uses its
block and thread indices to identify its associated
portions of input and output data. It can then do
thread-specific memory transfer to its own stack and
registers. Once the input data is locally available,
computation is done and output is stored in device
global memory. When all threads have completed,
the host is signaled and is then free to copy the results
from device to host memory and deallocate buffers.
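The way each thread locates its own portion of the data can be illustrated on the host. The following Python sketch only emulates a one-dimensional grid of one-dimensional blocks with nested loops; it is not CUDA code. The names `global_index` and `emulate_launch`, the doubling computation, and the bounds check for a partly used last block are all assumptions made for illustration.

```python
def global_index(block_index, block_dim, thread_index):
    """Map a (block, thread) pair in a 1-D grid of 1-D blocks to a flat data index."""
    return block_index * block_dim + thread_index


def emulate_launch(data, block_dim):
    """Serially emulate one 'thread' per element; each writes only its own output slot."""
    grid_dim = (len(data) + block_dim - 1) // block_dim  # enough blocks to cover the data
    out = [None] * len(data)
    for block_index in range(grid_dim):
        for thread_index in range(block_dim):
            idx = global_index(block_index, block_dim, thread_index)
            if idx < len(data):            # threads past the end of the data do nothing
                out[idx] = data[idx] * 2   # stand-in for the per-thread computation
    return out
```

On a real device the two loops disappear: every (block, thread) pair runs the loop body concurrently, which is why each thread must derive its index from its own coordinates.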
2.2.2 CUDA Hardware Architecture
While an extensive discussion of CUDA architecture is beyond the scope of this paper, some of the
algorithm designs given herein are driven by certain qualities which bear discussion. The fundamental
execution units of a CUDA device, called stream processors and known as cores, are arranged in N-way
SIMD groups for some implementation-dependent N (usually 8, 32, or 48). These groups are combined
with special function units (SFUs), instruction cache/decode logic, a register file, L1 cache/shared
memory, (usually 2) warp schedulers, and a network interconnect to form a streaming multiprocessor,
or SM (Figure 1). All threads in a block will be executed on the same SM, scheduled very efficiently
by the hardware warp schedulers. When a warp is scheduled, all threads in that warp execute the same
instruction, but have their own registers and stack. When some threads take conditional branches
different from other threads, the warp executes the two paths in series until the paths merge. This
is known as a divergent path, and can affect the kernel’s performance substantially.
1 NVIDIA represents that CUDA is an extension to ANSI C, but recent versions also allow for the use of classes.
When a warp encounters a long-latency instruc-
tion (such as global memory access), it can be
swapped for another warp in a small number of
clocks. There is a limit to this capability, however,
and the SM can only have so many blocks and threads
resident at a time. This concept is known as occu-
pancy, and can also affect the kernel’s performance.
Low occupancy means an SM has nothing to do dur-
ing long latency instructions, so the SM is not fully
utilized.
Figure 1: Streaming Multiprocessor detail (NVIDIA
Corporation)
Finally, accessing global memory from an SM is
an expensive operation, typically 100 times the cost
of accessing local memory. In some CUDA implemen-
tations, accesses to global memory that meet certain
requirements (such as contiguous access of 32, 64, or
128 bytes made in order by each core) can be coalesced
into a single read, improving throughput. Later ver-
sions of CUDA hardware add L1 and even L2 cache,
which mitigates the cost of non-coalesced reads.
See [3, 4, 8, 13] for a more complete and detailed
overview of the CUDA architecture.
3. Classification Tree Algorithms
It is natural to imagine an algorithm for evaluating a
decision tree using a binary tree data structure and
a depth-first traversal which, at each node, uses a
conditional to evaluate whether the traversal should
follow the left or right child of the node. Condi-
tional statements, however, present problems for tra-
ditional CPUs (in the form of branch misprediction
and pipeline flush) and GPUs (in the form of serialized divergent paths for SIMD warp execution).
Sharp avoids this problem in [15] by developing a
branchless tree traversal, which we will adopt for the
base serial evaluation algorithm. A host implemen-
tation of this algorithm, as the best known serial al-
gorithm, will be the reference by which speedup of
parallel algorithms is determined.
3.1 Branchless Tree Evaluation
The evaluation problem can be stated as follows: given a dataset
$D = \{R : R = (r_1, \ldots, r_A),\ r_a \in \mathbb{R}\}$ with $|D| = M$ and a full binary
classification tree $\tau$ with a set of nodes
$\mathcal{N} = \{n : n = (a_n, t_n, d^r_n, d^l_n, c_n)\}$ where:
• $|\mathcal{N}| = N$ is the number of nodes in $\tau$
• $1 \le a_n \le A$ is the index of attribute $r_{a_n}$ in each record $R$ to be evaluated by node $n$
• $t_n \in \mathbb{R}$ is the threshold for attribute $r_{a_n}$ used by node $n$
• $d^l_n \in \{\mathcal{N} \cup \emptyset\}$ is $n$’s left descendant and recursively evaluates $R$ when $r_{a_n} \le t_n$
• $d^r_n \in \{\mathcal{N} \cup \emptyset\}$ is $n$’s right descendant and recursively evaluates $R$ when $r_{a_n} > t_n$
• $c_n \in \{C \cup \bot : C \subset \mathbb{N}$ is the set of possible class values$\}$ is $\bot$ when $(d^r_n \ne \emptyset \wedge d^l_n \ne \emptyset)$ or some $c \in C$ when $(d^r_n = \emptyset \wedge d^l_n = \emptyset)$
and having a root node $n_0$, assign to each $R \in D$ a $c_R \in C$ by recursively evaluating $R$ beginning at $n_0$.
Procedure 1 Breadth-first Encoding of Tree
    breadthFirstTree = [ ]
    Q = queue()
    push(Q, n_0)
    i = 0
    childIndex = 1
    while Q not empty do
        n = pop(Q)
        node.attributeIndex = a_n
        node.threshold = t_n
        node.classVal = c_n
        node.childIndex = childIndex
        breadthFirstTree[i] = node
        i = i + 1
        if d^l_n ≠ ∅ then
            push(Q, d^l_n)
            childIndex = childIndex + 1
        if d^r_n ≠ ∅ then
            push(Q, d^r_n)
            childIndex = childIndex + 1
Procedure 2 Serial Tree Evaluation
    Parameter: D
    Parameter: breadthFirstTree[N]
    Output: assignedClasses[|D|]
    for all R ∈ D do
        i = 0
        while breadthFirstTree[i].classVal = ⊥ do
            a = breadthFirstTree[i].attributeIndex
            t = breadthFirstTree[i].threshold
            i = breadthFirstTree[i].childIndex + (r_a > t)
        c_R = breadthFirstTree[i].classVal
        assignedClasses[R] = c_R
To evaluate τ without branching, we first encode the nodes of τ in a breadth-first array. Procedure 1
shows how each node is assigned an index i in the array breadthFirstTree to create a data structure
describing the tree. Note that every right child has an index that is one more than the neighboring
left child. Each node, then, need only store the index of its left child. To compute the index of the
next node to evaluate, the node compares its attribute value $r_{a_n}$ against its threshold $t_n$
using the Boolean predicate “greater-than.” If the result is false and encoded as 0, adding the result
to the node’s child index will yield the index of its left child, as desired. If the result is true
and encoded as 1, adding it to the child index will yield the node’s right child’s index. While not
strictly branchless due to the while loop, this technique does avoid any explicit conditional to
compute the path to take at each decision node. The general algorithm is shown in Procedure 2.
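The encoding and the branchless walk can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation measured in this paper: the `TreeNode` and `FlatNode` types and both function names are hypothetical, attribute indices are 0-based rather than 1-based, and `None` stands in for the class value ⊥.

```python
from collections import deque, namedtuple

# Hypothetical in-memory tree node: leaves have left = right = None and a class value.
TreeNode = namedtuple("TreeNode", "attr threshold left right cls")
# Flat record stored in the breadth-first array (cls None plays the role of ⊥).
FlatNode = namedtuple("FlatNode", "attr threshold child cls")


def encode_breadth_first(root):
    """Lay nodes out in breadth-first order, storing the index of each left child."""
    flat, queue, child_index = [], deque([root]), 1
    while queue:
        n = queue.popleft()
        flat.append(FlatNode(n.attr, n.threshold, child_index, n.cls))
        for child in (n.left, n.right):
            if child is not None:
                queue.append(child)
                child_index += 1   # the next enqueued node lands at this array index
    return flat


def classify(flat, record):
    """Walk one record to a leaf; the comparison result (0 or 1) selects the child."""
    i = 0
    while flat[i].cls is None:
        i = flat[i].child + (record[flat[i].attr] > flat[i].threshold)
    return flat[i].cls
```

Because the children of a node are enqueued back to back, the right child always sits one slot after the left child, which is what lets the comparison result be added directly to the stored index.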
3.2 Data Decomposition
Procedure 2 is parallelized by data decomposition almost trivially, since each record is independent
of the others. We can simply assign m records to each of the P processors and have each loop only over
its own m records. The only additional work is to map the m records to the global dataset for the
purposes of indexing into the input and output arrays. Procedure 3 shows the algorithm for each
processor with indexing details for parameters D and assignedClasses. We use D[s..t) to mean the
subset of elements of D beginning at element s up to but not including element t. Here, we assume a
shared memory architecture so that all processors have equal access to the parameter and output
buffers. Knowing the index to a record R in D also gives the index to the corresponding
assignedClasses value. The steps of making D, breadthFirstTree, and assignedClasses available to each
processor are omitted.
[15] uses a data parallel approach similar to this, as does [1] when evaluating boosted decision trees
using CUDA, though the latter uses conditional instructions to traverse the tree.
Procedure 3 Data-Parallel Tree Evaluation
    Parameter: D
    Parameter: breadthFirstTree[N]
    Parameter: m ∈ ℕ, the number of records for this processor to process
    Parameter: p ∈ ℕ, this processor’s rank
    Output: assignedClasses[|D|]
    for all R ∈ D[m · p .. m(p + 1)) do
        i = 0
        while breadthFirstTree[i].classVal = ⊥ do
            a = breadthFirstTree[i].attributeIndex
            t = breadthFirstTree[i].threshold
            i = breadthFirstTree[i].childIndex + (r_a > t)
        c_R = breadthFirstTree[i].classVal
        assignedClasses[R] = c_R
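The slice arithmetic of Procedure 3 can be checked with a serial emulation. The Python sketch below assumes the tree is a breadth-first list of `(attribute index, threshold, child index, class)` tuples with `None` for ⊥ and 0-based attribute indices; the function names, the ceiling division used to pick m, and the clamping of the last slice are illustrative choices, not details taken from the procedure.

```python
def classify(flat_tree, record):
    """Branchless walk over a breadth-first array of (attr, threshold, child, cls) tuples."""
    i = 0
    while flat_tree[i][3] is None:            # None stands in for the class value ⊥
        attr, threshold, child, _ = flat_tree[i]
        i = child + (record[attr] > threshold)
    return flat_tree[i][3]


def evaluate_slice(dataset, flat_tree, m, p, assigned_classes):
    """Work of the processor with rank p: classify records D[m*p .. m*(p+1))."""
    for index in range(m * p, min(m * (p + 1), len(dataset))):
        assigned_classes[index] = classify(flat_tree, dataset[index])


def evaluate_data_parallel(dataset, flat_tree, num_processors):
    """Serially emulate P independent processors writing to a shared output array."""
    m = -(-len(dataset) // num_processors)    # ceiling division: records per processor
    assigned_classes = [None] * len(dataset)
    for p in range(num_processors):           # iterations are independent of one another
        evaluate_slice(dataset, flat_tree, m, p, assigned_classes)
    return assigned_classes
```

Since no slice overlaps another and each processor writes only its own output elements, the result does not depend on the number of processors or on the order in which they run.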
3.3 Speculative Decomposition
While a data decomposition applies multiple proces-
sors to the evaluation problem very efficiently, the
task of evaluating a single tree is still done serially.
Once m is reduced to 1, no further processors can
be applied to the problem usefully. Also, very deep
and unbalanced trees may lead to asymmetries in
the runtime between processors. In image segmenta-
tion, for instance, neighboring samples are expected
to take similar paths through the tree and have al-
most uniform class values. By luck of the draw,
some processor may be assigned m records that hap-
pen to be classified by the deepest node in the tree
while others have records classified at the top of the
tree. This leads to idle time in the “lucky” processors,
and thereby, practical inefficiency. Further, adjacent
records taking different paths leads to similar inef-
ficiencies in SIMD architectures like CUDA SMs or
Intel’s SSE instruction set.
We propose a speculative decomposition of the
problem to avoid the issues of divergent paths, irreg-
ular memory access patterns, idle time due to asym-
metrical processing times, and to provide more uni-
form evaluation times needed in deterministic, real-
time applications. We assign to each record a group of
p processors, called a record group, such that p = N .
If there are G such groups, the total number of pro-
cessors becomes P = Gp. Within the group, each
node n of the tree is assigned to processor pn. The
first step of the algorithm is to evaluate all nodes in
parallel. Each processor stores the child node index
i determined by the node evaluation into a shared
memory array, path, having one element for each pro-
cessor. The second step is to reduce the path through
the tree to the selected leaf node. This is done by
having each processor copy the path value of its child
node into its own element of path. That is, each node
finds its successor’s successor and adopts that as its
own successor. We can then think of the path ar-
ray as storing the eventual successor for each node,
with the eventual successor of the root node being
the terminal node for the record. This step must be
done synchronously across all processors in the record
group. Leaf nodes are specifically designed to always evaluate to themselves by setting their
threshold to +∞, so that the greater-than test is always false, and their child index to be their
own index.
Figure 2 shows an example tree and the path ar-
ray after the initial node evaluation (2b), then after
one (2c) and two (2d) steps of the parallel reduction
phase. Note that for a tree of depth d, only Θ(log2 d)
reduction steps are necessary for the root node to ar-
rive at the terminal leaf’s index. When this occurs,
the reduction terminates.
Procedure 4 gives the parallel algorithm, which
handles indexing the dataset as before but now ac-
counts for the specific record group g in the cal-
culation as well as determining which node of the
tree each processor is assigned to and setting up the
shared variable path. To compute the dataset indices,
we can follow the form of Procedure 3 but substitute
g for p. Again, we assume a shared arrangement for
the input dataset and the output assignedClasses
where the indices in each array correspond naturally.
We use the primitive barrier() to provide synchro-
nization on updates to path from within record group
g.
Procedure 4 Speculative Parallel Tree Evaluation
 1: Parameter: D
 2: Parameter: breadthFirstTree[N]
 3: Parameter: m ∈ ℕ, the number of records for this record group to process
 4: Parameter: g ∈ ℕ, the record group this processor belongs to
 5: Parameter: p_n ∈ ℕ, this processor’s rank in the record group
 6: Output: assignedClasses[|D|]
 7: Shared Variable: path[N]
 8: for all R ∈ D[m · g .. m(g + 1)) do
 9:     a = breadthFirstTree[p_n].attributeIndex
10:     t = breadthFirstTree[p_n].threshold
11:     path[p_n] = breadthFirstTree[p_n].childIndex + (r_a > t)
12:     barrier(g)
13:     rootClass = breadthFirstTree[path[0]].classVal
14:     while rootClass = ⊥ do
15:         path[p_n] = path[path[p_n]]
16:         barrier(g)
17:         rootClass = breadthFirstTree[path[0]].classVal
18:     c_R = rootClass
19:     assignedClasses[R] = c_R
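The two phases of Procedure 4 can be traced for a single record with a serial Python emulation. The sketch is hedged in several ways: the tree has the nine-node shape of Figure 2, but its thresholds, attribute indices, and class values are invented for illustration; each list comprehension stands in for one synchronous step of all processors in a record group; and leaves are given a +inf threshold so that the greater-than test leaves them pointing at themselves.

```python
import math

INF = math.inf
# Breadth-first array of (attribute index, threshold, child index, class) tuples.
# Shape follows Figure 2; all numeric values are made up for this sketch.
EXAMPLE_TREE = [
    (0, 0.5, 1, None),   # node 0: children 1, 2
    (1, 0.5, 3, None),   # node 1: children 3, 4
    (0, INF, 2, 0),      # node 2: leaf, points at itself
    (2, 0.5, 5, None),   # node 3: children 5, 6
    (0, INF, 4, 1),      # node 4: leaf
    (0, INF, 5, 2),      # node 5: leaf
    (3, 0.5, 7, None),   # node 6: children 7, 8
    (0, INF, 7, 3),      # node 7: leaf
    (0, INF, 8, 4),      # node 8: leaf
]


def speculative_classify(flat_tree, record):
    """Emulate one record group: evaluate every node, then reduce the path."""
    n = len(flat_tree)
    # Step 1: every node is evaluated, whether or not it lies on the record's true path.
    path = [child + (record[attr] > threshold)
            for attr, threshold, child, _ in flat_tree]
    steps = 0
    # Step 2: each node adopts its successor's successor until the root points at a leaf.
    while flat_tree[path[0]][3] is None:
        path = [path[path[i]] for i in range(n)]   # fresh list = synchronous update
        steps += 1
    return flat_tree[path[0]][3], path, steps
```

For the record `[0.0, 0.0, 1.0, 1.0]` the initial `path` is `[1, 3, 2, 6, 4, 5, 8, 7, 8]`, and two reduction steps bring the root to leaf 8, matching panels (b) through (d) of Figure 2.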
3.4 Improved Speculative Decomposition
A few inefficiencies exist in Procedure 4. First, leaf nodes always produce the same, known output,
so the processors assigned to them do no productive work. To avoid this waste, the path array can be
initialized with the known, static results for all leaves. Processors will only be assigned to
decision nodes such that $0 \le p_n < (N-1)/2$, the number of internal nodes in a full binary tree.
This means, however, that mapping processors in a record group to tree nodes is no longer a simple,
sequential operation. A tree-specific look-up table can accommodate this. As the record group
processes, each processor will modify only the element of path it is assigned to.
Figure 2: Parallel Tree Path Reduction
(a) Example Tree [diagram of a nine-node tree with nodes numbered 0 through 8]
(b) path after Node Evaluation
    path:  1 3 2 6 4 5 8 7 8
    index: 0 1 2 3 4 5 6 7 8
(c) path after One Reduction Iteration
    path:  3 6 2 8 4 5 8 7 8
    index: 0 1 2 3 4 5 6 7 8
(d) path after Two Reduction Iterations
    path:  8 8 2 8 4 5 8 7 8
    index: 0 1 2 3 4 5 6 7 8
Second, if the tree reduction is viewed probabilistically, we see that most records will end up at
some leaf between levels 1 and d of the tree, averaging to some $d_\mu$ for the dataset. Checking the
while condition on line 14 of Procedure 4 for all levels $d_r < d_\mu$ leads to an expected
inefficiency. If $d_\mu$ is known or can be determined experimentally for the tree, reducing $d_\mu$
levels in a single while loop pass can provide an average case performance enhancement by reducing
loop iterations and the number of barrier operations required.
Procedure 5 gives the improved parallel algorithm
for speculative decomposition. We add the static
paths for the leaves of the tree on line 3, and use them to
initialize the path array in parallel on line 10. Each
processor must now initialize two elements of path
since there are only processors for the internal nodes.
We also add the processor-node map on line 4, which
records the node index i assigned to each processor.
Line 20 shows the concept of multiple reductions per
loop, though the optimal implementation will be tree-
specific.
Procedure 5 Speculative Parallel Tree Evaluation
 1: Parameter: D
 2: Parameter: breadthFirstTree[N]
 3: Parameter: leafPaths[N]
 4: Parameter: processorNodeMap[(N − 1)/2]
 5: Parameter: m ∈ ℕ, the number of records for this record group to process
 6: Parameter: g ∈ ℕ, the record group this processor belongs to
 7: Parameter: p_n ∈ ℕ, this processor’s rank in the record group
 8: Output: assignedClasses[|D|]
 9: Shared Variable: path[N]
10: path[2p_n] = leafPaths[2p_n]
11: path[2p_n + 1] = leafPaths[2p_n + 1]
12: i = processorNodeMap[p_n]
13: for all R ∈ D[m · g .. m(g + 1)) do
14:     a = breadthFirstTree[i].attributeIndex
15:     t = breadthFirstTree[i].threshold
16:     path[i] = breadthFirstTree[i].childIndex + (r_a > t)
17:     barrier(g)
18:     rootClass = breadthFirstTree[path[0]].classVal
19:     while rootClass = ⊥ do
20:         path[i] = path[path[path[i]]]
21:         barrier(g)
22:         rootClass = breadthFirstTree[path[0]].classVal
23:     c_R = rootClass
24:     assignedClasses[R] = c_R
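A serial Python emulation shows how the static leaf paths and the processor-node map change the bookkeeping. As before, this is a sketch under assumptions rather than the measured implementation: the tree shape follows Figure 2 with invented thresholds and class values, `build_static_tables` is a hypothetical helper, the whole `path` array is copied from the leaf table in one step instead of two elements per processor, and a snapshot copy stands in for the synchronous read of `path` within one pass.

```python
# Breadth-first array of (attribute index, threshold, child index, class) tuples.
# Leaves are never evaluated here, so their threshold and child fields are unused.
EXAMPLE_TREE = [
    (0, 0.5, 1, None), (1, 0.5, 3, None), (0, 0.0, 0, 0),
    (2, 0.5, 5, None), (0, 0.0, 0, 1),    (0, 0.0, 0, 2),
    (3, 0.5, 7, None), (0, 0.0, 0, 3),    (0, 0.0, 0, 4),
]


def build_static_tables(flat_tree):
    """Precompute the leaf paths and the processor-to-decision-node map."""
    leaf_paths = [i if cls is not None else 0      # decision slots are overwritten later
                  for i, (_, _, _, cls) in enumerate(flat_tree)]
    processor_node_map = [i for i, node in enumerate(flat_tree) if node[3] is None]
    return leaf_paths, processor_node_map


def improved_speculative_classify(flat_tree, leaf_paths, processor_node_map, record):
    """One record, one 'processor' per decision node, three path hops per pass."""
    path = list(leaf_paths)                        # leaves already point at themselves
    for i in processor_node_map:                   # only decision nodes are evaluated
        attr, threshold, child, _ = flat_tree[i]
        path[i] = child + (record[attr] > threshold)
    passes = 0
    while flat_tree[path[0]][3] is None:
        snapshot = list(path)                      # all reads see the pre-pass values
        for i in processor_node_map:
            path[i] = snapshot[snapshot[snapshot[i]]]
        passes += 1
    return flat_tree[path[0]][3], passes
```

On this shallow example the number of passes is the same as with single hops; any saving from multiple reductions per pass would only be expected on deeper paths.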
3.5 Management and Tuning of Parallel
Algorithms
Some management work is required for each algorithm in sections 3.2, 3.3, and 3.4, but is omitted for
brevity and to preserve generality. This includes making the buffers for D, assignedClasses,
breadthFirstTree, and any of the other necessary symbols available to all the parallel processors for
each algorithm. The mechanism for sharing these buffers depends on the programming environment used.
Also, selection of optimal values for G and m given P, N, M, and the available execution hardware
architecture is critical but entirely implementation dependent.
3.6 Analysis of Evaluation Algorithms
We now analyze the asymptotic behavior of these general algorithms assuming a traditional parallel
processing setting of independent processors connected via shared memory. We perform an average case
run time analysis by assigning $d_\mu$ to be the average depth of the tree traversed by the records
in the dataset. This can be determined if the entire dataset is known a priori, or can be
statistically estimated given a sufficiently large sample, such as the training set. The serial
runtime for Procedure 2 for M records is given by
\[ T_2 = M d_\mu (t_e + t_c) \]
where $t_e$ is the time to evaluate a node’s attribute against its threshold and $t_c$ is the time to
compare the new node’s class value to ⊥. We also refer to $t_n = t_e + t_c$ as the time needed to
evaluate a node.
The run time for Procedure 3 is a function of P, the total number of processors applied, and is given by
\[ T_3(P) = \frac{M}{P}\, d_\mu (t_e + t_c) + t_i + t_s(M) \]
where each processor classifies $M/P$ records, $t_i$ is the time needed to compute the index in D
assigned to each processor, and $t_s(M)$ is the time needed to transmit M records on the shared memory
machine for processing. We can then examine the speedup of Procedure 3 as
\[ S_3(P) = \frac{T_2}{T_3(P)} = \frac{M d_\mu (t_e + t_c)}{\frac{M}{P} d_\mu (t_e + t_c) + t_i + t_s(M)} = \frac{P}{1 + \frac{P\,(t_i + t_s(M))}{M d_\mu (t_e + t_c)}} \]
If we assume $t_s(M) = \sigma M + \gamma$ for some $\sigma$, $\gamma$ and we ignore $\gamma$ and $t_i$
as small constants, then this simplifies asymptotically to
\[ S_3(P) \approx \frac{P}{1 + \frac{P\sigma}{d_\mu t_n}} \]
which suggests the speedup will be decided by the relative performance of the memory copy and the
serial node processing time. If they are very similar, we would not expect much speedup. If memory
copies are very fast compared to node processing, some benefit may be had. Likewise for the
efficiency, given by
\[ E_3(P) = \frac{S_3(P)}{P} \approx \frac{1}{1 + \frac{P\sigma}{d_\mu t_n}} \]
we expect good results only when copy time is much less than processing time.
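The approximate speedup and efficiency of the data decomposition, $P / (1 + P\sigma / (d_\mu t_n))$ and its quotient by P, are easy to explore numerically. The Python sketch below simply evaluates those two expressions; the function names and the sample values of $\sigma$, $d_\mu$, and $t_n$ used with them are arbitrary illustrations, not measurements.

```python
def speedup_data_parallel(P, sigma, d_mu, t_n):
    """Approximate S3(P) = P / (1 + P*sigma / (d_mu * t_n))."""
    return P / (1.0 + P * sigma / (d_mu * t_n))


def efficiency_data_parallel(P, sigma, d_mu, t_n):
    """Approximate E3(P) = S3(P) / P."""
    return speedup_data_parallel(P, sigma, d_mu, t_n) / P
```

With free copies (`sigma = 0`) the speedup equals P. When the per-record copy cost times P equals the per-record processing time `d_mu * t_n`, the speedup drops to P/2, and as P grows without bound the expression approaches `d_mu * t_n / sigma`, so the copy cost caps the attainable speedup under this model.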
For Procedure 5, the analysis is a bit more involved. If each group of processors is assigned
$m = M/G$ records for G groups of p processors such that $P = Gp$, the parallel runtime is given by
\[ T_5(P) = \frac{Mp}{P}\left(t_e + (\log_2 d_\mu)\, t_c\right) + t_i + t_s(M) \]
and the speedup is
\[ S_5(P) = \frac{T_2}{T_5(P)} = \frac{M d_\mu (t_e + t_c)}{\frac{Mp}{P}\left(t_e + (\log_2 d_\mu)\, t_c\right) + t_i + t_s(M)} = \frac{P}{\frac{p\,(t_e + (\log_2 d_\mu)\, t_c)}{d_\mu (t_e + t_c)} + \frac{P\,(t_i + t_s(M))}{M d_\mu (t_e + t_c)}} \]
with efficiency
\[ E_5(P) = \frac{S_5(P)}{P} \approx \frac{1}{\frac{p\,(t_e + (\log_2 d_\mu)\, t_c)}{d_\mu (t_e + t_c)} + \frac{P\sigma}{d_\mu t_n}} \]
Making the same assumptions about $t_s(M)$, $t_i$, and $\gamma$, $S_5(P)$ simplifies asymptotically to
\[ S_5(P) \approx \frac{P}{\frac{p\,(t_e + (\log_2 d_\mu)\, t_c)}{d_\mu (t_e + t_c)} + \frac{P\sigma}{d_\mu t_n}} \]
For the values of P and $d_\mu$ we examine, this should not be very different from $S_3(P)$. However,
these equations allow us to examine when $S_5(P) > S_3(P)$, which occurs when
\[ \frac{p\,(t_e + (\log_2 d_\mu)\, t_c)}{d_\mu (t_e + t_c)} < 1, \quad \text{or} \]
\[ p\,(t_e + (\log_2 d_\mu)\, t_c) < d_\mu (t_e + t_c) \]
\[ p < \frac{d_\mu (t_e + t_c)}{t_e + (\log_2 d_\mu)\, t_c} \]
If we further assume $t_e$ and $t_c$ are roughly equivalent operations (both being comparisons) and
each taking time t, we can simplify this to
\[ p < \frac{2 t d_\mu}{t\,(1 + \log_2 d_\mu)} \]
\[ p < \frac{2 d_\mu}{1 + \log_2 d_\mu} \tag{1} \]
For practical values of $d_\mu$, the slope of the graph of the right-hand side of (1) is around 1/3.
Since the number of decision nodes grows faster than the average depth (at a rate dependent on the
balancing of the tree), we should not expect a great speedup from Procedure 5 for any but the most
shallow trees.
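The bound in (1) can be tabulated directly to see how few processors per record group it allows. The Python sketch below evaluates $2 d_\mu / (1 + \log_2 d_\mu)$; the function name `max_group_size` is an illustrative choice, and the assumption that $t_e$ and $t_c$ are equal is inherited from the derivation of (1).

```python
import math


def max_group_size(d_mu):
    """Right-hand side of (1): p must stay below 2*d_mu / (1 + log2(d_mu))."""
    return 2.0 * d_mu / (1.0 + math.log2(d_mu))


def average_slope(d_low, d_high):
    """Average slope of the bound between two average depths."""
    return (max_group_size(d_high) - max_group_size(d_low)) / (d_high - d_low)
```

For example, the bound is 4 at an average depth of 8 and 6.4 at an average depth of 16, an average slope of 0.3 over that interval. Even at an average depth of 11 the bound stays below 5, so under these assumptions a tree with 15 decision nodes falls outside the region where the speculative decomposition is expected to win on independent processors.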
4. Experiments on Parallel Classifica-
tion Tree Algorithms
The preceding analysis assumes each parallel node
execution is independent from the others. In GPUs,
particularly CUDA architecture, this is not the case.
We expect to see a performance benefit due to local
caching of neighboring records read from global mem-
ory in bursts, the SIMD coupling of execution nodes
evaluated in parallel for each sample, having multiple
SIMD groups resident and quickly switched to on the
chosen hardware, and other such concerns. These are
not general concerns but are specific to a particular
hardware architecture. In this setting, it makes sense
to pursue more specific analysis by experimentation.
The following sections detail experiments done on the
CUDA platform with runtime as the metric of per-
formance.
4.1 Problem Selection
We selected the Image Segmentation dataset from
UC Irvine’s Machine Learning Repository [17] as an
evaluation problem representative of tasks in medical
imaging or computer vision. This data set consists
of 2310 records for training and an additional 2099
for testing. Each record consists of 19 real-valued at-
tributes of a 3×3 pixel neighborhood and corresponds
to one of 7 discrete classes.
To generate a classifier based on this dataset,
we used the Orange component-based machine learn-
ing library available from [9]. This library provides
Python bindings to a mature C++ machine learning
library. We wrote a Python script to read the train-
ing set, train a classification tree, and generate C++
source code which encodes that tree according to Pro-
cedure 1. The resulting tree is shown schematically
in Figure 3. This tree has N = 31 nodes, 16 leaves,
and a depth of 11.
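As an illustration of what a tree encoded in C++ source and evaluated serially might look like, here is a minimal sketch. The `Node` layout, the function names, and the toy tree values are our own assumptions; the encoding actually prescribed by Procedure 1 is not reproduced in this section and may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flat encoding of a binary classification tree.
// attr < 0 marks a leaf, whose class label is stored in `label`.
struct Node {
    int   attr;       // index of the attribute tested, or -1 for a leaf
    float threshold;  // go left when record[attr] <= threshold
    int   left;       // index of the left child in the node array
    int   right;      // index of the right child in the node array
    int   label;      // class label (meaningful for leaves only)
};

// Serial evaluation: walk from the root (index 0) until a leaf is reached.
int eval_tree(const std::vector<Node>& tree, const float* record) {
    assert(!tree.empty());
    int i = 0;
    while (tree[i].attr >= 0) {
        i = (record[tree[i].attr] <= tree[i].threshold) ? tree[i].left
                                                        : tree[i].right;
    }
    return tree[i].label;
}

// A toy tree with 2 decision nodes and 3 leaves (illustrative values only).
std::vector<Node> make_toy_tree() {
    return {
        {0, 0.5f, 1, 2, -1},   // node 0: test attribute 0
        {-1, 0.0f, 1, 1, 10},  // node 1: leaf, class 10
        {1, 2.0f, 3, 4, -1},   // node 2: test attribute 1
        {-1, 0.0f, 3, 3, 20},  // node 3: leaf, class 20
        {-1, 0.0f, 4, 4, 30},  // node 4: leaf, class 30
    };
}
```

The experimental tree would use the same kind of table with 31 entries, 15 of them decision nodes. The number of loop iterations depends on the record, which is the source of the per-thread variation discussed for the data-parallel kernel.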
Further, the script also combined the training set
and the test set of records into a single table, then re-
peatedly randomized and output the records as C++
source code for easy inclusion in our test program.
This process was repeated until 16,384 C++ records
were generated. This set can be duplicated four times
at runtime to create a dataset having 65,536 records,
representing an image of 256 × 256 pixels.
4.2 Experiment Setup
4.2.1 Machine Configuration
Experiments were performed on a Dell Optiplex 780
with an Intel Core2 Duo E8600 CPU running at 3.33
GHz, 4 GB RAM, and the Windows 7 64-bit oper-
ating system. An NVIDIA Quadro 2000 GPU card
was added with 1 GB of 128-bit RAM with a band-
width of 41.6 GB/s and 192 CUDA cores in 4 SMs
of 48 cores each with a 1.25 GHz processor clock.
Software on the system included the NVIDIA driver
version 263.06 and the CUDA 3.2.1 runtime DLL
version 8.17.12.6303. All compilation was done with
Microsoft Visual Studio 2008 and the CUDA 3.2
Development Toolkit, with project files generated by
CMake version 2.8.3.
4.2.2 Tests Conducted
We created a program which, after building a dataset
of 65,536 records, ran three tree evaluation functions
500 times each on the full dataset. For each function
call, the Windows high performance counter was
started before and stopped after the call, and the delta
time was accumulated. This is called the outer time
for the algorithm. For those functions using a CUDA
kernel, a similar inner time was collected around just
the kernel invocation, excluding any time for memory
copies to or from the GPU. During kernel execution,
the host CPU was made to wait until the kernel
completed.

Figure 3: Experimental Classification Tree. [Tree diagram not
reproduced. Decision nodes test the attributes rawred mean,
hue mean, region centroid row, region centroid col, value mean,
exgreen mean, exred mean, vedge mean, saturation mean,
intensity mean, and short line density 5; leaves are labeled
SKY, GRASS, PATH, CEMENT, WINDOW, FOLIAGE, and BRICKFACE.]

The three functions evaluated were as follows:
EvalTree(): This function implements Procedure 2,
a serial algorithm running on the host. Note
that this function records no inner time and
that the outer time does not include any mem-
ory copies since none are required for the host
to evaluate the dataset.
EvalTreeBySample(): This is the data parallel algo-
rithm given in Procedure 3. This function is
written in CUDA C, and performs a host-to-
device copy of the dataset and the tree defi-
nition before invoking the kernel. The grid is
formed of 512 blocks having 128 threads each,
all one-dimensional. Only one record is evaluated
per thread (i.e., m = 1). For this function
(and all other CUDA functions), the tree is
copied to device constant memory for caching
purposes. When the kernel completes, the host
copies the resulting class assignments back to
host memory and frees all device resources.
EvalTreeByNode(): This function fully implements
the improved speculative algorithm correspond-
ing to Procedure 5 with the following con-
siderations: constant memory is used for the
processor-node map and static leaf path buffers
in addition to the tree definition; multiple
reductions (specifically 2, determined empiri-
cally) are performed per iteration of the path
reduction loop; and the explicit barrier() op-
erations are omitted since each thread executes
synchronously within a warp. The shared mem-
ory path variable is initialized from the static
leaf buffer only once at kernel invocation. This
is safe since leaves never change and internal
nodes are re-initialized by the node evaluation
step done for each record. The grid is set to 128
blocks of 16×16 threads. Thus each block pro-
cesses 16 record groups in parallel, each record
group using p = 16 threads (a half-warp) to
evaluate a record. Note that there are only 15
internal nodes in the tree, so one thread is idle
per record group (assigned to a phantom node).
With 128 × 16 record groups, each group must
process m = 32 records to cover 65,536
records exactly. Having the thread geometry
exactly match the data size allows us to remove checks
for over-sized grids, a non-portable practice but
one with a noticeable performance effect. Data
copies to and from the device were the same as
in EvalTreeBySample().
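The launch geometries described above can be checked with a little arithmetic. The sketch below is host-side C++ only; the `Geometry` struct and the names `records_covered`, `kBySample`, and `kByNode` are ours, introduced for illustration.

```cpp
#include <cassert>

// Launch geometry of a kernel, as described in the text.
struct Geometry {
    long blocks;              // blocks in the grid
    long threads_per_block;   // threads in each block
    long threads_per_record;  // p: threads cooperating on one record
    long records_per_group;   // m: records processed by each thread group
};

// Number of records covered by a launch with the given geometry.
long records_covered(const Geometry& g) {
    assert(g.threads_per_record > 0);
    assert(g.threads_per_block % g.threads_per_record == 0);
    const long groups_per_block = g.threads_per_block / g.threads_per_record;
    return g.blocks * groups_per_block * g.records_per_group;
}

// EvalTreeBySample: 512 blocks of 128 threads, one thread per record, m = 1.
const Geometry kBySample = {512, 128, 1, 1};
// EvalTreeByNode: 128 blocks of 16x16 threads, a half-warp per record, m = 32.
const Geometry kByNode = {128, 16 * 16, 16, 32};
```

Both geometries cover exactly 65,536 records, which is what allows the bounds checks to be dropped. Changing the dataset size without adjusting these numbers would leave records unprocessed or threads reading past the end of the data.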
After each CUDA function call, the returned
buffer of class assignments was compared to the re-
sults obtained using the serial algorithm, and any dis-
crepancies were reported. None were found.
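The idea behind this equivalence check can be illustrated entirely on the host. The sketch below emulates speculative evaluation serially: every node is evaluated whether or not it is reachable, and the root-to-leaf path is then found by repeated pointer jumping. It is a simplified stand-in for the general technique, not the paper's Procedure 5; the `Node` layout, function names, toy tree, and the explicit `depth` parameter are all our assumptions, and the processor-node map, half-warp mapping, and shared memory usage are omitted.

```cpp
#include <cassert>
#include <vector>

struct Node {
    int   attr;       // attribute index, or -1 for a leaf
    float threshold;  // go left when record[attr] <= threshold
    int   left;       // index of the left child
    int   right;      // index of the right child
    int   label;      // class label (leaves only)
};

// Reference result: ordinary serial walk from the root.
int eval_serial(const std::vector<Node>& tree, const float* rec) {
    int i = 0;
    while (tree[i].attr >= 0)
        i = (rec[tree[i].attr] <= tree[i].threshold) ? tree[i].left
                                                     : tree[i].right;
    return tree[i].label;
}

// Speculative evaluation emulated on the host.
// `depth` is the maximum number of edges from the root to any leaf.
int eval_speculative(const std::vector<Node>& tree, const float* rec, int depth) {
    assert(!tree.empty() && depth >= 0);
    const int n = static_cast<int>(tree.size());
    std::vector<int> path(n), next(n);
    // Step 1: evaluate every node. Leaves point to themselves; each
    // decision node points to the child its comparison selects.
    for (int i = 0; i < n; ++i) {
        const Node& nd = tree[i];
        path[i] = (nd.attr < 0) ? i
                : (rec[nd.attr] <= nd.threshold ? nd.left : nd.right);
    }
    // Step 2: pointer jumping. Each round doubles the distance covered by
    // path[i]; two buffers mimic lockstep SIMD execution.
    for (int span = 1; span < depth; span *= 2) {
        for (int i = 0; i < n; ++i) next[i] = path[path[i]];
        path.swap(next);
    }
    return tree[path[0]].label;  // path[0] now names the leaf for the root
}

// Unbalanced toy tree: 4 decision nodes in a chain, 5 leaves, depth 4.
std::vector<Node> make_chain_tree() {
    return {
        {0, 0.5f, 1, 2, -1}, {-1, 0.0f, 1, 1, 0},
        {1, 0.5f, 3, 4, -1}, {-1, 0.0f, 3, 3, 1},
        {2, 0.5f, 5, 6, -1}, {-1, 0.0f, 5, 5, 2},
        {3, 0.5f, 7, 8, -1}, {-1, 0.0f, 7, 7, 3},
        {-1, 0.0f, 8, 8, 4},
    };
}
```

For the depth-4 toy tree only two jumping rounds are needed, while the serial walk may take up to four comparisons. The price is that all four decision nodes are evaluated for every record, which is the trade-off the analysis in the previous section quantifies.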
The entire program was also run with the CUDA
profiler enabled. This facility captures device
timestamps and other metrics resulting from the program
execution.
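The outer-time bookkeeping described earlier in this section can be sketched portably as follows. This is not the program used in the experiments: it substitutes `std::chrono::steady_clock` for the Windows high performance counter, the names `TimingStats` and `time_calls` are ours, and the use of a population (rather than sample) standard deviation is an assumption, since the text does not say which was reported.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <functional>

// Summary statistics over repeated timed calls, in microseconds.
struct TimingStats {
    int    runs;
    double mean_us;
    double min_us;
    double max_us;
    double stddev_us;
};

// Calls fn `runs` times, timing each call individually and accumulating
// the per-call deltas.
TimingStats time_calls(const std::function<void()>& fn, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    double sum = 0.0, sum_sq = 0.0, lo = 0.0, hi = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        fn();
        const auto stop = clock::now();
        const double us =
            std::chrono::duration<double, std::micro>(stop - start).count();
        sum += us;
        sum_sq += us * us;
        if (r == 0 || us < lo) lo = us;
        if (r == 0 || us > hi) hi = us;
    }
    const double mean = sum / runs;
    double var = sum_sq / runs - mean * mean;  // population variance (assumed)
    if (var < 0.0) var = 0.0;                  // guard against rounding error
    return {runs, mean, lo, hi, std::sqrt(var)};
}
```

Wrapping a whole evaluation function in `fn` would yield an outer time; wrapping only the kernel launch plus a blocking wait for completion would yield an inner time.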
4.3 Results
The program output giving the outer and inner times
along with related statistics is summarized in Table 1.
Most notable is that, measured by outer time, the serial
evaluation on the host is roughly twice as fast as the
fastest parallel GPU version. This is surprising but
perhaps a bit misleading, since no great pains were taken
to optimize the memory copy tasks, which were all done in
series. Pinning and aligning the host memory buffers and
overlapping copies with computation are viable techniques
to boost performance for this problem. However, the result
does suggest that the methods Sharp used in [15] to measure
a speedup of two orders of magnitude may not be comparable
to ours. Sharp also does not give the serial algorithm used
as the baseline for the parallel algorithm, which suggests
that a branchless serial algorithm may perform better than
the one used in [15].
In our main result, comparing the inner
times for kernel execution, we see a roughly
25% performance increase in EvalTreeByNode
over EvalTreeBySample. Further experiments on
EvalTreeByNode showed that including a conditional
to check for an over-sized warp increased runtime
to roughly that of EvalTreeBySample.
With m = 1, timings were again roughly equal, showing
that the expense of the initial load of static paths
and the processor-node map is amortized over multiple
record iterations. Values of m > 32 (with
related block resizing) showed no significant benefit.
This and other experiments suggest that CUDA
thread scheduling is as efficient as iterating in a for
loop.
Table 1: Outer and Inner Times According to High-Performance Counter (all times in µs)

                     Outer     Outer     Outer     Outer     Inner     Inner     Inner     Inner
Algorithm            Average   Min       Max       Std Dev   Average   Min       Max       Std Dev
EvalTree (Host)      1914.16   1900.48   2343.65   43.481    N/A       N/A       N/A       N/A
EvalTreeBySample     3907.57   3794.19   4741.2    77.2049   538.235   525.705   769.309   15.3554
EvalTreeByNode       3785.29   3685.17   4677.76   87.0612   404.466   394.817   432.698   10.9616

Examination of the CUDA profiler output shows
similar results for kernel timings (Figure 4), though
uniformly lower than those measurable outside of the
CUDA driver. The GPU times confirm a ~25% improvement
in kernel times: 353.47 µs for EvalTreeByNode versus
485.17 µs for EvalTreeBySample. The time in the graph
for “memcpyHtoD” shows the copy time of the data set
and tree definitions (two invocations per execution)
for both CUDA functions over 500 iterations each.
Adding this and the “memcpyDtoH” time to each of the
kernel times gives the outer time for each function,
less the time taken by the host to allocate and free
buffers and manage the function calls.
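The headline figures quoted in this section follow from the reported timings by simple arithmetic. The sketch below shows the calculation; the helper names `runtime_reduction` and `slowdown` are ours, and the input values are the averages from Table 1 and the profiler kernel times quoted above.

```cpp
#include <cassert>

// Fractional reduction in run time when moving from `baseline_us`
// to `improved_us` (e.g. 0.25 means 25% less time).
double runtime_reduction(double baseline_us, double improved_us) {
    assert(baseline_us > 0.0);
    return 1.0 - improved_us / baseline_us;
}

// How many times longer `slow_us` is than `fast_us`.
double slowdown(double fast_us, double slow_us) {
    assert(fast_us > 0.0);
    return slow_us / fast_us;
}
```

With the average inner times, `runtime_reduction(538.235, 404.466)` is about 0.249. With the profiler kernel times, `runtime_reduction(485.17, 353.47)` is about 0.271, consistent with the rounded figure of roughly 25%. For the average outer times, `slowdown(1914.16, 3785.29)` is about 1.98, which is the basis for describing the host as roughly twice as fast as the fastest GPU version.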
The profiler data also shows EvalTreeByNode taking
an average of 4373 divergent branches across all
threads due to the half-warp scheduling, whereas
EvalTreeBySample shows 3530 across all threads,
since each thread in a warp iterates through the
tree a different number of times. EvalTreeByNode
had a global cache read hit rate of 70%, while
EvalTreeBySample had a hit rate of only 31%.
With fewer threads per block (128 versus 256),
EvalTreeBySample encounters the limit on active
blocks, leaving the achieved occupancy at 66%.
EvalTreeByNode avoids this issue and achieves 100%
occupancy. This increases the number of global memory
requests for record data that can be active, and thus
enhances the effect of latency hiding by the warp
scheduler. This can be seen in the global memory write
throughput of 0.643 GB/s for EvalTreeBySample versus
4.68 GB/s for EvalTreeByNode. Read throughputs are
roughly equal at 14 GB/s (due to caching), giving
overall global memory throughputs of 15.43 GB/s for
EvalTreeBySample and 19.41 GB/s for EvalTreeByNode.
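The occupancy figures above can be reproduced from the per-multiprocessor limits of compute capability 2.x devices (48 resident warps and 8 resident blocks per SM, 32 threads per warp). The following sketch is ours: the function name `occupancy` is hypothetical, and it deliberately ignores register and shared memory limits, which can lower occupancy further in other kernels.

```cpp
#include <algorithm>
#include <cassert>

// Theoretical occupancy considering only the per-SM warp and block limits.
// Defaults are the compute capability 2.x limits; register and shared
// memory limits are ignored (an assumption of this sketch).
double occupancy(int threads_per_block,
                 int max_warps_per_sm = 48,
                 int max_blocks_per_sm = 8,
                 int warp_size = 32) {
    assert(threads_per_block > 0);
    // Warps needed per block, rounding partial warps up.
    const int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    // Resident blocks are capped by both the block limit and the warp limit.
    const int blocks = std::min(max_blocks_per_sm,
                                max_warps_per_sm / warps_per_block);
    return static_cast<double>(blocks * warps_per_block) / max_warps_per_sm;
}
```

With 128 threads per block, the 8-block limit is reached at 32 resident warps, giving 32/48, or about 66.7%. With 256 threads per block, 6 blocks fill all 48 warps, giving 100%.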
5. Conclusion
We have shown a speculative decomposition algo-
rithm for parallel classification tree evaluation that
surpasses the performance of a data decomposition
parallel algorithm on the CUDA platform. When ig-
noring the common, serial algorithm setup process-
ing, the speculative approach is 25% faster than the
data parallel approach in our particular problem in-
stance. This demonstrates how different parallel de-
composition techniques can maximize the advantages
of a given platform. In a SIMD environment, we
see that speculative decomposition into many time-
uniform tasks can have a helpful effect even at the
cost of less efficient hardware utilization. We also see
a good example of implementation results deviating
from asymptotic theoretical analysis. This is especially
true when fundamental assumptions, such as independent
execution units, do not hold in the implementation,
as is the case here. Ultimately, the best
performance requires a careful balance of machine
and algorithm for a specific problem.
Additionally, we have seen that measurement techniques
which do not include the entire program overhead of
distributing data, or which compare different
algorithms, can lead to confusing results. Though
we have implemented a program very similar to that of [15],
our serial host implementation is roughly twice as
fast as our GPU implementations when all overhead is
included, whereas Sharp reports a GPU implementation
roughly 100 times faster than the serial one. Surely,
some difference in host speed, GPU power, and the lower
overhead cost of processing forests rather than single
trees is responsible for part of this discrepancy. The
remaining difference suggests that the branchless
evaluation algorithm ought to be used as the best known
serial algorithm for speedup comparisons.
6. Further Work
The breadth of this result should be tested against
other tree geometries (e.g., more or less balanced,
deeper or shallower) and record distributions (ordered
vs. random) to observe the effect different data
organizations can have on run times. Also, application
of these algorithms to more traditional SIMD
(i.e., vector) processors would be interesting.
Comparing CUDA compute 1.x devices with 2.x devices
might also provide additional insights.

Figure 4: Average timings taken by CUDA runtime over 500 executions (µs)
To extend the current work, application to very
large trees might be achieved by evaluating only a
small “window” on the tree, starting at a root node
and evaluating only the next few levels. Once re-
duced, the resulting node would then become the root
of the next window and the process repeated. This
approach may be useful in overcoming SIMD concurrency
limits (such as on a vector processor) or
the exponential growth of memory demand for deeper
and deeper levels of the tree.
References
[1] Paul Baumstarck. GPU parallel processing for
fast robotic perception. Thesis, Engineer’s de-
gree, Stanford University, December 2009.
[2] Yael Ben-Haim and Elad Tom-Tov. A streaming
parallel decision tree algorithm. J. Mach. Learn.
Res., 11:849–872, March 2010.
[3] NVIDIA Corporation. CUDA Zone.
http://www.nvidia.com/object/cuda_home_new.html, Feb 2011.
[4] NVIDIA Corporation. NVIDIA Developer Zone.
http://developer.nvidia.com/object/gpucomputing.html, Feb 2011.
[5] Stephen Gould, Olga Russakovsky, Ian Goodfellow,
Paul Baumstarck, Andrew Y. Ng, and Daphne Koller.
The STAIR Vision Library (v2.4).
http://ai.stanford.edu/~sgould/svl, May 2010.
[6] Ruoming Jin and Gagan Agrawal. Shared mem-
ory parallelization of decision tree construction
using a general data mining middleware. In
Proceedings of the 8th International Euro-Par
Conference on Parallel Processing, Euro-Par ’02,
pages 346–354, London, UK, 2002. Springer-
Verlag.
[7] Mahesh V. Joshi, George Karypis, and Vipin
Kumar. Scalparc: A new scalable and efficient
parallel classification algorithm for mining large
datasets. In Proc. of the International Parallel
Processing Symposium, pages 573–579, 1998.
[8] David B. Kirk and Wen-mei W. Hwu. Program-
ming Massively Parallel Processors: A Hands-on
Approach. Morgan Kaufmann, 1st edition, Feb
2010.
[9] Laboratory of Artificial Intelligence, Faculty of
Computer and Information Science. Orange for
Python 2.6. http://orange.biolab.si/.
[10] Manish Mehta, Rakesh Agrawal, and Jorma Ris-
sanen. Sliq: A fast scalable classifier for data
mining. In Proc. of the Fifth International
Conference on Extending Database Technology
(EDBT), pages 18–32, Avignon, France, March
1996.
[11] A. Nair, B. Kuban, E. Tuzcu, P. Schoenhagen,
S. Nissen, and D. Vince. Coronary plaque classi-
fication with intravascular ultrasound radiofre-
quency data analysis. Circulation, 106:2200–
2206, October 2002.
[12] Arnau Oliver and Jordi Freixenet. Automatic
classification of breast density. In IEEE Inter-
national Conference on Image Processing, pages
1258–1261, 2005.
[13] Jason Sanders and Edward Kandrot. CUDA by
Example: An Introduction to General-Purpose
GPU Programming. Addison-Wesley Profes-
sional, 1st edition, July 2010.
[14] John Shafer, Rakesh Agrawal, and Manish
Mehta. Sprint: A scalable parallel classifier for
data mining. In Proceedings of the 22nd International
Conference on Very Large Databases
(VLDB), pages 544–555. Morgan Kaufmann,
September 1996.
[15] Toby Sharp. Implementing decision trees and
forests on a gpu. In European Conference on
Computer Vision (ECCV) 2008, volume 5305 of
Lecture Notes in Computer Science, pages 595–
608. Springer, 2008.
[16] D. Steinkraus, I. Buck, and P. Y. Simard. Using
GPUs for machine learning algorithms. In Proceedings
of the Eighth International Conference on Document
Analysis and Recognition, volume 2, pages 1115–1120,
29 Aug.–1 Sept. 2005.
[17] UCI Machine Learning Repository. Image
Segmentation data set.
http://archive.ics.uci.edu/ml/datasets/Image+Segmentation,
November 1990.
[18] Mohammed J. Zaki, Ching-Tien Ho, and Rakesh
Agrawal. Parallel classification for data mining
on shared-memory multiprocessors. In International
Conference on Data Engineering, page 198, 1999.
