We present a CUDA based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the GPU and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelized, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort based approach for larger depths. We show speedups of between 3-6x using a Titan X compared to a 4 core i7 CPU, and 1.2x using a Titan X compared to 2x Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
shows an example decision tree that can predict whether or not an individual owns a house. 87 The decision is based on their age and whether or not they have a job. The tree correctly classifies all 88 instances from Table 1 . 89 Decision tree algorithms typically expand nodes from the root in a greedy manner in order to maximise some criterion measuring the value of the split. For example, decision tree algorithms from the C4.5 family (Quinlan, 2014) , designed for classification, use information gain as the split criterion. Information gain describes a change in entropy H from some previous state to a new state. Entropy is defined as Where T is a set of labelled training instances, y 2 Y is an instance label and P(y) is the probability of drawing an instance with label y from T . Information gain is defined as Gradient boosting is a variation on boosting which represents the learning problem as gradient descent on some arbitrary differentiable loss function that measures the performance of the model on the training set. More specifically, the boosting algorithm executes M boosting iterations to learn a function F(x) that outputs predictionsŷ = F(x) minimising some loss function L (y,ŷ) . At each iteration we add a new estimator f (x) to try to correct the prediction of y for each training instance.
We can correct the model by setting f (x) to:
This fits the model f (x) for the current boosting iteration to the residuals y F m (x) of the previous 118 iteration. In practice, we approximate f (x), for example by using a depth limited decision tree. 119 This iterative process can be shown to be a gradient descent algorithm when the loss function is the squared error:
The loss over all training instances can be written as
We seek to minimise J by adjusting F(x i ). The gradient for a particular instance x i is given by
We can see that the residuals are the negative gradient of the squared error loss function:
f (x)=y F m (x)= dL(y, F(x)) dF (x) By adding a model that approximates this negative gradient to the ensemble we move closer to a local 120 minimum of the loss function, thus implementing gradient descent. 121
Generalised Gradient Boosting and XGBoost 122
Here we derive the XGBoost algorithm following the explanation in (Chen and Guestrin, 2016) . XGBoost 123 is a generalized gradient boosting implementation that includes a regularisation term, used to combat 124 overfitting, as well as support for arbitrary differentiable loss functions. 125 Instead of optimising plain squared error loss, an objective function with two parts is defined, a loss function over the training set as well as a regularisation term which penalises the complexity of the model:
L(y i ,ŷ i ) can be any convex differentiable loss function that measures the difference between the prediction and true label for a given training instance. Ω( f k ) describes the complexity of tree f k and is defined in the XGBoost algorithm (Chen and Guestrin, 2016) as
where T is the number of leaves of tree f k and w are the leaf weights (i.e., the predicted values stored Given that boosting proceeds in an iterative manner we can state the objective function for the current iteration m in terms of the prediction of the previous iterationŷ i (m 1) adjusted by the newest tree f k :
We can then optimise to find the f k which minimises our objective.
131
Taking the Taylor expansion of the above function to the second order allows us to easily accommodate different loss functions:
Here, g i and h i are the first and second order derivatives respectively of the loss function for instance i:
Note that the modelŷ i (m 1) is left unchanged during this optimisation process. The simplified objective function with constants removed is
We can also make the observation that a decision tree predicts constant values within a leaf. f k (x) can 133 then be represented as w q(x) where w is the vector containing scores for each leaf and q(x) maps instance 134
x to a leaf.
135
The objective function can then be modified to sum over the tree leaves and the regularization term from Equation 1:
Here, I j refers to the set of training instances in leaf j. The sums of the derivatives in each leaf can be 136 defined as follows:
Also note that w q(x) is a constant within each leaf and can be represented as w j . Simplifying we get
The weight w j for each leaf minimises the objective function at
The best leaf weight w j given the current tree structure is then
Using the best w j in Equation 2 the objective function for finding the best tree structure then becomes
Equation 3 is used in XGBoost as a measure of the quality of a given tree.
Feature Value 0.1 0.4 0.5 0.6 0.9 1.1 g i 0.1 0.8 0.2 -1.1 -0.2 -0.5 h i 1.0 1.0 1.0 1.0 1.0 1.0 G L 0.0 0.1 0.9 1.1 0.0 -0.2 H L 0.0 1.0 2.0 3.0 4.0 5.0 Given that it is intractable to enumerate through all possible tree structures we greedily expand the tree from the root node. In order to evaluate the usefulness of a given split we can look at the contribution of a single leaf node j to the objective function from Equation 3:
We can then consider the contribution to the objective function from splitting this leaf into two leaves:
The improvement to the objective function from creating the split is then defined as Gain = Ob j split Ob j lea f which yields
The quality of any given split separating a set of training instances is evaluated using the gain function in Table 2 shows an example set of instances in a leaf. We can assume we know the sums G and H 150 within this node as these are simply the G L or G R from the parent split. Therefore we have everything we 151 need to evaluate Gain for every possible split within these instances and select the best. Table 3 . If f2 is the feature to be predicted then an input training pair (x i , y i ) takes the
A data matrix within XGBoost may also contain missing values. One of the key features of XGBoost is the ability to store data in a sparse format by implicitly keeping track of missing values instead of physically storing them. While XGBoost does not directly 159 support categorical variables, the ability to efficiently store and process sparse input matrices allows us to 160 process categorical variables through one hot encoding. all missing examples down the left or right branch and select the best option. This is slightly complicated 175 to implement in practice as we do not know the gradient statistics of the missing values for any given 176 feature we are working on, although we do know the sum of all the gradient statistics for the current node.
177
The XGBoost algorithm handles this by performing two scans over the input data, the second being in 178 the reverse direction. In the first left to right scan the gradient statistics for the left direction are the scan 179 values maintained by the scan, the gradient statistics for the right direction are the sum gradient statistics 180 for this node minus the scan values. Hence, the right direction implicitly includes all of the missing values.
181
When scanning from right to left the reverse is true and the left direction includes all of the missing values.
182
The algorithm then selects the best split from either the forwards or backwards scan. 183
Graphics Processing Units

184
The purpose of this paper is to describe how to efficiently implement decision tree learning for XGBoost 185 on a GPU. GPUs can be thought of at a high level as having a shared memory architecture with multiple 186 SIMD (single instruction multiple data) processors. These SIMD processors operate in lockstep typically 187 in batches of 32 "threads" (Matloff, 2011) . GPUs are optimised for high throughput and work to 188 hide latency through the use of massive parallelism. This is in contrast to CPUs which use multiple 189 caches, branch prediction and speculative execution in order to optimize latency with regards to data 190 dependencies (Baxter, 2013). GPUs have been used to accelerate a variety of tasks traditionally run on 191 CPUs, providing significant speedups for parallelizable problems with a high arithmetic intensity. Of 192 particular relevance to machine learning is the use of GPUs to train extremely large neural networks.
193
It was shown in 2013 that 1 billion parameter networks could be trained in a few days on three GPU 
Languages and Libraries 197
The two main languages for general purpose GPU programming are CUDA and OpenCL. CUDA was 198 chosen for the implementation discussed in this paper due to the availability of optimised and production 199 ready libraries. The GPU tree construction algorithm would not be possible without a strong parallel 200 primitives library. We make extensive use of scan, reduce and radix sort primitives from the CUB (Merrill 207 CUDA code is written as a kernel to be executed by many thousands of threads. All threads execute 208 the same kernel function but their behaviour may be distinguished through a unique thread ID. Listing 1 209 7/28
Execution model
shows an example kernel adding values from two arrays into an output array. Indexing is determined by 210 the global thread ID and any unused threads are masked off with a branch statement.
211
Threads are grouped according to thread blocks that typically each contain some multiple of 32 212 threads. A group of 32 threads is known as a warp. Thread blocks are queued for execution on hardware 213 streaming multiprocessors. Streaming multiprocessors switch between different warps within a block 214 during program execution in order to hide latency. Global memory latency may be hundreds of cycles and 215 as such it is important to launch sufficiently many warps within a thread block to facilitate latency hiding.
216
A thread block provides no guarantees about the order of thread execution unless explicit memory 217 synchronization barriers are used. Synchronisation across thread blocks is not generally possible within a 218 single kernel launch. Device wide synchronization is achieved by multiple kernel launches. For example, 219 if a global synchronisation barrier is required within a kernel, the kernel must be separated into two 220 distinct kernels where synchronisation occurs between the kernel launches. 222 CUDA exposes three primary tiers of memory for reading and writing. Device wide global memory, 223 thread block accessible shared memory and thread local registers.
Memory architecture
224
• Global memory Global memory is accessible by all threads and has the highest latency. Input 225 data, output data and large amounts of working memory are typically stored in global memory.
226
Global memory can be copied from the device (i.e., the GPU) to the host computer and vice versa.
227
Bandwidth of host/device transfers is much slower than that of device/device transfers and should be 228 avoided if possible. Global memory is accessed in 128 byte cache lines on current GPUs. Memory 229 accesses should be coalesced in order to achieve maximum bandwidth. Coalescing refers to the 230 grouping of aligned memory load/store operations into a single transaction. For example, a fully 231 coalesced memory read occurs when a warp of 32 threads loads 32 contiguous 4 byte words (128 232 bytes). Fully uncoalesced reads (typical of gather operations) can limit device bandwidth to less 233 than 10% of peak bandwidth (Harris, 2013).
234
• Shared memory 48KB of shared memory is available to each thread block. Shared memory is 235 accessible by all threads in the block. Shared memory has a significantly lower latency than global 236 memory and is typically used as working storage within a thread block. It is sometimes described 237 as a "programmer managed cache". data parallel tasks can be expressed with simple programs without them, GPU primitives may be used 244 to compose more complicated algorithms while retaining high performance, readability and reliability.
245
Understanding which specific tasks can be achieved using parallel primitives and the relative performance 246 of GPU primitives as compared to their CPU counterparts is key to designing effective GPU algorithms. ). It is given in Algorithm 1. Figure 5 shows it in operation: we apply a simple scan with the addition 278 operator to an array of 1's. Given one thread for each input element the scan takes log 2 n = 3 iterations to 279 complete. The algorithm performs O(n log 2 n) addition operations.
280
Given that a sequential scan performs only n addition operations the simple parallel scan is not work 281 efficient. A work efficient parallel algorithm will perform the same number of operations as the sequential 282 algorithm and may provide significantly better performance in practice. A work efficient algorithm is 283 described in (Blelloch, 1990) . The algorithm is separated into two phases, an "upsweep" phase similar 284 to a reduction and a "downsweep" phase . Pseudocode for the upsweep (Algorithm 2) and downsweep 285 (Algorithm 3) phases are given below, following the implementation in (Harris et al., 2007) . 293 This is discussed further in Section 2.4.
294
A scan may also be implemented using warp intrinsics to create fast 32 item prefix sums based on the 295 simple scan in Figure 5 . Code for this is shown in Listing 3. Although the simple scan algorithm is not 296 work efficient, we use this approach for small arrays of size 32. the new position of all zero digits. All "1" digits must be placed after all "0" digits, therefore the final 302 positions of the "1"s can be calculated as the exclusive scan of the "1"s plus the total number of "0"s. The 303 exclusive scan of "1" digits does not need to be calculated as it can be inferred from the array index and 304 12/28 Table 6 . GPU vs CPU Scan Benchmark the exclusive scan of "0"s. For example at index 5 (using 0 based indexing), if our exclusive scan shows a 305 sum of 3 "0"s, then there must be two "1"s because a digit can only be 0 or 1.
306
The basic radix sort implementation only sorts unsigned integers but this can be extended to correctly Tables 5, 6 and 7 show that GPU primitive performance improves relative to the CPU algorithm as the 325 input size is increased, beginning to plateau at very large sizes as the GPU becomes saturated with work.
326
The relatively poor performance at small sizes is due to the overhead of launching GPU kernels. GPU 327 kernel launch times are profiled in (Boyer, 2016) and found to cost between 3 and 14 microseconds. Note 328 that the 1024 element reduction in Table 5 takes approximately 10 microseconds. At small sizes execution 329 time is dominated by kernel launch overhead, making GPU algorithms impractical for processing small 330 batches of data sequentially. Radix sort on the GPU still outperforms std::sort for 1024 elements, despite 331 the small input size. This is because much more work is being done compared to scan or reduction. The 332 kernel overhead is therefore less significant. At large sizes GPU radix sort shows dramatic performance 333 improvements over std::sort-up to two orders of magnitude. 
Scan and Reduce on Multiple Sequences
335
Variations on scan and reduce consider multiple sequences contained within the same input array and 336 identified by key flags. This is useful for building decision trees as the data can be repartitioned into 337 smaller and smaller groups as we build the tree. 338 We will describe an input array as containing either "interleaved" or "segmented" sequences. Table   339 8 shows an example of two interleaved sequences demarcated by flags. Its values are mixed up and do 340 not reside contiguously in memory. This is in contrast to Table 9, with two "segmented" sequences. The 341 segmented sequences reside contiguously in memory. 
Segmented Scan
343
A scan can be performed on the sequences from Table 9 using the conventional scan algorithm described 344 in Section 2.3.2 and modifying the binary associative operator to accept key value pairs. Listing 6 shows 345 an example of a binary associative operator that performs a segmented summation. It resets the sum when 
Segmented Reduce
348
A segmented reduction can be implemented efficiently by applying the segmented scan described above 349 and collecting the final value of each sequence. This is because the last element in a scan is equivalent to 350 a reduction. 
Interleaved Sequences: Multireduce
352
A reduction operation on interleaved sequences is commonly described as a multireduce operation. To 0  0  0  0  0  0  0  0  Instance Id  0  2  3  3  2  0  1 3 Feature value 0.1 0.5 0.9 5.2 3.1 3.6 3.9 4.7 to the approach presented in this work impractical. at a time and scale poorly at higher tree depths because they sequentially execute small batch sizes. Our 460 algorithm performs the following three high level phases for each tree level until the maximum tree depth 461 is reached: (1) find splits, (2) update node positions, and (3) sort node buckets (if necessary). 462
Phase 1: Find splits 463
The first phase of the algorithm finds the best split for each leaf node at the current level. 
Data Layout 465
To facilitate enumeration through all split points, the feature values should be kept in sorted order. Hence, 466 we use the device memory layout shown in Tables 12 and 13 . Each feature value is paired with the ID of 467 the instance it belongs to as well as the leaf node it currently resides in. Data is stored in sparse column 468 major format and instance IDs are used to map back to gradient pairs for each instance. All data is stored 469 in arrays in device memory. The tree itself can be stored in a fixed length device array as it is strictly 470 binary and has a maximum depth known ahead of time. 
Block Level Parallelism 472
Given the above data layout notice that each feature resides in a contiguous block and may be processed 473 independently. In order to calculate the best split for the root node of the tree we greedily select the best 474 split within each feature, delegating a single thread block per feature. The best splits for each feature are 475 then written out to global memory and are reduced by a second kernel. A downside of this approach is 476 that when the number of features is not enough to saturate the number of streaming multiprocessors-the 477 hardware units responsible for executing a thread block-the device will not be fully utilised. f0   f1  f2  Node Id  2  1  2  2  1  2  1  2  Instance Id  0  2  3  3  2  0  1 3 Feature value 0.1 0.5 0.9 5.2 3.1 3.6 3.9 4.7 Table 15. Interleaved Node Buckets   f0  f1  f2  Node Id  1  2  2  1  2  Instance Id  0  2  3  3  2  1  0 3 Feature value 0.5 0.1 0.9 5.2 3.1 3.9 3.6 4.7 if (node active ) { exclusive scan output = scan result ; } } in interleaved order. In a decision tree algorithm, when enumerating through feature values to find a 544 split, we should not choose a split that falls between two elements with the same value. This is because a 545 decision rule will not be able to separate elements with the same value. For a value to be considered as a 546 split the corresponding item must be the leftmost item with that feature value for that particular node (we 547 could also arbitrarily take the rightmost value).
548
Because the node buckets are interleaved it is not possible to simply check the item to the left to see 549 if the feature value is the same-the item to the left of a given item may reside in a different node. To 550 check if an item with a certain feature value is the leftmost item with that value in its node bucket we can 551 formulate a scan with a special binary associative operator. First each item is assigned a bit vectorx of 552 length n + 1 where n is the number of buckets. If the item resides within bucket i then x i will be set to 1.
553
If the item's feature value is distinct from the value of the item directly to the left (irrespective of bucket) 554 then x n+1 is set to 1. All other bits are set to 0.
555
We can then define a binary associative operator as follows:
Bit x n+1 acts as a segmentation flag, resetting the scan so many small scans are performed across 556 groups of items with the same feature value. Scanning the bucket flags with a logical or operator 557 determines which node buckets are represented in the items to the left of the current item. Therefore 558 within a group of items with the same feature value, if the current item's bucket flag is set to 0 for the 559 bucket it resides in, the item represents the leftmost item with that value in its bucket. This item can then 560 be used as a split point.
561
In practice a 64-bit integer is used as the bit vector in order to hold a maximum of 33 bits at the 6th 562 level of the tree. Given a reduction, scan and the above method for finding unique feature values we have all the machinery 569 necessary to enumerate splits and select the best. The complete algorithm for a thread block processing a 570 single feature at a given tree level is shown in Algorithm 7.
571
The output of this algorithm contains the best splits for each leaf node for a given feature. Each thread 572 block outputs the best splits for its assigned feature. These splits are then further reduced by a global 573 kernel to find the best splits for any feature. The sorting implementation of the split finding algorithm operates on feature value data grouped into node 576 buckets. Given data sorted by node ID first and then feature values second we can perform segmented 577 scan/reduce operations over an entire feature only needing a constant amount of temporary storage.
578
The segmented reduction to find gradient pair sums for each node is implemented as a segmented 579 sum scan, storing the final element from each segment as the node sum. Another segmented scan is then 580 performed over the input feature to get the exclusive scan of gradient pairs. After scanning each tile 581 the split gain is calculated using the scan and reduction as input and the best splits are stored in shared 582 memory.
583
The segmented scan is formulated by performing an ordinary scan over key value pairs with a binary 584 associative operator that resets the sum when the key changes. In this case the key is the current node 585 bucket and the value is the gradient pair. The operator is shown in Equation 6 for each feature, and each node. This is reduced by a global kernel to find the best splits for each node, of 589 any feature. does not have a value for that feature (missing value) will have its node position left unchanged as per the 597 missing direction. Because we now have the updated node position for each instance we write these node 598 positions back to each feature value.
599
To illustrate this with an example, Figure 9 shows the state of a decision tree after having calculated 600 splits for level 1. The node positions in the data structure used for split finding (Table 17 ) must be updated 601 before proceeding to calculate the splits for level 2. To do this we update the array in Table 18 that maps 602 instances to a node.
603
First we update the node ID map in the missing direction. All instances residing in node 1 are updated 604 in the right direction to node 4. Instances residing in node 2 are updated in the left direction to node 5. Id  1  2  2  1  1  2  2  Instance Id  0  2  1  3  0  1 2 Feature value 0.75 0.5 0.9 2.7 4.1 3.6 3.9 Figure 10 shows the performance of the GPU algorithm across varying problem sizes using configura-662 tion #1. The experiment is performed on subsets of the Bosch dataset using 20 boosting iterations. The
663
GPU algorithm's time increases linearly with respect to the number of input rows. It is approximately 664 equal to the CPU algorithm at 10,000 rows and always faster thereafter for this dataset. This gives an idea 665 of the minimum batch size at which the GPU algorithm begins to be effective.
666
In Figure 11 we show the performance of the Titan X from configuration #2 against configuration 667 #3 (a high-end 24 core server) on the Yahoo dataset with 500 boosting iterations and varying numbers 668 of threads. The Titan X outperforms the 24 core machine by approximately 1.2x, even if the number of 669 threads for the 24 core machine is chosen optimally. to keep using the interleaved approach. Note that for the first level the interleaved algorithm and the 677 sorting algorithm are equivalent as there is only one node bucket.
678
The accuracy shows some variance as the interleaved split finding algorithm records feature splits 679 in a slightly different way as compared to the sorting algorithm. Both versions split the training set in 680 exactly the same place but the sorting version records the feature value for the split as halfway between 681 two instances and the interleaved version records the split point as slightly less than the rightmost instance.
682
Because of this, when we use the model on the unseen test set the results can be marginally different for 683 the two versions.
684
Surprisingly the interleaved algorithm is still faster than the sorting algorithm at level 5 despite the 685 fact that the multiscan and multireduce operations must sequentially iterate over 2 5 = 32 nodes at each 686 step. This shows that executing instructions on elements held in registers or shared memory carries a very 687 low cost relative to uncoalesced reordering of elements in device memory, as is performed when radix 688 sorting. 690 We show the device memory consumption in Table 30 for all three benchmark datasets. Each dataset can 691 be fit entirely within the 12GB device memory of a Titan X card.
Memory consumption
692
In Table 31 we show the memory consumption of the original CPU algorithm for comparison. Host 693 memory consumption was evaluated using the valgrind massif 2 heap profiler tool. Device memory usage 694 2 http://valgrind.org/docs/manual/ms-manual.html
26/28
Dataset Device memory (GB) YLTR 4.03 Higgs 11.32 Bosch 8.28 
