Chapter 13 On the Efficient Implementation of a Real-time Kd-tree Construction Algorithm

Abstract

Abstract: The kd-tree is one of the most commonly used spatial data structures for a variety of graphics applications because of its reliably high acceleration performance. Several years ago, Zhou et al. devised an effective kd-tree construction algorithm that runs entirely on a GPU. In this chapter, we present improved GPU programming techniques for implementing the algorithm more efficiently on current GPUs. One of the major ideas is to reduce the number of necessary kernel functions by replacing the essential, segmented-scan, and reduction computations by simpler per-block atomic operations, thereby alleviating the overheads from multiple synchronous kernel calls. Combined with the efficient implementation of intrablock scan and reduction, using recently introduced intrinsic functions, these changes achieve remarkable performance enhancement to the kd-tree construction process. Through an example of real-time ray tracing for dynamic scenes of nontrivial complexity, we demonstrate that the proposed GPU techniques can be exploited effectively for various real-time applications. Background and our contribution For many important applications in computer graphics, such as ray tracing and those relying on particle-based computations, adopting a proper acceleration structure will affect their run-time performance greatly. Among the variety of spatial data structures, the kd-tree is frequently used because of its reliably high acceleration performance. Compared to other techniques such as grids and boundingvolume hierarchies, its relatively higher construction cost has been regarded as a drawback, despite efforts to develop an optimized algorithm (e.g., [1] and Wu et al. In this chapter, we present enhanced CUDA programming techniques for implementing the GPU method of Zhou et al. Optimizations for the large-node stage In Zhou et al.'s method, the upper levels of the kd-tree were constructed using a node-splitting scheme that comprised spatial median splitting and empty-space maximizing. In particular, based on the observation that the assumptions made in the SAH may often be inaccurate for large nodes, this stage of computation, called the large-node stage, simply selects the spatial median of the longest axis of the axis-aligned bounding box (AABB) of a node as its split position. For efficient parallel implementation on a GPU, all triangles in each large node are grouped in-3 to chunks of fixed size (i.e., 256), parallelizing the computation over the triangles in the chunks. (Note that the triangles and chunks are mapped to the threads and blocks, respectively, in the CUDA implementation.) Triangle sorting with respect to splitting planes The large-node stage iterates the node-splitting process until no large node is left. In Algorithm 2 [11], the most time-consuming parts of each iteration are the fourth and fifth steps, corresponding to lines 24-34 and 35-40, respectively, where the triangles for each large node are first sorted with respect to the splitting plane, and the triangle numbers of the resulting two child nodes are then counted. In this subsection, we present two different approaches to implementing these two steps on a GPU. We then analyze their performance in the section on experimental results. Implementation using standard data-parallel primitives As was done in For each triangle in a large node, mapped to a CUDA thread, the key issue is how to efficiently calculate its address(es) in parallel in the new triangle index list next list, whose production is complicated because of the simultaneous subdivisions of the large nodes in the current list active list. For this, a kernel is first executed over every thread block corresponding to a chunk of triangles, classifying each triangle against the respective splitting plane, and generating two bit-flag sequences of size 256 per chunk triangle bit flags. Then, for each of these, an exclusive scan is performed using the shared memory of the GPU, resulting in the local triangle offset sequences. In addition, the kernel counts the number of triangles in each bit-flag sequence by simple addition, and places this number in an array in the global memory. (Note that, for the example in Implementation using atomic operations The triangle-sorting technique described in the previous subsection requires a segmented scan to be carried out twice on the data sequences stored in the global memory, and can easily be implemented using the data-parallel primitive functions provided by the CUDPP library [2], for example. Although very effective, such an approach forces the run-time execution to be split into a sequence of synchronous kernel calls, whose overheads will impact the run-time performance adversely. To address this, observe that a side effect of using a standard segmented-scan method is that the relative order of triangle indices within a large node made of multiple chunks is retained in the respective child nodes. Such a property is important when the order of elements is essential, as in a radix sort algorithm, for example. However, retaining the strict order is unnecessary in the kd-tree construction algorithm because the order of triangles within a kd-tree's leaf node is not critical in the later ray-tracing stage. This observation allows us to implement the triangle-sorting computation by using a single faster-running kernel and replacing the segmented-scan operations with simpler per-chunk atomic operations that are supported by the CUDA API. In the new implementation, the memory configuration for the triangle index lists is slightly different, as shown in For each chunk of triangle indices in the current list, the new kernel repeats the same computation until the triangle numbers are calculated in the array [A]. A representative thread then carries out two atomic additions, respectively fetching the local offsets, one for each child node, from the corresponding atomic variables and simultaneously adding the triangle counts to them, through which we will know where to start storing the sorted triangle indices in the child nodes. Then, once per child node, each thread checks the corresponding bit flag in the triangle bit flag array, and, if set to on, puts its triangle index in the proper place in the next triangle index list, whose location can easily be deduced from the fetched offset and the offset in the triangle offsets array. In this implementation, the two segmented scans over the arrays in the global memory have been replaced by two atomic-add operations per thread block. While the computation time is already reduced markedly by this change, two per-block scans, one for each child, must still be carried out per chunk to compute the triangle offsets. While such scans can be performed effectively in the shared memory by using a standard scan method AABB computations for active large nodes Another time-consuming part of the large-node stage is the second step (lines 9 to 14 of Algorithm 2), in which the AABB of all triangles in each node is calculated. The optimization techniques described in the previous subsection can also be applied to this AABB computation. The standard reduction in the shared memory for computing per-chunk bounding boxes can be implemented more efficiently on the GPU by a simple modification of the scan implementation using the intrinsic shuffle function __shfl_up(). Then, via three pairs of atomic min and max operations, the result of each chunk reduction is written in parallel to the location in the global memory that corresponds to the large node to which the chunk belongs. Although such atomic operations are still regarded as expensive on current GPUs, we observe that our single-kernel implementation based on atomic operations runs significantly faster on the GPU than the original implementation, which needed to perform segmented reductions six times. Optimizations for the small-node stage After all large nodes are split into nodes whose triangle numbers do not exceed 64, the small-node stage starts. Because sufficient nodes are available, the computation in this stage is parallelized over nodes instead of triangles, evaluating the precise SAH metric to find the best splitting plane for each small node. The key to the efficient implementation of this stage is exploiting a preprocessed data structure that facilitates the iterative node-splitting process. For each initial small node, called the small root node, up to 384 (= 64 (triangles) * 3 (x-, y-, z-axes) * 2 (min/max)) splitting-plane candidates are first collected from triangles in the node. Then, for each candidate, two 8-byte bit masks are generated to represent the triangle sets contained in both sides. To represent this information, 20 bytes of memory per node is necessary, including the 4 bytes used to store the location of the splitting plane, implying that up to 7,680 (=20 * 384) bytes of memory may be necessary for each small root node. It is important to choose an appropriate memory layout for the representation because the nontrivial amount of data will be accessed in parallel during the small-node stage. Although several different configurations are possible, we observed that the combination of a 4-byte access from the global memory for the splitting plane location and another 16-byte access from the texture memory for the triangle sets incurred the lowest memory latency on the GPU tested. (Our analysis of the generated PTX code showed that 16 bytes of data were fetched from texture memory even for a 4-byte access command.) With this representation, the SAH cost evaluation and triangle sorting in the subsequent node-splitting step can be performed efficiently using simple bitwise 8 operations. In this process, a parallel bit-counting operation is carried out very frequently to obtain the numbers of triangles in the child nodes. Whereas the method presented in Experimental results To measure the performance improvement achieved by the optimization techniques presented here, we first implemented the kd-tree construction algorithm of Zhou et al. on an NVIDIA GeForce GTX 680 GPU, effectively as described in the original paper. In doing this, we used the scan and reduction techniques described in Concluding remarks In this chapter, we have presented efficient GPU programming techniques for implementing the well-known kd-tree construction algorith

    Similar works