
    Chapter 13 On the Efficient Implementation of a Real-time Kd-tree Construction Algorithm

    Abstract: The kd-tree is one of the most commonly used spatial data structures for a variety of graphics applications because of its reliably high acceleration performance. Several years ago, Zhou et al. devised an effective kd-tree construction algorithm that runs entirely on a GPU. In this chapter, we present improved GPU programming techniques for implementing the algorithm more efficiently on current GPUs. One of the major ideas is to reduce the number of necessary kernel functions by replacing the essential segmented-scan and reduction computations with simpler per-block atomic operations, thereby alleviating the overheads of multiple synchronous kernel calls. Combined with an efficient implementation of intrablock scan and reduction using recently introduced intrinsic functions, these changes achieve a remarkable performance enhancement in the kd-tree construction process. Through an example of real-time ray tracing for dynamic scenes of nontrivial complexity, we demonstrate that the proposed GPU techniques can be exploited effectively for various real-time applications.

    Background and our contribution
    For many important applications in computer graphics, such as ray tracing and those relying on particle-based computations, adopting a proper acceleration structure greatly affects run-time performance. Among the variety of spatial data structures, the kd-tree is frequently used because of its reliably high acceleration performance. Compared to other techniques such as grids and bounding-volume hierarchies, its relatively higher construction cost has been regarded as a drawback, despite efforts to develop optimized algorithms (e.g., [1] and Wu et al.). In this chapter, we present enhanced CUDA programming techniques for implementing the GPU method of Zhou et al. more efficiently on current GPUs.

    Optimizations for the large-node stage
    In Zhou et al.'s method, the upper levels of the kd-tree are constructed using a node-splitting scheme that combines spatial median splitting and empty-space maximizing. In particular, based on the observation that the assumptions made in the surface area heuristic (SAH) are often inaccurate for large nodes, this stage of computation, called the large-node stage, simply selects the spatial median of the longest axis of a node's axis-aligned bounding box (AABB) as its split position. For efficient parallel implementation on a GPU, all triangles in each large node are grouped into chunks of fixed size (i.e., 256), and the computation is parallelized over the triangles in the chunks. (Note that the triangles and chunks are mapped to the threads and blocks, respectively, in the CUDA implementation.)

    Triangle sorting with respect to splitting planes
    The large-node stage iterates the node-splitting process until no large node is left. In Algorithm 2 [11], the most time-consuming parts of each iteration are the fourth and fifth steps, corresponding to lines 24-34 and 35-40, respectively, where the triangles of each large node are first sorted with respect to the splitting plane and the triangle counts of the resulting two child nodes are then computed. In this subsection, we present two different approaches to implementing these two steps on a GPU; we analyze their performance in the section on experimental results.

    Implementation using standard data-parallel primitives
    As in the original method, for each triangle in a large node, mapped to a CUDA thread, the key issue is how to efficiently calculate its address(es) in parallel in the new triangle index list (next list), whose production is complicated by the simultaneous subdivision of the large nodes in the current list (active list). For this, a kernel is first executed over every thread block corresponding to a chunk of triangles, classifying each triangle against the respective splitting plane and generating two bit-flag sequences of size 256 per chunk (triangle bit flags). Then, for each of these, an exclusive scan is performed in the shared memory of the GPU, resulting in the local triangle offset sequences. In addition, the kernel counts the number of triangles in each bit-flag sequence by simple addition and places this number in an array in the global memory.

    Implementation using atomic operations
    The triangle-sorting technique described in the previous subsection requires a segmented scan to be carried out twice on data sequences stored in the global memory, and can easily be implemented using the data-parallel primitive functions provided by the CUDPP library [2], for example. Although very effective, such an approach forces the run-time execution to be split into a sequence of synchronous kernel calls, whose overheads adversely impact run-time performance. To address this, observe that a side effect of using a standard segmented-scan method is that the relative order of triangle indices within a large node made of multiple chunks is retained in the respective child nodes. Such a property is important when the order of elements is essential, as in a radix sort, for example. However, retaining the strict order is unnecessary in the kd-tree construction algorithm because the order of triangles within a kd-tree's leaf node is not critical in the later ray-tracing stage. This observation allows us to implement the triangle-sorting computation with a single, faster-running kernel, replacing the segmented-scan operations with simpler per-chunk atomic operations supported by the CUDA API. In the new implementation, the memory configuration for the triangle index lists is slightly different. For each chunk of triangle indices in the current list, the new kernel repeats the same computation until the triangle counts are calculated in the array [A]. A representative thread then carries out two atomic additions, one per child node, fetching the current offsets from the corresponding atomic variables while simultaneously adding the triangle counts to them; this determines where to start storing the sorted triangle indices of the child nodes. Then, once per child node, each thread checks the corresponding bit flag in the triangle bit flags array and, if it is set, puts its triangle index in the proper place in the next triangle index list, whose location is easily deduced from the fetched offset and the offset in the triangle offsets array. In this implementation, the two segmented scans over arrays in the global memory have been replaced by two atomic-add operations per thread block. While the computation time is already reduced markedly by this change, two per-block scans, one for each child, must still be carried out per chunk to compute the triangle offsets.
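    To make the atomic-operation idea above concrete, here is a hedged CUDA sketch of a per-chunk sorting kernel: each block handles one chunk, classifies its triangles against the owning node's splitting plane, obtains local offsets with shared-memory atomics, reserves the chunk's output ranges in the children's index lists with two global atomicAdd() calls, and scatters the indices. The chunk descriptor, list layouts, and classification test are illustrative assumptions; the chapter's actual kernel keeps ordered per-chunk offsets via an intra-block scan and uses a different memory configuration.

```cuda
#include <cuda_runtime.h>

#define CHUNK_SIZE 256          // fixed chunk size used by the large-node stage

// Hypothetical chunk descriptor: the large node a chunk belongs to and the
// range of triangle indices it covers in the current (active) index list.
struct Chunk {
    int node;                   // index of the owning large node
    int first;                  // start of the chunk's range in the active list
    int count;                  // number of triangles in the chunk (<= CHUNK_SIZE)
};

// One thread block per chunk, one thread per triangle. The two segmented
// scans of the standard implementation are replaced by shared-memory atomics
// (local offsets) plus two global atomic adds per chunk (reserving the
// chunk's output ranges), exploiting the observation that the order of
// triangles inside a node does not matter.
__global__ void sortTrianglesAtomic(const Chunk*  chunks,
                                    const int*    activeList,  // current triangle indices
                                    const float*  triMin,      // triangle AABB min along the node's split axis
                                    const float*  triMax,      // triangle AABB max along the node's split axis
                                    const float*  splitPos,    // split position per large node
                                    int*          nextList,    // triangle index list of the child nodes
                                    int*          childCursor) // per-child atomic cursors, pre-initialized
                                                               // to each child's start offset in nextList
{
    __shared__ int sLeftCount, sRightCount;
    __shared__ int sLeftBase,  sRightBase;

    const Chunk c   = chunks[blockIdx.x];
    const int   tid = threadIdx.x;

    if (tid == 0) { sLeftCount = 0; sRightCount = 0; }
    __syncthreads();

    // 1. Classify this thread's triangle against the splitting plane; a
    //    triangle straddling the plane is sent to both children.
    int tri = -1, leftOff = -1, rightOff = -1;
    if (tid < c.count) {
        tri = activeList[c.first + tid];
        if (triMin[tri] <  splitPos[c.node]) leftOff  = atomicAdd(&sLeftCount,  1);
        if (triMax[tri] >= splitPos[c.node]) rightOff = atomicAdd(&sRightCount, 1);
    }
    __syncthreads();

    // 2. A representative thread reserves this chunk's slices of the child
    //    index lists with one atomic add per child (two per chunk in total).
    if (tid == 0) {
        sLeftBase  = atomicAdd(&childCursor[2 * c.node + 0], sLeftCount);
        sRightBase = atomicAdd(&childCursor[2 * c.node + 1], sRightCount);
    }
    __syncthreads();

    // 3. Scatter the triangle indices into the reserved ranges.
    if (leftOff  >= 0) nextList[sLeftBase  + leftOff ] = tri;
    if (rightOff >= 0) nextList[sRightBase + rightOff] = tri;
}
```

    A launch such as sortTrianglesAtomic<<<numChunks, CHUNK_SIZE>>>(...) processes every chunk of the active list in a single kernel call; the per-block scans mentioned in the text are what this sketch sidesteps by handing out local offsets in arbitrary order.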
    While such scans can be performed effectively in the shared memory by using a standard scan method, they can be carried out even more efficiently with the recently introduced intrinsic shuffle functions.

    AABB computations for active large nodes
    Another time-consuming part of the large-node stage is the second step (lines 9 to 14 of Algorithm 2), in which the AABB of all triangles in each node is calculated. The optimization techniques described in the previous subsection can also be applied to this AABB computation. The standard reduction in the shared memory for computing per-chunk bounding boxes can be implemented more efficiently on the GPU by a simple modification of the scan implementation that uses the intrinsic shuffle function __shfl_up(). Then, via three pairs of atomic min and max operations, the result of each chunk reduction is written in parallel to the location in the global memory that corresponds to the large node to which the chunk belongs. Although such atomic operations are still regarded as expensive on current GPUs, we observe that our single-kernel implementation based on atomic operations runs significantly faster on the GPU than the original implementation, which needed to perform segmented reductions six times.
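    The following sketch illustrates the two ingredients named above for the per-chunk AABB computation: a warp-level maximum built as a small modification of a shuffle-based scan, and a CAS loop that emulates an atomic float max (CUDA exposes atomicMin/atomicMax only for integer types). The chapter's text refers to the legacy __shfl_up() intrinsic; the sketch uses the current __shfl_up_sync() form. It reduces only the maximum x-extent, and the surrounding data layout and names are assumptions, not the chapter's code.

```cuda
#include <cfloat>

// Hypothetical chunk descriptor, as in the previous sketch.
struct Chunk { int node, first, count; };

// Warp-wide inclusive max-"scan" built from __shfl_up_sync(); after the loop
// the last lane of each warp holds the maximum of the warp's 32 inputs.
// The min counterpart is analogous (fminf), as are the y and z axes.
__device__ float warpMaxScan(float v)
{
    const unsigned lane = threadIdx.x & 31;
    for (int d = 1; d < 32; d <<= 1) {
        float other = __shfl_up_sync(0xffffffff, v, d);
        if (lane >= (unsigned)d) v = fmaxf(v, other);
    }
    return v;
}

// CAS loop emulating an atomic float max; atomicMinFloat is symmetric.
__device__ void atomicMaxFloat(float* addr, float value)
{
    int* addrAsInt = (int*)addr;
    int  old       = *addrAsInt;
    while (__int_as_float(old) < value) {
        int assumed = old;
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
        if (old == assumed) break;
    }
}

#define CHUNK_SIZE 256

// One block per chunk: reduce the maximum x-extent of the chunk's triangles
// and merge it into the owning node's AABB with a single atomic per chunk.
__global__ void chunkMaxX(const Chunk* chunks,
                          const int*   activeList,  // current triangle indices
                          const float* triMaxX,     // per-triangle AABB max x
                          float*       nodeMaxX)    // per-large-node AABB max x
{
    __shared__ float warpMax[CHUNK_SIZE / 32];

    const Chunk c   = chunks[blockIdx.x];
    const int   tid = threadIdx.x;

    // Threads beyond the chunk's triangle count contribute -FLT_MAX.
    float v = (tid < c.count) ? triMaxX[activeList[c.first + tid]] : -FLT_MAX;

    v = warpMaxScan(v);
    if ((tid & 31) == 31) warpMax[tid >> 5] = v;
    __syncthreads();

    // The first warp reduces the per-warp maxima and issues the atomic.
    if (tid < 32) {
        float w = (tid < CHUNK_SIZE / 32) ? warpMax[tid] : -FLT_MAX;
        w = warpMaxScan(w);
        if (tid == 31) atomicMaxFloat(&nodeMaxX[c.node], w);
    }
}
```

    The per-node result array (nodeMaxX here) is assumed to be initialized to -FLT_MAX before the launch; the full computation repeats this pattern for the minima and for the y and z axes, giving the three pairs of atomic min/max operations per chunk mentioned above.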
    Optimizations for the small-node stage
    After all large nodes have been split into nodes whose triangle counts do not exceed 64, the small-node stage starts. Because sufficiently many nodes are now available, the computation in this stage is parallelized over nodes instead of triangles, evaluating the precise SAH metric to find the best splitting plane for each small node. The key to the efficient implementation of this stage is exploiting a preprocessed data structure that facilitates the iterative node-splitting process. For each initial small node, called the small root node, up to 384 (= 64 (triangles) * 3 (x-, y-, z-axes) * 2 (min/max)) splitting-plane candidates are first collected from the triangles in the node. Then, for each candidate, two 8-byte bit masks are generated to represent the triangle sets contained on its two sides. Representing this information requires 20 bytes of memory per candidate, including the 4 bytes used to store the location of the splitting plane, so up to 7,680 (= 20 * 384) bytes of memory may be needed for each small root node. It is important to choose an appropriate memory layout for this representation because the nontrivial amount of data is accessed in parallel during the small-node stage. Although several different configurations are possible, we observed that the combination of a 4-byte access from the global memory for the splitting-plane location and a 16-byte access from the texture memory for the triangle sets incurred the lowest memory latency on the GPU tested. (Our analysis of the generated PTX code showed that 16 bytes of data were fetched from texture memory even for a 4-byte access command.) With this representation, the SAH cost evaluation and triangle sorting in the subsequent node-splitting step can be performed efficiently using simple bitwise operations. In this process, a parallel bit-counting operation is carried out very frequently to obtain the numbers of triangles in the child nodes.
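    As a rough illustration of how the 8-byte bit masks support the SAH evaluation, the sketch below counts the triangles on each side of a splitting-plane candidate with the 64-bit population-count intrinsic __popcll() and feeds the counts into a simplified SAH cost. The record layout, cost constants, and function names are assumptions for illustration; in particular, the chapter stores the split position and the masks in different memory spaces, and its exact cost formula is not reproduced here.

```cuda
#include <cstdint>

// Hypothetical per-candidate record for a small root node: the splitting-plane
// location plus two 64-bit masks over the node's (at most 64) triangles.
// Conceptually this is the 20 bytes (4 + 2 * 8) mentioned in the text; the
// chapter itself keeps the position and the masks in separate memory spaces.
struct SplitCandidate {
    float    pos;        // splitting-plane location
    uint64_t leftMask;   // bit i set => triangle i lies (at least partly) left of the plane
    uint64_t rightMask;  // bit i set => triangle i lies (at least partly) right of the plane
};

// Simplified SAH cost: C = C_t + C_i * (P_L * N_L + P_R * N_R), with assumed
// traversal/intersection constants and precomputed area ratios P_L and P_R.
__device__ float sahCost(int nLeft, int nRight, float pLeft, float pRight)
{
    const float Ct = 1.0f;   // assumed traversal cost
    const float Ci = 1.5f;   // assumed intersection cost
    return Ct + Ci * (pLeft * nLeft + pRight * nRight);
}

// Evaluate one candidate for a small node whose current triangle set is given
// as a 64-bit mask relative to its small root node; __popcll() performs the
// parallel bit count used to obtain the child triangle counts.
__device__ float evaluateCandidate(const SplitCandidate& c, uint64_t nodeTris,
                                   float pLeft, float pRight,
                                   int* nLeft, int* nRight)
{
    *nLeft  = __popcll(nodeTris & c.leftMask);
    *nRight = __popcll(nodeTris & c.rightMask);
    return sahCost(*nLeft, *nRight, pLeft, pRight);
}
```

    Once the cheapest candidate is found, sorting the triangles into the two children amounts to the same bitwise ANDs (nodeTris & leftMask and nodeTris & rightMask), which is why this stage needs no per-triangle scan.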
    Experimental results
    To measure the performance improvement achieved by the optimization techniques presented here, we first implemented the kd-tree construction algorithm of Zhou et al. on an NVIDIA GeForce GTX 680 GPU, effectively as described in the original paper. In doing this, we used the scan and reduction techniques described in previous work.

    Concluding remarks
    In this chapter, we have presented efficient GPU programming techniques for implementing the well-known kd-tree construction algorithm of Zhou et al.

    Otimização em GPU de bounding volume hierarchies para ray tracing

    Advisor: Hélio Pedrini. Master's dissertation in Computer Science, Universidade Estadual de Campinas, Instituto de Computação. Abstract: Ray tracing methods are well known for producing very realistic images at the expense of a high computational effort. Most of the cost associated with those methods comes from finding the intersections between the massive number of rays that need to be traced and the scene geometry. Special data structures were proposed to speed up those calculations by indexing and organizing the geometry so that only a subset of it has to be effectively checked for intersections. One such construct is the Bounding Volume Hierarchy (BVH), a tree-like structure used to group 3D objects hierarchically. Recently, a significant amount of effort has been put into accelerating the construction of those structures and increasing their quality. We present a new method for building high-quality BVHs on manycore systems. Our method is an extension of the current state of the art in GPU BVH construction, Treelet Restructuring Bounding Volume Hierarchy (TRBVH), and consists of optimizing an already existing tree by rearranging subsets of its nodes using an agglomerative clustering approach. We implemented our solution for the NVIDIA Kepler architecture using CUDA and tested it on sixteen distinct scenes that are commonly used to evaluate the performance of acceleration structures. We show that our implementation is capable of producing trees whose quality is equivalent to the ones generated by TRBVH for those scenes, while being about 30% faster to do so.
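    To make the agglomerative clustering idea in the abstract above more concrete, here is a minimal host-side sketch (plain C++, valid as CUDA host code) of the underlying greedy step: within a small treelet, repeatedly merge the pair of clusters whose combined bounding box has the smallest surface area. The types, the O(n^3) loop, and the surface-area merge metric are illustrative assumptions only; the dissertation's method restructures treelets of an existing BVH in parallel on the GPU and differs in both data layout and cost handling.

```cuda
#include <cfloat>
#include <vector>

struct AABB {
    float min[3], max[3];
};

// Surface area of an AABB, used here as the agglomerative merge cost.
static float surfaceArea(const AABB& b)
{
    const float dx = b.max[0] - b.min[0];
    const float dy = b.max[1] - b.min[1];
    const float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

static AABB merge(const AABB& a, const AABB& b)
{
    AABB m;
    for (int i = 0; i < 3; ++i) {
        m.min[i] = a.min[i] < b.min[i] ? a.min[i] : b.min[i];
        m.max[i] = a.max[i] > b.max[i] ? a.max[i] : b.max[i];
    }
    return m;
}

// Greedy agglomerative clustering over the leaves of one treelet: repeatedly
// merge the two clusters whose union has the smallest surface area until a
// single cluster (the restructured subtree root) remains.
static AABB clusterTreelet(std::vector<AABB> clusters)
{
    while (clusters.size() > 1) {
        size_t bestI = 0, bestJ = 1;
        float  bestCost = FLT_MAX;
        for (size_t i = 0; i < clusters.size(); ++i)
            for (size_t j = i + 1; j < clusters.size(); ++j) {
                const float cost = surfaceArea(merge(clusters[i], clusters[j]));
                if (cost < bestCost) { bestCost = cost; bestI = i; bestJ = j; }
            }
        // A real builder would create a new internal node here; this sketch
        // only keeps the merged bounds.
        clusters[bestI] = merge(clusters[bestI], clusters[bestJ]);
        clusters.erase(clusters.begin() + bestJ);
    }
    return clusters.front();
}
```

    Merging bottom-up by smallest combined surface area is what distinguishes the agglomerative approach from the top-down splitting used by most builders.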

    Evaluating the Performance of Vulkan GLSL Compute Shaders in Real-Time Ray-Traced Audio Propagation Through 3D Virtual Environments

    Real-time ray tracing is a growing area of interest with applications in audio processing. However, real-time audio processing comes with strict performance requirements, which parallel computing is often used to meet. As graphics processing units (GPUs) have become more powerful and programmable, general-purpose computing on graphics processing units (GPGPU) has turned GPUs into extremely powerful parallel processors, leading them to become more prevalent in the domain of audio processing through platforms such as CUDA. The aim of this research was to investigate the potential of GLSL compute shaders in the domain of real-time audio processing, specifically regarding real-time ray tracing tasks. To do this, a number of GLSL compute shaders were created, along with a C++ Vulkan application with which to execute them. These shaders propagate audio through a virtual environment using ray tracing, and implement 3D space partitioning and ray intersection prediction in order to gauge the effectiveness of these optimisations for this task. Statistically significant results show that the GLSL compute shaders successfully propagated audio through a virtual environment, returning results to the host system in real time, within 30 milliseconds. However, while this capability was shown, highly detailed virtual environments prevented results from being returned in real time, indicating potential for future research and optimisation.

    Ray tracing of dynamic scenes

    In the last decade, ray tracing performance has reached interactive frame rates for nontrivial scenes, which has roused the desire to also ray trace dynamic scenes. Changing the geometry of a scene, however, invalidates the precomputed auxiliary data structures needed to accelerate ray tracing. In this thesis we review and discuss several approaches to deal with the challenge of ray tracing dynamic scenes. In particular, we present the motion decomposition approach, which avoids the invalidation of acceleration structures due to changing geometry. To this end, the animated scene is analyzed in a preprocessing step and split into coherently moving parts. Because the relative movement of the primitives within each part is small, it can be handled by special, pre-built kd-trees. Motion decomposition enables ray tracing of predefined animations and skinned meshes at interactive frame rates. Our second main contribution is the streamed binning approach. It approximates the evaluation of the cost function that governs the construction of optimized kd-trees and BVHs. As a result, construction speed, especially for BVHs, can be increased by an order of magnitude while still maintaining their high quality for ray tracing.
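    For readers unfamiliar with binned cost-function evaluation, the sketch below shows the general shape of the idea the abstract refers to: primitive centroids are streamed once into a small number of bins along one axis, and an SAH-style cost is evaluated at the bin boundaries from the accumulated counts and bounds rather than from exactly sorted primitive positions. The bin count, the simplified cost expression, and all names are assumptions for illustration and do not reproduce the thesis's streamed binning implementation.

```cuda
#include <cfloat>
#include <vector>

struct AABB { float min[3], max[3]; };

static float surfaceArea(const AABB& b)
{
    const float dx = b.max[0] - b.min[0];
    const float dy = b.max[1] - b.min[1];
    const float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

static AABB emptyBox()
{
    AABB b;
    for (int i = 0; i < 3; ++i) { b.min[i] = FLT_MAX; b.max[i] = -FLT_MAX; }
    return b;
}

static void expand(AABB& a, const AABB& b)
{
    for (int i = 0; i < 3; ++i) {
        if (b.min[i] < a.min[i]) a.min[i] = b.min[i];
        if (b.max[i] > a.max[i]) a.max[i] = b.max[i];
    }
}

// Approximate SAH evaluation by streaming primitive centroids into K bins
// along one axis; assumes axisMax > axisMin. Returns the cheapest cost found
// and writes the chosen bin boundary (1..K-1) to *bestBoundary.
float binnedSahBestSplit(const std::vector<AABB>& prims, int axis,
                         float axisMin, float axisMax, int* bestBoundary)
{
    const int K = 16;                                   // assumed bin count
    struct Bin { AABB bounds; int count; } bins[K];
    for (int i = 0; i < K; ++i) { bins[i].bounds = emptyBox(); bins[i].count = 0; }

    // One streaming pass: drop each primitive's centroid into a bin.
    const float scale = K / (axisMax - axisMin);
    for (const AABB& p : prims) {
        const float c = 0.5f * (p.min[axis] + p.max[axis]);
        int b = (int)((c - axisMin) * scale);
        if (b < 0) b = 0;
        if (b >= K) b = K - 1;
        bins[b].count += 1;
        expand(bins[b].bounds, p);
    }

    // Left-to-right sweep accumulates bounds and counts for each boundary.
    AABB leftBox[K]; int leftCount[K];
    AABB acc = emptyBox(); int n = 0;
    for (int i = 0; i < K - 1; ++i) {
        expand(acc, bins[i].bounds); n += bins[i].count;
        leftBox[i] = acc; leftCount[i] = n;
    }

    // Right-to-left sweep evaluates the SAH-style cost at every boundary.
    float bestCost = FLT_MAX;
    acc = emptyBox(); n = 0;
    for (int i = K - 1; i > 0; --i) {
        expand(acc, bins[i].bounds); n += bins[i].count;
        if (leftCount[i - 1] == 0 || n == 0) continue;  // skip empty sides
        const float cost = surfaceArea(leftBox[i - 1]) * leftCount[i - 1]
                         + surfaceArea(acc) * n;        // up to constant factors
        if (cost < bestCost) { bestCost = cost; *bestBoundary = i; }
    }
    return bestCost;
}
```

    Because only bin boundaries are considered, the evaluation is approximate, which is exactly the trade-off the abstract describes: construction that is roughly an order of magnitude faster while the resulting trees keep nearly the same quality for ray tracing.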

    Foundations and Methods for GPU based Image Synthesis

    Effects such as global illumination, caustics, defocus and motion blur are an integral part of generating images that are perceived as realistic pictures and cannot be distinguished from photographs. In general, two different approaches exist to render images: ray tracing and rasterization. Ray tracing is a widely used technique for production-quality rendering of images, where image quality and physical correctness are more important than the time needed for rendering. Generating these effects is a very compute- and memory-intensive process and can take minutes to hours for a single camera shot. Rasterization, on the other hand, is used to render images if real-time constraints have to be met (e.g. computer games). Often specialized algorithms are used to approximate these complex effects to achieve plausible results while sacrificing image quality for performance. This thesis is split into two parts. In the first part we look at algorithms and load-balancing schemes for general-purpose computing on graphics processing units (GPUs). Most of the ray-tracing-related algorithms (e.g. KD-tree construction or bidirectional path tracing) have unpredictable memory requirements. Dynamic memory allocation on GPUs suffers from the global synchronization required to keep track of the state of current allocations. We present a method to reduce this overhead on massively parallel hardware architectures. In particular, we merge small parallel allocation requests from different threads that can occur while exploiting SIMD-style parallelism. We speed up the dynamic allocation using a set of constraints that can be applied to a large class of parallel algorithms. To achieve the image quality needed for feature films, GPU clusters are often used to cope with the amount of computation needed. We present a framework that employs a dynamic load-balancing approach and applies fair scheduling to minimize the average execution time of spawned computational tasks. The load-balancing capabilities are shown by handling irregular workloads: a bidirectional path tracer allowing renderings of complex effects at near-interactive frame rates. In the second part of the thesis we try to reduce the image-quality gap between production and real-time rendering. To this end, an adaptive acceleration structure for screen-space ray tracing is presented that represents the scene geometry by planar approximations. The benefit is a fast method to skip empty space and compute exact intersection points based on the planar approximation. This technique allows simulating complex phenomena, including depth-of-field rendering and ray-traced reflections, at real-time frame rates. To handle motion blur in combination with transparent objects, we present a unified rendering approach that decouples space and time sampling. Thereby, we can achieve interactive frame rates by reusing fragments during the sampling step. The scene geometry that is potentially visible at any point in time for the duration of a frame is rendered in a rasterization step and stored in temporally varying fragments. We perform spatial sampling to determine all temporally varying fragments that intersect with a specific viewing ray at any point in time. Viewing rays can be sampled according to the lens uv-sampling to incorporate depth of field. In a final temporal sampling step, we evaluate the pre-determined viewing ray/fragment intersections for one or multiple points in time.
This allows incorporating standard shading effects and results in physically plausible motion and defocus blur for transparent and opaque objects.
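    The allocator idea from the first part of the thesis, merging small allocation requests issued by threads of the same SIMD group so that they cost a single global atomic, can be sketched with warp-level CUDA intrinsics as below. The bump-pointer pool, the function names, and the absence of out-of-memory handling are illustrative assumptions; the thesis's allocator applies further constraints and a different backend.

```cuda
#include <cstddef>

// Illustrative global bump-pointer pool standing in for a real allocator
// backend; g_poolBase must be set from the host (e.g. to a cudaMalloc'd
// buffer) and g_poolCursor reset to 0 before use.
__device__ char*              g_poolBase   = nullptr;
__device__ unsigned long long g_poolCursor = 0;

// Warp-aggregated allocation: the currently active threads of a warp combine
// their requests into a single atomicAdd on the pool cursor, then each thread
// carves its own slice out of the block returned to the warp. Assumes a 1D
// thread block for the lane computation.
__device__ void* warpMergedAlloc(size_t bytes)
{
    const unsigned mask   = __activemask();       // lanes allocating together
    const unsigned lane   = threadIdx.x & 31;
    const unsigned leader = __ffs(mask) - 1;      // lowest active lane

    // Exclusive prefix of the request sizes within the warp, plus the total.
    const unsigned long long mySize = bytes;
    unsigned long long prefix = 0, total = 0;
    for (unsigned src = 0; src < 32; ++src) {
        if ((mask >> src) & 1u) {                 // uniform across active lanes
            unsigned long long v = __shfl_sync(mask, mySize, src);
            if (src < lane) prefix += v;
            total += v;
        }
    }

    // One atomic per warp reserves a contiguous block for all requests.
    unsigned long long base = 0;
    if (lane == leader)
        base = atomicAdd(&g_poolCursor, total);
    base = __shfl_sync(mask, base, leader);

    return g_poolBase + base + prefix;            // this thread's slice
}
```

    Threads that call warpMergedAlloc() in the same warp and at the same program point then issue one atomicAdd on the pool cursor instead of up to 32, which is the kind of global-synchronization overhead the abstract refers to.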