205 research outputs found

    A sparse octree gravitational N-body code that runs entirely on the GPU processor

    Get PDF
    We present parallel algorithms for constructing and traversing sparse octrees on graphics processing units (GPUs). The algorithms are based on parallel-scan and sort methods. To test the performance and feasibility, we implemented them in CUDA in the form of a gravitational tree-code which completely runs on the GPU.(The code is publicly available at: http://castle.strw.leidenuniv.nl/software.html) The tree construction and traverse algorithms are portable to many-core devices which have support for CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor 20 overall, resulting in a processing rate of more than 2.8 million particles per second.Comment: Accepted version. Published in Journal of Computational Physics. 35 pages, 12 figures, single colum

    Lichttransportsimulation auf Spezialhardware

    Get PDF
    It cannot be denied that the developments in computer hardware and in computer algorithms strongly influence each other, with new instructions added to help with video processing, encryption, and in many other areas. At the same time, the current cap on single threaded performance and wide availability of multi-threaded processors has increased the focus on parallel algorithms. Both influences are extremely prominent in computer graphics, where the gaming and movie industries always strive for the best possible performance on the current, as well as future, hardware. In this thesis we examine the hardware-algorithm synergies in the context of ray tracing and Monte-Carlo algorithms. First, we focus on the very basic element of all such algorithms - the casting of rays through a scene, and propose a dedicated hardware unit to accelerate this common operation. Then, we examine existing and novel implementations of many Monte-Carlo rendering algorithms on massively parallel hardware, as full hardware utilization is essential for peak performance. Lastly, we present an algorithm for tackling complex interreflections of glossy materials, which is designed to utilize both powerful processing units present in almost all current computers: the Centeral Processing Unit (CPU) and the Graphics Processing Unit (GPU). These three pieces combined show that it is always important to look at hardware-algorithm mapping on all levels of abstraction: instruction, processor, and machine.Zweifelsohne beeinflussen sich Computerhardware und Computeralgorithmen gegenseitig in ihrer Entwicklung: Prozessoren bekommen neue Instruktionen, um zum Beispiel Videoverarbeitung, Verschlüsselung oder andere Anwendungen zu beschleunigen. Gleichzeitig verstärkt sich der Fokus auf parallele Algorithmen, bedingt durch die limitierte Leistung von für einzelne Threads und die inzwischen breite Verfügbarkeit von multi-threaded Prozessoren. Beide Einflüsse sind im Grafikbereich besonders stark , wo es z.B. für die Spiele- und Filmindustrie wichtig ist, die bestmögliche Leistung zu erreichen, sowohl auf derzeitiger und zukünftiger Hardware. In Rahmen dieser Arbeit untersuchen wir die Synergie von Hardware und Algorithmen anhand von Ray-Tracing- und Monte-Carlo-Algorithmen. Zuerst betrachten wir einen grundlegenden Hardware-Bausteins für alle diese Algorithmen, die Strahlenverfolgung in einer Szene, und präsentieren eine spezielle Hardware-Einheit zur deren Beschleunigung. Anschließend untersuchen wir existierende und neue Implementierungen verschiedener MonteCarlo-Algorithmen auf massiv-paralleler Hardware, wobei die maximale Auslastung der Hardware im Fokus steht. Abschließend stellen wir dann einen Algorithmus zur Berechnung von komplexen Beleuchtungseffekten bei glänzenden Materialien vor, der versucht, die heute fast überall vorhandene Kombination aus Hauptprozessor (CPU) und Grafikprozessor (GPU) optimal auszunutzen. Zusammen zeigen diese drei Aspekte der Arbeit, wie wichtig es ist, Hardware und Algorithmen auf allen Ebenen gleichzeitig zu betrachten: Auf den Ebenen einzelner Instruktionen, eines Prozessors bzw. eines gesamten Systems

    Improving GPU SIMD Control Flow Efficiency via Hybrid Warp Size Mechanism

    Get PDF
    High single instruction multiple data (SIMD) efficiency and low power consumption have made graphic processing units (GPUs) an ideal platform for many complex computational applications. Thousands of threads can be created by programmers and grouped into fixed-size SIMD batches, known as warps. High throughput is then achieved by concurrently executing such warps with minimal control overhead. However, if a branch instruction occurs, which assigns different paths to different threads, this warp will be broken into multiple warps that have to be executed serially, consequently reducing the efficiency advantage of SIMD. In this thesis, the contemporary fixed-size warp design is abandoned and a hybrid warp size (HWS) mechanism is proposed. Mixed-size warps are generated according to HWS and are scheduled and issued flexibly. Once a branch divergence occurs, split warps are squeezed according to the proposed algorithm, and warp sizes are downscaled wherever applicable. Based on updated warp sizes, warp schedulers calculate the number of cycles the current warp needs and issue the next warp accordingly. As a result, hybrid warps are pushed into pipelines as soon as possible and more pipeline stages are overlapped. The simulation results show that this mechanism yields an average speedup of 1.20 over the baseline architecture for a wide variety of general purpose GPU applications. This work also integrates HWS with dynamic warp formation (DWF), which is a well-known branch handling mechanism aimed at improving SIMD utilization by forming new warps out of split warps in real time. The warp forming policy is modified to better tolerate warp conflicts. Also, squeeze operations are added before a warp merges with other warps. The simulation shows that the combination of DWF and HWS generates an average speedup of 1.27 over the DWF-only platform for the same set of GPU benchmarks

    Faster Ray Tracing through Hierarchy Cut Code

    Full text link
    We propose a novel ray reordering technique to accelerate the ray tracing process by encoding and sorting rays prior to traversal. Instead of spatial coordinates, our method encodes rays according to the cuts of the hierarchical acceleration structure, which is called the hierarchy cut code. This approach can better adapt to the acceleration structure and obtain a more reliable encoding result. We also propose a compression scheme to decrease the sorting overhead by a shorter sorting key. In addition, based on the phenomenon of boundary drift, we theoretically explain the reason why existing reordering methods cannot achieve better performance by using longer sorting keys. The experiment demonstrates that our method can accelerate secondary ray tracing by up to 1.81 times, outperforming the existing methods. Such result proves the effectiveness of hierarchy cut code, and indicate that the reordering technique can achieve greater performance improvement, which worth further research

    Parallel marching blocks: a practical isosurfacing algorithm for large data on many-core architectures

    Get PDF
    Interactive isosurface visualisation has been made possible by mapping algorithms to GPU architectures. However, current state-of-the-art isosurfacing algorithms usually consume large amounts of GPU memory owing to the additional acceleration structures they require. As a result, the continued limitations on available GPU memory mean that they are unable to deal with the larger datasets that are now increasingly becoming prevalent. This paper proposes a new parallel isosurface-extraction algorithm that exploits the blocked organisation of the parallel threads found in modern many-core platforms to achieve fast isosurface extraction and reduce the associated memory requirements. This is achieved by optimising thread co-operation within thread-blocks and reducing redundant computation; ultimately, an indexed triangular mesh could be produced. Experiments have shown that the proposed algorithm is much faster (up to 10×) than state-of-the-art GPU algorithms and has a much smaller memory footprint, enabling it to handle much larger datasets (up to 64×) on the same GPU.
    corecore