12 research outputs found

    Generating renderers

    Get PDF
    Most production renderers developed for the film industry are huge pieces of software that are able to render extremely complex scenes. Unfortunately, they are implemented using the currently available programming models that are not well suited to modern computing hardware like CPUs with vector units or GPUs. Thus, they have to deal with the added complexity of expressing parallelism and using hardware features in those models. Since compilers cannot alone optimize and generate efficient programs for any type of hardware, because of the large optimization spaces and the complexity of the underlying compiler problems, programmers have to rely on compiler-specific hardware intrinsics or write non-portable code. The consequence of these limitations is that programmers resort to writing the same code twice when they need to port their algorithm on a different architecture, and that the code itself becomes difficult to maintain, as algorithmic details are buried under hardware details. Thankfully, there are solutions to this problem, taking the form of Domain-Specific Lan- guages. As their name suggests, these languages are tailored for one domain, and compilers can therefore use domain-specific knowledge to optimize algorithms and choose the best execution policy for a given target hardware. In this thesis, we opt for another way of encoding domain- specific knowledge: We implement a generic, high-level, and declarative rendering and traversal library in a functional language, and later refine it for a target machine by providing partial evaluation annotations. The partial evaluator then specializes the entire renderer according to the available knowledge of the scene: Shaders are specialized when their inputs are known, and in general, all redundant computations are eliminated. Our results show that the generated renderers are faster and more portable than renderers written with state-of-the-art competing libraries, and that in comparison, our rendering library requires less implementation effort.Die meisten in der Filmindustrie zum Einsatz kommenden Renderer sind riesige Softwaresysteme, die in der Lage sind, extrem aufwendige Szenen zu rendern. Leider sind diese mit den aktuell verfügbaren Programmiermodellen implementiert, welche nicht gut geeignet sind für moderne Rechenhardware wie CPUs mit Vektoreinheiten oder GPUs. Deshalb müssen Entwickler sich mit der zusätzlichen Komplexität auseinandersetzen, Parallelismus und Hardwarefunktionen in diesen Programmiermodellen auszudrücken. Da Compiler nicht selbständig optimieren und effiziente Programme für jeglichen Typ Hardware generieren können, wegen des großen Optimierungsraumes und der Komplexität des unterliegenden Kompilierungsproblems, müssen Programmierer auf Compiler-spezifische Hardware-“Intrinsics” zurückgreifen, oder nicht portierbaren Code schreiben. Die Konsequenzen dieser Limitierungen sind, dass Programmierer darauf zurückgreifen den gleichen Code zweimal zu schreiben, wenn sie ihre Algorithmen für eine andere Architektur portieren müssen, und dass der Code selbst schwer zu warten wird, da algorithmische Details unter Hardwaredetails verloren gehen. Glücklicherweise gibt es Lösungen für dieses Problem, in der Form von DSLs. Diese Sprachen sind maßgeschneidert für eine Domäne und Compiler können deshalb Domänenspezifisches Wissen nutzen, um Algorithmen zu optimieren und die beste Ausführungsstrategie für eine gegebene Zielhardware zu wählen. In dieser Dissertation wählen wir einen anderen Weg, Domänenspezifisches Wissen zu enkodieren: Wir implementieren eine generische, high-level und deklarative Rendering- und Traversierungsbibliothek in einer funktionalen Programmiersprache, und verfeinern sie später für eine Zielmaschine durch Bereitstellung von Annotationen für die partielle Auswertung. Der “Partial Evaluator” spezialisiert dann den kompletten Renderer, basierend auf dem verfügbaren Wissen über die Szene: Shader werden spezialisiert, wenn ihre Eingaben bekannt sind, und generell werden alle redundanten Berechnungen eliminiert. Unsere Ergebnisse zeigen, dass die generierten Renderer schneller und portierbarer sind, als Renderer geschrieben mit den aktuellen Techniken konkurrierender Bibliotheken und dass, im Vergleich, unsere Rendering Bibliothek weniger Implementierungsaufwand erfordert.This work was supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca and ProThOS projects as well as by the Intel Visual Computing Institute (IVCI) and Cluster of Excellence on Multimodal Computing and Interaction (MMCI) at Saarland University. Parts of it were also co-funded by the European Union(EU), as part of the Dreamspace project

    Lichttransportsimulation auf Spezialhardware

    Get PDF
    It cannot be denied that the developments in computer hardware and in computer algorithms strongly influence each other, with new instructions added to help with video processing, encryption, and in many other areas. At the same time, the current cap on single threaded performance and wide availability of multi-threaded processors has increased the focus on parallel algorithms. Both influences are extremely prominent in computer graphics, where the gaming and movie industries always strive for the best possible performance on the current, as well as future, hardware. In this thesis we examine the hardware-algorithm synergies in the context of ray tracing and Monte-Carlo algorithms. First, we focus on the very basic element of all such algorithms - the casting of rays through a scene, and propose a dedicated hardware unit to accelerate this common operation. Then, we examine existing and novel implementations of many Monte-Carlo rendering algorithms on massively parallel hardware, as full hardware utilization is essential for peak performance. Lastly, we present an algorithm for tackling complex interreflections of glossy materials, which is designed to utilize both powerful processing units present in almost all current computers: the Centeral Processing Unit (CPU) and the Graphics Processing Unit (GPU). These three pieces combined show that it is always important to look at hardware-algorithm mapping on all levels of abstraction: instruction, processor, and machine.Zweifelsohne beeinflussen sich Computerhardware und Computeralgorithmen gegenseitig in ihrer Entwicklung: Prozessoren bekommen neue Instruktionen, um zum Beispiel Videoverarbeitung, Verschlüsselung oder andere Anwendungen zu beschleunigen. Gleichzeitig verstärkt sich der Fokus auf parallele Algorithmen, bedingt durch die limitierte Leistung von für einzelne Threads und die inzwischen breite Verfügbarkeit von multi-threaded Prozessoren. Beide Einflüsse sind im Grafikbereich besonders stark , wo es z.B. für die Spiele- und Filmindustrie wichtig ist, die bestmögliche Leistung zu erreichen, sowohl auf derzeitiger und zukünftiger Hardware. In Rahmen dieser Arbeit untersuchen wir die Synergie von Hardware und Algorithmen anhand von Ray-Tracing- und Monte-Carlo-Algorithmen. Zuerst betrachten wir einen grundlegenden Hardware-Bausteins für alle diese Algorithmen, die Strahlenverfolgung in einer Szene, und präsentieren eine spezielle Hardware-Einheit zur deren Beschleunigung. Anschließend untersuchen wir existierende und neue Implementierungen verschiedener MonteCarlo-Algorithmen auf massiv-paralleler Hardware, wobei die maximale Auslastung der Hardware im Fokus steht. Abschließend stellen wir dann einen Algorithmus zur Berechnung von komplexen Beleuchtungseffekten bei glänzenden Materialien vor, der versucht, die heute fast überall vorhandene Kombination aus Hauptprozessor (CPU) und Grafikprozessor (GPU) optimal auszunutzen. Zusammen zeigen diese drei Aspekte der Arbeit, wie wichtig es ist, Hardware und Algorithmen auf allen Ebenen gleichzeitig zu betrachten: Auf den Ebenen einzelner Instruktionen, eines Prozessors bzw. eines gesamten Systems

    Higher Performance Traversal and Construction of Tree-Based Raytracing Acceleration Structures

    Get PDF
    Ray tracing is an important computational primitive used in different algorithms including collision detection, line-of-sight computations, ray tracing-based sound propagation, and most prominently light transport algorithms. It computes the closest intersections for a given set of rays and geometry. The geometry is usually modeled with a set of geometric primitives such as triangles or quadrangles which define a scene. An efficient ray tracing implementation needs to rely on an acceleration structure to decouple ray tracing complexity from scene complexity as far as possible. The most common ray tracing acceleration structures are kd-trees and bounding volume hierarchies (BVHs) which have an O(log n) ray tracing complexity in the number of scene primitives. Both structures offer similar ray tracing performance in practice. This thesis presents theoretical insights and practical approaches for higher quality, improved graphics processing unit (GPU) ray tracing performance, and faster construction of BVHs and kd-trees, where the focus is on BVHs. The chosen construction strategy for BVHs and kd-trees has a significant impact on final ray tracing performance. The most common measure for the quality of BVHs and kd-trees is the surface area metric (SAM). Using assumptions on the distribution of ray origins and directions the SAM gives an approximation for the cost of traversing an acceleration structure without having to trace a single ray. High quality construction algorithms aim at reducing the SAM cost. The most widespread high quality greedy plane-sweep algorithm applies the surface area heuristic (SAH) which is a simplification of the SAM. Advances in research on quality metrics for BVHs have shown that greedy SAH-based plane-sweep builders often construct BVHs with superior traversal performance despite the fact that the resulting SAM costs are higher than those created by more sophisticated builders. Motivated by this observation we examine different construction algorithms that use the SAM cost of temporarily constructed SAH-built BVHs to guide the construction to higher quality BVHs. An extensive evaluation reveals that the resulting BVHs indeed achieve significantly higher trace performance for primary and secondary diffuse rays compared to BVHs constructed with standard plane-sweeping. Compared to the Spatial-BVH, a kd-tree/BVH hybrid, we still achieve an acceptable increase in performance. We show that the proposed algorithm has subquadratic computational complexity in the number of primitives, which renders it usable in practical applications. An alternative construction algorithm to the plane-sweep BVH builder is agglomerative clustering, which constructs BVHs in a bottom-up fashion. It clusters primitives with a SAM-inspired heuristic and gives mixed quality BVHs compared to standard plane-sweeping construction. While related work only focused on the construction speed of this algorithm we examine clustering heuristics, which aim at higher hierarchy quality. We propose a fully SAM-based clustering heuristic which on average produces better performing BVHs compared to original agglomerative clustering. The definitions of SAM and SAH are based on assumptions on the distribution of ray origins and directions to define a conditional geometric probability for intersecting nodes in kd-trees and BVHs. We analyze the probability function definition and show that the assumptions allow for an alternative probability definition. Unlike the conventional probability, our definition accounts for directional variation in the likelihood of intersecting objects from different directions. While the new probability does not result in improved practical tracing performance, we are able to provide an interesting insight on the conventional probability. We show that the conventional probability function is directly linked to our examined probability function and can be interpreted as covertly accounting for directional variation. The path tracing light transport algorithm can require tracing of billions of rays. Thus, it can pay off to construct high quality acceleration structures to reduce the ray tracing cost of each ray. At the same time, the arising number of trace operations offers a tremendous amount of data parallelism. With CPUs moving towards many-core architectures and GPUs becoming more general purpose architectures, path tracing can now be well parallelized on commodity hardware. While parallelization is trivial in theory, properties of real hardware make efficient parallelization difficult, especially when tracing so called incoherent rays. These rays cause execution flow divergence, which reduces efficiency of SIMD-based parallelism and memory read efficiency due to incoherent memory access. We investigate how different BVH and node memory layouts as well as storing the BVH in different memory areas impacts the ray tracing performance of a GPU path tracer. We also optimize the BVH layout using information gathered in a pre-processing pass by applying a number of different BVH reordering techniques. This results in increased ray tracing performance. Our final contribution is in the field of fast high quality BVH and kd-tree construction. Increased quality usually comes at the cost of higher construction time. To reduce construction time several algorithms have been proposed to construct acceleration structures in parallel on GPUs. These are able to perform full rebuilds in realtime for moderate scene sizes if all data completely fits into GPU memory. The sheer amount of data arising from geometric detail used in production rendering makes construction on GPUs, however, infeasible due to GPU memory limitations. Existing out-of-core GPU approaches perform hybrid bottom-up top-down construction which suffers from reduced acceleration structure quality in the critical upper levels of the tree. We present an out-of-core multi-GPU approach for full top-down SAH-based BVH and kd-tree construction, which is designed to work on larger scenes than conventional approaches and yields high quality trees. The algorithm is evaluated for scenes consisting of up to 1 billion triangles and performance scales with an increasing number of GPUs

    An evaluation of the GAMA/StarPU frameworks for heterogeneous platforms : the progressive photon mapping algorithm

    Get PDF
    Dissertação de mestrado em Engenharia InformáticaRecent evolution of high performance computing moved towards heterogeneous platforms: multiple devices with different architectures, characteristics and programming models, share application workloads. To aid the programmer to efficiently explore these heterogeneous platforms several frameworks have been under development. These dynamically manage the available computing resources through workload scheduling and data distribution, dealing with the inherent difficulties of different programming models and memory accesses. Among other frameworks, these include GAMA and StarPU. The GAMA framework aims to unify the multiple execution and memory models of each different device in a computer system, into a single, hardware agnostic model. It was designed to efficiently manage resources with both regular and irregular applications, and currently only supports conventional CPU devices and CUDA-enabled accelerators. StarPU has similar goals and features with a wider user based community, but it lacks a single programming model. The main goal of this dissertation was an in-depth evaluation of a heterogeneous framework using a complex application as a case study. GAMA provided the starting vehicle for training, while StarPU was the selected framework for a thorough evaluation. The progressive photon mapping irregular algorithm was the selected case study. The evaluation goal was to assert the StarPU effectiveness with a robust irregular application, and make a high-level comparison with the still under development GAMA, to provide some guidelines for GAMA improvement. Results show that two main factors contribute to the performance of applications written with StarPU: the consideration of data transfers in the performance model, and chosen scheduler. The study also allowed some caveats to be found within the StarPU API. Although this have no effect on performance, they present a challenge for new coming developers. Both these analysis resulted in a better understanding of the framework, and a comparative analysis with GAMA could be made, pointing out the aspects where GAMA could be further improved upon.A recente evolução da computação de alto desempenho é em direção ao uso de plataformas heterogéneas: múltiplos dispositivos com diferentes arquiteturas, características e modelos de programação, partilhando a carga computacional das aplicações. De modo a ajudar o programador a explorar eficientemente estas plataformas, várias frameworks têm sido desenvolvidas. Estas frameworks gerem os recursos computacionais disponíveis, tratando das dificuldades inerentes dos diferentes modelos de programação e acessos à memória. Entre outras frameworks, estas incluem o GAMA e o StarPU. O GAMA tem o objetivo de unificar os múltiplos modelos de execução e memória de cada dispositivo diferente num sistema computacional, transformando-os num único modelo, independente do hardware utilizado. A framework foi desenhada de forma a gerir eficientemente os recursos, tanto para aplicações regulares como irregulares, e atualmente suporta apenas CPUs convencionais e aceleradores CUDA. O StarPU tem objetivos e funcionalidades idênticos, e também uma comunidade mais alargada, mas não possui um modelo de programação único O objetivo principal desta dissertação foi uma avaliação profunda de uma framework heterogénea, usando uma aplicação complexa como caso de estudo. O GAMA serviu como ponto de partida para treino e ambientação, enquanto que o StarPU foi a framework selecionada para uma avaliação mais profunda. O algoritmo irregular de progressive photon mapping foi o caso de estudo escolhido. O objetivo da avaliação foi determinar a eficácia do StarPU com uma aplicação robusta, e fazer uma análise de alto nível com o GAMA, que ainda está em desenvolvimento, para forma a providenciar algumas sugestões para o seu melhoramento. Os resultados mostram que são dois os principais factores que contribuem para a performance de aplicação escritas com auxílio do StarPU: a avaliação dos tempos de transferência de dados no modelo de performance, e a escolha do escalonador. O estudo permitiu também avaliar algumas lacunas na API do StarPU. Embora estas não tenham efeitos visíveis na eficiencia da framework, eles tornam-se um desafio para recém-chegados ao StarPU. Ambas estas análisos resultaram numa melhor compreensão da framework, e numa análise comparativa com o GAMA, onde são apontados os possíveis aspectos que o este tem a melhorar.Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portuga

    Generalized database index structures on massively parallel processor architectures

    Get PDF
    Height-balanced search trees are ubiquitous in database management systems as well as in other applications that require efficient access methods in order to identify entries in large data volumes. They can be configured with various strategies for structuring the search space for a given data set and for pruning it when different kinds of search queries are answered. In order to facilitate the development of application-specific tree variants, index frameworks, such as GiST, exist that provide a reusable library of commonly shared tree management functionality. By specializing internal data organization strategies, the framework can be customized to create an index that is efficient for an application's data access characteristics. Because the majority of the framework's code can be reused development and testing efforts are significantly lower, compared to an implementation from scratch. However, none of the existing frameworks supports the execution of index operations on massively parallel processor architectures, such as GPUs. Enabling the use of such processors for generalized index frameworks is the goal of this thesis. By compiling state-of-the-art techniques from a wide range of CPU- and GPU-optimized indexes, a GiST extension is developed that abstracts the physical execution aspect of generic, tree-based search queries. Tree traversals are broken-down into vectorized processing primitives that can be scheduled to one of the available (co-)processors for execution. Further, a CPU-based implementation is provided as well as a new GPU-based algorithm that, unlike prior art in this area, does not require that the index is fully stored inside a GPU's main memory buffer. The applicability of the extended framework is assessed for image rendering engines and, based on microbenchmarks, the parallelized algorithm performance is compared for different CPU and GPU generations. It will be shown that cases exist, where the GPU clearly outperforms the CPU and vice versa. In order to leverage the strengths of each processor type, an adaptive scheduler is presented that can be calibrated to schedule index operations to the best-fitting device in a hybrid system. With the help of a tree traversal simulation different scheduling strategies are evaluated and it will be shown that the adaptive scheduler can be used to make near-optimal decisions.Suchbäume sind allgegenwärtig in Datenbanksystemen und anderen Anwendungen, die eine effiziente Möglichkeit benötigen um in großen Datensätzen nach Einträgen zu suchen, die bestimmte Suchkriterien erfüllen. Sie können mit verschiedenen Strategien konfiguriert werden um den Suchraum zu strukturieren und die für ein Suchergebnis irrelevante Bereiche von der Bearbeitung auszuschließen. Die Entwicklung von anwendungsspezifischen Indexen wird durch Frameworks wie GiST unterstützt. Jedoch unterstützt keines der heute bereits existierenden Frameworks die Verwendung von hochgradig parallelen Prozessorarchitekturen wie GPUs. Solche Prozessoren für generische Index Frameworks nutzbar zu machen, ist Ziel dieser Arbeit. Dazu werden Techniken aus verschiedensten CPU- und GPU-optimierten Indexen analysiert und für die Entwicklung einer GiST-Erweiterung verwendet, welche die für eine Suche in Suchbäumen nötigen Berechnungen abstrahiert. Traversierungsoperationen werden dabei auf vektorisierte Primitive abgebildet, die auf parallelen Prozessoren implementiert werden können. Die Verwendung dieser Erweiterung wird beispielhaft an einem CPU Algorithmus demonstriert. Weiterhin wird ein neuer GPU-basierter Algorithmus vorgestellt, der im Vergleich zu bisherigen Verfahren, ein dynamisches Nachladen der Index Daten in den Hauptspeicher der GPU unterstützt. Die Praktikabilität des erweiterten Frameworks wird am Beispiel von Anwendungen aus der Computergrafik untersucht und die Performanz der verwendeten Algorithmen mit Hilfe eines Benchmarks auf verschiedenen CPU- und GPU-Modellen analysiert. Dabei wird gezeigt, unter welchen Bedingungen die parallele GPU-basierte Ausführung schneller ist als die CPU-basierte Variante - und umgekehrt. Um die Stärken beider Prozessortypen in einem hybriden System ausnutzen zu können, wird ein Scheduler entwickelt, der nach einer Kalibrierungsphase für eine gegebene Operation den geeignetsten Prozessor wählen kann. Mit Hilfe eines Simulators für Baumtraversierungen werden verschiedenste Scheduling Strategien verglichen. Dabei wird gezeigt, dass die Entscheidungen des Schedulers kaum vom Optimum abweichen und, abhängig von der simulierten Last, die erzielbaren Durchsätze für die parallele Ausführung mehrerer Suchoperationen durch hybrides Scheduling um eine Größenordnung und mehr erhöht werden können
    corecore