302 research outputs found

    Gigavoxels: ray-guided streaming for efficient and detailed voxel rendering

    Figure 1: Images show volume data consisting of billions of voxels rendered with our dynamic sparse octree approach. Our algorithm achieves real-time to interactive rates on volumes that far exceed GPU memory capacity, thanks to efficient streaming based on a ray-casting solution. In essence, the volume is only used at the resolution needed to produce the final image. Besides the gain in memory and speed, our rendering is inherently anti-aliased. We propose a new approach to efficiently render large volumetric data sets. The system achieves interactive to real-time rendering performance for several billion voxels. Our solution is based on an adaptive data representation that depends on the current view and occlusion information, coupled with an efficient ray-casting rendering algorithm. One key element of our method is to guide data production and streaming directly based on information extracted during rendering. Our data structure exploits the fact that in CG scenes, details are often concentrated on the interface between free space and clusters of density, and shows that volumetric models might become a valuable alternative as a rendering primitive for real-time applications. In this spirit, we allow a quality/performance trade-off and exploit temporal coherence. We also introduce a mipmapping-like process that allows for an increased display rate and better quality through high-quality filtering. To further enrich the data set, we create additional details through a variety of procedural methods. We demonstrate our approach in several scenarios, such as the exploration of a 3D scan (8192³ resolution), of hypertextured meshes (16384³ virtual resolution), or of a fractal (theoretically infinite resolution). All examples are rendered on current-generation hardware at 20-90 fps and respect the limited GPU memory budget. This is the author’s version of the paper. The final version has been published in the I3D 2009 conference proceedings.

    Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs

    © 2023 Copyright held by the owner/author(s). This document is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ This document is the Accepted version of a Published Work that appeared in final form in the 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Vienna, Austria, October 2023. To access the final edited and published work see https://doi.org/10.1109/PACT58117.2023.00019 The literature is rich in works exploiting cache locality for GPUs. A majority of them explore replacement or bypassing policies. In this paper, however, we go beyond this exploration by constructing a formal proof for a no-overhead quasi-optimal technique for caching textures in graphics workloads. Textures make up a significant part of main memory traffic in mobile GPUs, which contributes to the total GPU energy consumption. Since texture accesses use a shared L2 cache, improving the L2 texture caching efficiency would decrease main memory traffic, thus improving energy efficiency, which is crucial for mobile GPUs. Our proposal reaches quasi-optimality by exploiting the frame-to-frame reuse of textures in graphics. We do this by traversing frames in a boustrophedonic manner w.r.t. the frame-to-frame tile order. We first approximate the texture access trace to a circular trace and then derive a formal proof that our proposal is optimal for such traces. We also complement the proof with empirical data that demonstrates the quasi-optimality of our no-cost proposal.
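
    The effect described in this abstract can be illustrated with a toy model. The sketch below is not the authors' implementation: it assumes a plain LRU cache, treats each tile as touching exactly one texture block, and uses hypothetical helper names (`lru_hits`, `frame_trace`). It only shows why reversing the tile order on alternate frames helps when the per-frame working set is larger than the cache.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count hits for an access trace on an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[item] = True
    return hits


def frame_trace(num_tiles, num_frames, boustrophedonic):
    """Build a toy trace: every frame touches tiles 0..num_tiles-1 once.

    Assumption: one texture block per tile. With boustrophedonic=True,
    odd-numbered frames visit the tiles in reverse order.
    """
    trace = []
    for frame in range(num_frames):
        order = range(num_tiles)
        if boustrophedonic and frame % 2 == 1:
            order = reversed(order)
        trace.extend(order)
    return trace


# 8 tiles per frame, 4 frames, cache holds only 4 blocks.
same_order = lru_hits(frame_trace(8, 4, False), capacity=4)
alternating = lru_hits(frame_trace(8, 4, True), capacity=4)
print(same_order, alternating)
```

    In this toy setting the repeated same-order (circular) trace never hits, because each block is evicted before the next frame reaches it, while the alternating order reuses the blocks left in the cache at the end of the previous frame.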

    Interactive Out-of-core Visualization of Very Large Landscapes on Commodity Graphics Platforms

    We recently introduced an efficient technique for out-of-core rendering and management of large textured landscapes. The technique, called Batched Dynamic Adaptive Meshes (BDAM), is based on a paired tree structure: a tiled quadtree for texture data and a pair of bintrees of small triangular patches for the geometry. These small patches are TINs that are constructed and optimized off-line with high-quality simplification and tristripping algorithms. Hierarchical view frustum culling and view-dependent texture/geometry refinement are performed at each frame with a stateless traversal algorithm that renders a continuous adaptive terrain surface by assembling out-of-core data. Thanks to the batched CPU/GPU communication model, the proposed technique is not processor intensive and fully harnesses the power of current graphics hardware. This paper summarizes the method and discusses the results obtained in a virtual flythrough over a textured digital landscape derived from aerial imaging.

    Real-time per-face texture mapping on the GPU

    Work carried out within the framework of a mobility programme with the Karlsruhe Institute of Technology (KIT - University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association). This work implements a real-time per-face texturing method that works with "out-of-core textures": textures so large that they exceed the memory capacity of the GPU. To this end, several caching strategies were tested to see which of them performed best.

    Holistic Performance Analysis and Optimization of Unified Virtual Memory

    The programming difficulty of creating GPU-accelerated high performance computing (HPC) codes has been greatly reduced by the advent of Unified Memory technologies that abstract the management of physical memory away from the developer. However, these systems incur substantial overhead that paradoxically grows for codes where these technologies are most useful. While these technologies are increasingly adopted for use in modern HPC frameworks and applications, the performance cost reduces the efficiency of these systems and turns away some developers from adoption entirely. These systems are naturally difficult to optimize due to the large number of interconnected hardware and software components that must be untangled to perform thorough analysis. In this thesis, we take the first deep dive into a functional implementation of a Unified Memory system, NVIDIA UVM, to evaluate the performance and characteristics of these systems. We show specific hardware and software interactions that cause serialization between host and devices. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. Through lower-level analysis, we find that the driver workload is dependent on the interactions among application access patterns, GPU hardware constraints, and host OS components. These findings indicate that the cost of host OS components is significant and present across UM implementations. We also provide a proof-of-concept asynchronous approach to memory management in UVM that allows for reduced system overhead and improved application performance. This study provides constructive insight into future implementations and systems, such as Heterogeneous Memory Management.

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'11), San Antonio, Texas, USA, February 2011 (in conjunction with HPCA-17)

    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach.

    Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU

    Dark silicon is pushing processor vendors to add more specialized units such as accelerators to commodity processor chips. Unfortunately, this is done without enough care for security. In this paper we look at the security implications of integrated Graphical Processor Units (GPUs) found in almost all mobile processors. We demonstrate that GPUs, already widely employed to accelerate a variety of benign applications such as image rendering, can also be used to 'accelerate' microarchitectural attacks (i.e., making them more effective) on commodity platforms. In particular, we show that an attacker can build all the necessary primitives for performing effective GPU-based microarchitectural attacks and that these primitives are all exposed to the web through standardized browser extensions, allowing side-channel and Rowhammer attacks from JavaScript. These attacks bypass state-of-the-art mitigations and advance existing CPU-based attacks: we show the first end-to-end microarchitectural compromise of a browser running on a mobile phone in under two minutes by orchestrating our GPU primitives. While powerful, these GPU primitives are not easy to implement due to undocumented hardware features. We describe novel reverse engineering techniques for peeking into the previously unknown cache architecture and replacement policy of the Adreno 330, an integrated GPU found in many common mobile platforms. This information is necessary when building shader programs implementing our GPU primitives. We conclude by discussing mitigations against GPU-enabled attackers.

    The impact of cache misses on the performance of matrix product algorithms on multicore platforms

    The multicore revolution is underway, bringing new chips with more complex memory architectures. Classical algorithms must be revisited in order to take the hierarchical memory layout into account. In this paper, we aim at designing cache-aware algorithms that minimize the number of cache misses incurred during the execution of the matrix product kernel on a multicore processor. We analytically show how to achieve the best possible tradeoff between shared and distributed caches. We implement and evaluate several algorithms on two multicore platforms, one equipped with a quad-core Xeon, and the other additionally equipped with a GPU. It turns out that the impact of cache misses differs greatly between the two platforms, and we identify the main design parameters that lead to peak performance for each target hardware configuration.
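
    As a rough illustration of what "cache-aware" means for the matrix product, the sketch below shows the classic blocked (tiled) formulation. It is not one of the algorithms evaluated in the paper, it ignores the shared/distributed cache tradeoff the abstract mentions, and the function names and the `block` parameter are hypothetical. The idea it demonstrates is only that working on small sub-blocks lets operands be reused while they are still resident in cache.

```python
def naive_matmul(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(a, b, block):
    """Blocked matrix product.

    Assumption: 'block' is chosen so that three block-by-block tiles
    (one each from A, B and C) fit in the target cache level.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):              # tile rows of C
        for j0 in range(0, m, block):          # tile columns of C
            for k0 in range(0, k, block):      # tiles along the inner dimension
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = blocked_matmul(a, b, block=2)
print(result)
```

    The blocked version performs the same arithmetic as the reference loop; only the order of the memory accesses changes.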