3,249 research outputs found
Hardware Acceleration of Neural Graphics
Rendering and inverse-rendering algorithms that drive conventional computer
graphics have recently been superseded by neural representations (NR). NRs have
recently been used to learn the geometric and the material properties of the
scenes and use the information to synthesize photorealistic imagery, thereby
promising a replacement for traditional rendering algorithms with scalable
quality and predictable performance. In this work we ask the question: Does
neural graphics (NG) need hardware support? We studied representative NG
applications showing that, if we want to render 4k res. at 60FPS there is a gap
of 1.5X-55X in the desired performance on current GPUs. For AR/VR applications,
there is an even larger gap of 2-4 OOM between the desired performance and the
required system power. We identify that the input encoding and the MLP kernels
are the performance bottlenecks, consuming 72%,60% and 59% of application time
for multi res. hashgrid, multi res. densegrid and low res. densegrid encodings,
respectively. We propose a NG processing cluster, a scalable and flexible
hardware architecture that directly accelerates the input encoding and MLP
kernels through dedicated engines and supports a wide range of NG applications.
We also accelerate the rest of the kernels by fusing them together in Vulkan,
which leads to 9.94X kernel-level performance improvement compared to un-fused
implementation of the pre-processing and the post-processing kernels. Our
results show that, NGPC gives up to 58X end-to-end application-level
performance improvement, for multi res. hashgrid encoding on average across the
four NG applications, the performance benefits are 12X,20X,33X and 39X for the
scaling factor of 8,16,32 and 64, respectively. Our results show that with
multi res. hashgrid encoding, NGPC enables the rendering of 4k res. at 30FPS
for NeRF and 8k res. at 120FPS for all our other NG applications
MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators
Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent. That is, optimization preferred on one machine may be not suitable for another machine. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at generating new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies, in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) Perform automatic code translation between concurrency platforms targeting multi-core architectures. (2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and the pipelining parallelism. (3) Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block). (4) Optimize code depending on parameters that are unknown at compile-time
Data-Driven Shape Analysis and Processing
Data-driven methods play an increasingly important role in discovering
geometric, structural, and semantic relationships between 3D shapes in
collections, and applying this analysis to support intelligent modeling,
editing, and visualization of geometric data. In contrast to traditional
approaches, a key feature of data-driven approaches is that they aggregate
information from a collection of shapes to improve the analysis and processing
of individual shapes. In addition, they are able to learn models that reason
about properties and relationships of shapes without relying on hard-coded
rules or explicitly programmed instructions. We provide an overview of the main
concepts and components of these techniques, and discuss their application to
shape classification, segmentation, matching, reconstruction, modeling and
exploration, as well as scene analysis and synthesis, through reviewing the
literature and relating the existing works with both qualitative and numerical
comparisons. We conclude our report with ideas that can inspire future research
in data-driven shape analysis and processing.Comment: 10 pages, 19 figure
Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation
The most popular multithreaded languages based on the fork-join concurrency model (CIlkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP.
In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are treated as unknown symbols. Hence, these parameters need not to be known at code-generation-time: machine parameters and program parameters can be respectively determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile-time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). This generation of parametric CUDA kernels requires to deal with non-linear polynomial expressions during the dependence analysis and tiling phase. To achieve these algebraic calculations, we take advantage of techniques from computer algebra, in particular in the RegularChains library of Maple. Various illustrative examples are provided together with performance evaluation
- …