1,176 research outputs found

    HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork

    Neural Radiance Fields (NeRF) have become an increasingly popular representation for capturing high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of the network weight space. To address the limitations of existing work on generalization and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using a hypernetwork to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings, resulting in significant quality gains. To improve quality even further, we incorporate a denoise-and-finetune strategy that denoises images rendered from NeRFs estimated by the hypernetwork and finetunes the network while retaining multi-view consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks, including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating state-of-the-art results.
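    To make the idea concrete, here is a minimal PyTorch sketch of a hypernetwork that predicts both the NeRF MLP weights and the per-level hash-table features from an instance latent code. All names, layer sizes, and the tiny two-layer NeRF MLP are illustrative assumptions, not the authors' architecture.

```python
# Sketch only: a hypernetwork emitting both hash tables and MLP weights.
import torch
import torch.nn as nn

class HyperNeRF(nn.Module):
    def __init__(self, latent_dim=128, n_levels=4, table_size=2**14,
                 feat_dim=2, hidden=64):
        super().__init__()
        # One head per resolution level: predicts that level's full feature table.
        self.table_heads = nn.ModuleList(
            [nn.Linear(latent_dim, table_size * feat_dim) for _ in range(n_levels)]
        )
        # Heads predicting the flattened weight matrices of a tiny NeRF MLP.
        in_dim = n_levels * feat_dim
        self.w1_head = nn.Linear(latent_dim, in_dim * hidden)
        self.w2_head = nn.Linear(latent_dim, hidden * 4)  # outputs (R, G, B, sigma)
        self.table_size, self.feat_dim, self.hidden = table_size, feat_dim, hidden

    def forward(self, z):
        tables = [h(z).view(self.table_size, self.feat_dim) for h in self.table_heads]
        w1 = self.w1_head(z).view(self.hidden, -1)
        w2 = self.w2_head(z).view(4, self.hidden)
        return tables, (w1, w2)

z = torch.randn(128)                       # instance latent code
tables, (w1, w2) = HyperNeRF()(z)
feats = torch.cat([t[0] for t in tables])  # features for one (already hashed) point
rgb_sigma = w2 @ torch.relu(w1 @ feats)    # NeRF MLP applied with predicted weights
```

    Predicting the hash tables directly, rather than only the MLP weights, is the design choice the abstract credits for the quality gains.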

    Fast Non-Rigid Radiance Fields from Monocularized Data

    3D reconstruction and novel view synthesis of dynamic scenes from collections of single views recently gained increased attention. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is severely limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360° novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) an efficient deformation module that decouples the processing of spatial and temporal information for acceleration at training and inference time; and 2) a static module representing the canonical scene as a fast hash-encoded neural radiance field. We evaluate the proposed approach on the established synthetic D-NeRF benchmark, which enables efficient reconstruction from a single monocular view per time frame randomly sampled from a full hemisphere. We refer to this form of inputs as monocularized data. To prove its practicality for real-world scenarios, we recorded twelve challenging sequences with human actors by sampling single frames from a synchronized multi-view rig. In both cases, our method is trained significantly faster than previous methods (minutes instead of days) while achieving higher visual accuracy for generated novel views. Our source code and data are available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.

    NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads

    We focus on reconstructing high-fidelity radiance fields of human heads, capturing their animations over time, and synthesizing re-renderings from novel viewpoints at arbitrary time steps. To this end, we propose a new multi-view capture setup composed of 16 calibrated machine vision cameras that record time-synchronized images at 7.1 MP resolution and 73 frames per second. With our setup, we collect a new dataset of over 4700 high-resolution, high-framerate sequences of more than 220 human heads, from which we introduce a new human head reconstruction benchmark. The recorded sequences cover a wide range of facial dynamics, including head motions, natural expressions, emotions, and spoken language. In order to reconstruct high-fidelity human heads, we propose Dynamic Neural Radiance Fields using Hash Ensembles (NeRSemble). We represent scene dynamics by combining a deformation field and an ensemble of 3D multi-resolution hash encodings. The deformation field allows for precise modeling of simple scene movements, while the ensemble of hash encodings helps to represent complex dynamics. As a result, we obtain radiance field representations of human heads that capture motion over time and facilitate re-rendering of arbitrary novel viewpoints. In a series of experiments, we explore the design choices of our method and demonstrate that our approach outperforms state-of-the-art dynamic radiance field approaches by a significant margin.
    Comment: SIGGRAPH 2023. Project page: https://tobias-kirschstein.github.io/nersemble/ ; Video: https://youtu.be/a-OAWqBzld
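    A rough sketch of that decomposition, under our own simplifying assumptions (a single-resolution XOR spatial hash standing in for full multi-resolution hash encodings, and made-up sizes):

```python
# Sketch only: deformation field + per-timestep blending of a hash-grid ensemble.
import torch
import torch.nn as nn

PRIMES = torch.tensor([1, 2654435761, 805459861])

def spatial_hash(x, table_size, res=64):
    ijk = (x.clamp(0, 1) * res).long()              # voxel index per point
    a, b, c = (ijk * PRIMES).unbind(-1)
    return (a ^ b ^ c) % table_size                 # classic XOR spatial hash

class BlendedHashField(nn.Module):
    def __init__(self, n_tables=4, n_steps=100, table_size=2**14, feat_dim=8):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(table_size, feat_dim) for _ in range(n_tables)]
        )
        self.blend = nn.Embedding(n_steps, n_tables)   # per-timestep blend weights
        self.deform = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
        self.n_steps, self.table_size = n_steps, table_size

    def forward(self, x, t):
        t_in = torch.full_like(x[..., :1], t / self.n_steps)
        x_canon = x + self.deform(torch.cat([x, t_in], -1))  # warp to canonical space
        idx = spatial_hash(x_canon, self.table_size)
        feats = torch.stack([tab(idx) for tab in self.tables], -1)  # (N, F, K)
        w = torch.softmax(self.blend(torch.tensor(t)), -1)          # (K,)
        return feats @ w                            # time-blended feature per point

field = BlendedHashField()
features = field(torch.rand(1024, 3), t=7)          # (1024, 8), fed to a radiance MLP
```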

    TeCH: Text-guided Reconstruction of Lifelike Clothed Humans

    Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how can we effectively capture, from a single image, all visual attributes of an individual that are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles), which are automatically generated via a garment parsing model and Visual Question Answering (VQA), and 2) a personalized fine-tuned text-to-image (T2I) diffusion model, which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts and the personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent and delicate texture and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality. The code will be publicly available for research purposes at https://huangyangyi.github.io/TeCH.
    Comment: Project: https://huangyangyi.github.io/TeCH, Code: https://github.com/huangyangyi/TeCH
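    Since the abstract leans on multi-view Score Distillation Sampling, a minimal sketch of one SDS step may help; here `denoiser` is a stand-in for the personalized T2I model, and the (1 - alpha_bar_t) weighting is a common choice, not necessarily the paper's.

```python
# Sketch only: the Score Distillation Sampling gradient for one rendered view.
import torch

def sds_grad(render, denoiser, text_emb, alphas_bar):
    t = torch.randint(1, len(alphas_bar), ())         # random diffusion timestep
    a = alphas_bar[t]
    eps = torch.randn_like(render)
    noisy = a.sqrt() * render + (1 - a).sqrt() * eps  # forward-diffuse the render
    with torch.no_grad():
        eps_hat = denoiser(noisy, t, text_emb)        # text-conditioned noise estimate
    return (1 - a) * (eps_hat - eps)                  # gradient w.r.t. the rendering

# Toy usage with a dummy denoiser; in practice the gradient is pushed back through
# the differentiable renderer, e.g. render.backward(gradient=g).
denoiser = lambda x, t, y: torch.randn_like(x)
render = torch.rand(3, 64, 64)
g = sds_grad(render, denoiser, None, torch.linspace(0.999, 0.01, 1000))
```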

    Fast Non-Rigid Radiance Fields from Monocularized Data

    The reconstruction and novel view synthesis of dynamic scenes recently gained increased attention. As reconstruction from large-scale multi-view data involves immense memory and computational requirements, recent benchmark datasets provide collections of single monocular views per timestamp sampled from multiple (virtual) cameras. We refer to this form of inputs as "monocularized" data. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is often limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) an efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) a static module representing the canonical scene as a fast hash-encoded neural radiance field. In addition to existing synthetic monocularized data, we systematically analyze the performance on real-world inward-facing scenes using a newly recorded challenging dataset sampled from a synchronized large-scale multi-view rig. In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views. Our source code and data are available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.
    Comment: 18 pages, 14 figures; project page: https://graphics.tu-bs.de/publications/kappel2022fast
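    As a minimal sketch of what "decoupling spatial and temporal processing" can look like (our own toy layout, not the paper's module): the temporal branch is a cheap per-frame embedding, the spatial branch a small MLP, and a fusion layer predicts the offset into the canonical scene.

```python
# Sketch only: deformation module with separate spatial and temporal branches.
import torch
import torch.nn as nn

class DecoupledDeformation(nn.Module):
    def __init__(self, n_frames=100, t_dim=16, hidden=64):
        super().__init__()
        self.temporal = nn.Embedding(n_frames, t_dim)       # per-frame latent (cheap)
        self.spatial = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(hidden + t_dim, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, x, frame_idx):
        s = self.spatial(x)                                 # time-independent branch
        t = self.temporal(frame_idx).expand(x.shape[0], -1) # frame-dependent branch
        return x + self.fuse(torch.cat([s, t], -1))         # warped canonical coords

warp = DecoupledDeformation()
x_canon = warp(torch.rand(4096, 3), torch.tensor(12))
# x_canon would then index the fast hash-encoded canonical radiance field.
```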

    Playing for Data: Ground Truth from Computer Games

    Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy and that the acquired data enables reducing the amount of hand-labeled real-world data: models trained with game data and just 1/3 of the CamVid training set outperform models trained on the complete CamVid training set.
    Comment: Accepted to the 14th European Conference on Computer Vision (ECCV 2016)
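    The propagation mechanism can be pictured with a hypothetical sketch: suppose each pixel carries a hash of the resources (mesh, texture, shader) of the draw call that produced it; labeling a resource combination once then labels every pixel it ever renders, within and across frames.

```python
# Sketch only: propagating one annotation to all pixels sharing a resource hash.
import numpy as np

label_of_resource = {}                        # resource hash -> semantic class

def annotate(resource_hash, semantic_class):
    label_of_resource[resource_hash] = semantic_class

def label_map(pixel_resource_ids):
    """pixel_resource_ids: (H, W) per-pixel resource hashes captured from the
    game-to-graphics-hardware communication."""
    out = np.full(pixel_resource_ids.shape, -1, dtype=np.int64)  # -1 = unlabeled
    for h, cls in label_of_resource.items():
        out[pixel_resource_ids == h] = cls
    return out

annotate(0x3F2A, 1)                           # label e.g. "road" (class 1) once
ids = np.full((4, 4), 0x3F2A)
print(label_map(ids))                         # every matching pixel inherits the label
```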