1,176 research outputs found
HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork
Neural Radiance Fields (NeRF) have become an increasingly popular
representation to capture high-quality appearance and shape of scenes and
objects. However, learning generalizable NeRF priors over categories of scenes
or objects has been challenging due to the high dimensionality of network
weight space. To address the limitations of existing work on generalization
and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent
conditioning method for learning generalizable category-level NeRF priors using
hypernetworks. Rather than using hypernetworks to estimate only the weights of
a NeRF, we estimate both the weights and the multi-resolution hash encodings,
resulting in significant quality gains. To improve quality even further, we
incorporate a denoise-and-finetune strategy that denoises images rendered from
NeRFs estimated by the hypernetwork and finetunes it while retaining multiview
consistency. These improvements enable us to use HyP-NeRF as a generalizable
prior for multiple downstream tasks including NeRF reconstruction from
single-view or cluttered scenes and text-to-NeRF. We provide qualitative
comparisons and evaluate HyP-NeRF on three tasks: generalization, compression,
and retrieval, demonstrating state-of-the-art results.
Fast Non-Rigid Radiance Fields from Monocularized Data
3D reconstruction and novel view synthesis of dynamic scenes from collections
of single views recently gained increased attention. Existing work shows
impressive results for synthetic setups and forward-facing real-world data, but
is severely limited in the training speed and angular range for generating
novel views. This paper addresses these limitations and proposes a new method
for full 360° novel view synthesis of non-rigidly deforming scenes. At the
core of our method are: 1) An efficient deformation module that decouples the
processing of spatial and temporal information for acceleration at training and
inference time; and 2) A static module representing the canonical scene as a
fast hash-encoded neural radiance field. We evaluate the proposed approach on
the established synthetic D-NeRF benchmark, that enables efficient
reconstruction from a single monocular view per time-frame randomly sampled
from a full hemisphere. We refer to this form of inputs as monocularized data.
To prove its practicality for real-world scenarios, we recorded twelve
challenging sequences with human actors by sampling single frames from a
synchronized multi-view rig. In both cases, our method is trained significantly
faster than previous methods (minutes instead of days) while achieving higher
visual accuracy for generated novel views. Our source code and data is
available at our project page
https://graphics.tu-bs.de/publications/kappel2022fast.
NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads
We focus on reconstructing high-fidelity radiance fields of human heads,
capturing their animations over time, and synthesizing re-renderings from novel
viewpoints at arbitrary time steps. To this end, we propose a new multi-view
capture setup composed of 16 calibrated machine vision cameras that record
time-synchronized images at 7.1 MP resolution and 73 frames per second. With
our setup, we collect a new dataset of over 4700 high-resolution,
high-framerate sequences of more than 220 human heads, from which we introduce
a new human head reconstruction benchmark. The recorded sequences cover a wide
range of facial dynamics, including head motions, natural expressions,
emotions, and spoken language. In order to reconstruct high-fidelity human
heads, we propose Dynamic Neural Radiance Fields using Hash Ensembles
(NeRSemble). We represent scene dynamics by combining a deformation field and
an ensemble of 3D multi-resolution hash encodings. The deformation field allows
for precise modeling of simple scene movements, while the ensemble of hash
encodings helps to represent complex dynamics. As a result, we obtain radiance
field representations of human heads that capture motion over time and
facilitate re-rendering of arbitrary novel viewpoints. In a series of
experiments, we explore the design choices of our method and demonstrate that
our approach outperforms state-of-the-art dynamic radiance field approaches by
a significant margin.
Comment: SIGGRAPH 2023, Project Page:
https://tobias-kirschstein.github.io/nersemble/ , Video:
https://youtu.be/a-OAWqBzld
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans
Despite recent research advancements in reconstructing clothed humans from a
single image, accurately restoring the "unseen regions" with high-level details
remains an unsolved challenge that lacks attention. Existing methods often
generate overly smooth back-side surfaces with a blurry texture. But how to
effectively capture all visual attributes of an individual from a single image,
which are sufficient to reconstruct unseen areas (e.g., the back view)?
Motivated by the power of foundation models, TeCH reconstructs the 3D human by
leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles)
which are automatically generated via a garment parsing model and Visual
Question Answering (VQA), 2) a personalized fine-tuned Text-to-Image diffusion
model (T2I) which learns the "indescribable" appearance. To represent
high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D
representation based on DMTet, which consists of an explicit body shape grid
and an implicit distance field. Guided by the descriptive prompts +
personalized T2I diffusion model, the geometry and texture of the 3D humans are
optimized through multi-view Score Distillation Sampling (SDS) and
reconstruction losses based on the original observation. TeCH produces
high-fidelity 3D clothed humans with consistent & delicate texture, and
detailed full-body geometry. Quantitative and qualitative experiments
demonstrate that TeCH outperforms the state-of-the-art methods in terms of
reconstruction accuracy and rendering quality. The code will be publicly
available for research purposes at https://huangyangyi.github.io/TeCH
Comment: Project: https://huangyangyi.github.io/TeCH, Code:
https://github.com/huangyangyi/TeC
Fast Non-Rigid Radiance Fields from Monocularized Data
The reconstruction and novel view synthesis of dynamic scenes recently gained
increased attention. As reconstruction from large-scale multi-view data
involves immense memory and computational requirements, recent benchmark
datasets provide collections of single monocular views per timestamp sampled
from multiple (virtual) cameras. We refer to this form of inputs as
"monocularized" data. Existing work shows impressive results for synthetic
setups and forward-facing real-world data, but is often limited in the training
speed and angular range for generating novel views. This paper addresses these
limitations and proposes a new method for full 360° inward-facing novel
view synthesis of non-rigidly deforming scenes. At the core of our method are:
1) An efficient deformation module that decouples the processing of spatial and
temporal information for accelerated training and inference; and 2) A static
module representing the canonical scene as a fast hash-encoded neural radiance
field. In addition to existing synthetic monocularized data, we systematically
analyze the performance on real-world inward-facing scenes using a newly
recorded challenging dataset sampled from a synchronized large-scale multi-view
rig. In both cases, our method is significantly faster than previous methods,
converging in less than 7 minutes and achieving real-time framerates at 1K
resolution, while obtaining a higher visual accuracy for generated novel views.
Our source code and data is available at our project page
https://graphics.tu-bs.de/publications/kappel2022fast.
Comment: 18 pages, 14 figures; project page:
https://graphics.tu-bs.de/publications/kappel2022fas
Playing for Data: Ground Truth from Computer Games
Recent progress in computer vision has been driven by high-capacity models
trained on large datasets. Unfortunately, creating large datasets with
pixel-level labels has been extremely costly due to the amount of human effort
required. In this paper, we present an approach to rapidly creating
pixel-accurate semantic label maps for images extracted from modern computer
games. Although the source code and the internal operation of commercial games
are inaccessible, we show that associations between image patches can be
reconstructed from the communication between the game and the graphics
hardware. This enables rapid propagation of semantic labels within and across
images synthesized by the game, with no access to the source code or the
content. We validate the presented approach by producing dense pixel-level
semantic annotations for 25 thousand images synthesized by a photorealistic
open-world computer game. Experiments on semantic segmentation datasets show
that using the acquired data to supplement real-world images significantly
increases accuracy and that the acquired data enables reducing the amount of
hand-labeled real-world data: models trained with game data and just 1/3 of the
CamVid training set outperform models trained on the complete CamVid training
set.
Comment: Accepted to the 14th European Conference on Computer Vision (ECCV
2016)