FLARE: Fast Learning of Animatable and Relightable Mesh Avatars
Our goal is to efficiently learn personalized animatable 3D head avatars from
videos that are geometrically accurate, realistic, relightable, and compatible
with current rendering systems. While 3D meshes enable efficient processing and
are highly portable, they lack realism in terms of shape and appearance. Neural
representations, on the other hand, are realistic but lack compatibility and
are slow to train and render. Our key insight is that it is possible to
efficiently learn high-fidelity 3D mesh representations via differentiable
rendering by exploiting highly-optimized methods from traditional computer
graphics and approximating some of the components with neural networks. To that
end, we introduce FLARE, a technique that enables the creation of animatable
and relightable mesh avatars from a single monocular video. First, we learn a
canonical geometry using a mesh representation, enabling efficient
differentiable rasterization and straightforward animation via learned
blendshapes and linear blend skinning weights. Second, we follow
physically-based rendering and factor observed colors into intrinsic albedo,
roughness, and a neural representation of the illumination, allowing the
learned avatars to be relit in novel scenes. Since our input videos are
captured on a single device with a narrow field of view, modeling the
surrounding environment light is non-trivial. Based on the split-sum
approximation for modeling specular reflections, we address this by
approximating the pre-filtered environment map with a multi-layer perceptron
(MLP) modulated by the surface roughness, eliminating the need to explicitly
model the light. We demonstrate that our mesh-based avatar formulation,
combined with learned deformation, material, and lighting MLPs, produces
avatars with high-quality geometry and appearance, while also being efficient
to train and render compared to existing approaches.
Comment: 15 pages. Accepted: ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 2023.
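To make the split-sum idea above concrete, here is a minimal, hypothetical PyTorch sketch of how an MLP modulated by surface roughness could stand in for a pre-filtered environment map. The names (NeuralEnvMap, specular_shading) are illustrative inventions, the pre-integrated BRDF term is folded away for brevity, and this is not FLARE's actual implementation.

```python
import torch
import torch.nn as nn

class NeuralEnvMap(nn.Module):
    """Hypothetical stand-in for a pre-filtered environment map.

    Under the split-sum approximation, the specular term factors into a
    pre-integrated BRDF lookup and a pre-filtered environment map queried at
    the reflected direction, blurred more as roughness grows.  Here an MLP
    replaces that mip-mapped cubemap: it maps (reflected direction,
    roughness) -> pre-integrated incoming radiance.
    """

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Softplus(),  # radiance is non-negative
        )

    def forward(self, refl_dir: torch.Tensor, roughness: torch.Tensor) -> torch.Tensor:
        # refl_dir: (N, 3) unit reflection vectors; roughness: (N, 1) in [0, 1]
        return self.mlp(torch.cat([refl_dir, roughness], dim=-1))

def specular_shading(normal, view_dir, roughness, env: NeuralEnvMap):
    """Split-sum style specular term.  A real renderer would also multiply by
    a pre-integrated BRDF term (a 2D LUT over (n.v, roughness)); omitted here."""
    refl = 2.0 * (normal * view_dir).sum(-1, keepdim=True) * normal - view_dir
    return env(refl, roughness)  # neural environment-map lookup

# toy usage
env = NeuralEnvMap()
n = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
v = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
r = torch.rand(8, 1)
print(specular_shading(n, v, r, env).shape)  # torch.Size([8, 3])
```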
SplatArmor: Articulated Gaussian splatting for animatable humans from monocular RGB videos
We propose SplatArmor, a novel approach for recovering detailed and
animatable human models by `armoring' a parameterized body model with 3D
Gaussians. Our approach represents the human as a set of 3D Gaussians within a
canonical space, whose articulation is defined by extending the skinning of the
underlying SMPL geometry to arbitrary locations in the canonical space. To
account for pose-dependent effects, we introduce an SE(3) field, which allows us
to capture both the location and anisotropy of the Gaussians. Furthermore, we
propose the use of a neural color field to provide color regularization and 3D
supervision for the precise positioning of these Gaussians. We show that
Gaussian splatting provides an interesting alternative to neural rendering-based
methods by leveraging a rasterization primitive, without facing any of the
non-differentiability and optimization challenges typically faced in such
approaches. The rasterization paradigm allows us to leverage forward skinning,
and does not suffer from the ambiguities associated with inverse skinning and
warping. We show compelling results on the ZJU MoCap and People Snapshot
datasets, which underscore the effectiveness of our method for controllable
human synthesis.
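As a rough illustration of "armoring" a parameterized body model, the NumPy sketch below extends per-vertex skinning weights of a template to arbitrary canonical-space Gaussian centers (here via a simple k-nearest-vertex inverse-distance blend, which is only one plausible choice) and then poses the centers with linear blend skinning. The paper's SE(3) field and neural color field are not reproduced, and all names are hypothetical.

```python
import numpy as np

def query_skinning_weights(points, template_verts, template_weights, k=4, eps=1e-8):
    """Extend per-vertex skinning weights of a template (e.g. SMPL) to arbitrary
    canonical-space points via an inverse-distance blend over the k nearest
    template vertices.  This is an illustrative choice, not necessarily the
    scheme used in the paper."""
    d = np.linalg.norm(points[:, None, :] - template_verts[None, :, :], axis=-1)  # (P, V)
    idx = np.argsort(d, axis=1)[:, :k]                                            # (P, k)
    nn_d = np.take_along_axis(d, idx, axis=1)
    w = 1.0 / (nn_d + eps)
    w /= w.sum(axis=1, keepdims=True)
    nn_weights = template_weights[idx]                                            # (P, k, J)
    return (w[..., None] * nn_weights).sum(axis=1)                                # (P, J)

def lbs_points(points, skin_weights, joint_transforms):
    """Pose canonical points (e.g. Gaussian centers) by linear blend skinning.
    joint_transforms: (J, 4, 4) world-space bone transforms."""
    T = np.einsum('pj,jab->pab', skin_weights, joint_transforms)   # blended transforms
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return np.einsum('pab,pb->pa', T, homo)[:, :3]

# toy usage with random data standing in for SMPL
V, J, P = 100, 24, 10
verts = np.random.randn(V, 3)
w_tmpl = np.random.dirichlet(np.ones(J), size=V)
gauss_centers = np.random.randn(P, 3)
w = query_skinning_weights(gauss_centers, verts, w_tmpl)
posed = lbs_points(gauss_centers, w, np.tile(np.eye(4), (J, 1, 1)))
print(posed.shape)  # (10, 3)
```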
NSF: Neural Surface Fields for Human Modeling from Monocular Depth
Obtaining personalized 3D animatable avatars from a monocular camera has
several real-world applications in gaming, virtual try-on, animation, and
VR/XR. However, it is very challenging to model dynamic and fine-grained
clothing deformations from such sparse data. Existing methods for modeling 3D
humans from depth data have limitations in terms of computational efficiency,
mesh coherency, and flexibility in resolution and topology. For instance,
reconstructing shapes using implicit functions and extracting explicit meshes
per frame is computationally expensive and cannot ensure coherent meshes across
frames. Moreover, predicting per-vertex deformations on a pre-designed human
template with a discrete surface lacks flexibility in resolution and topology.
To overcome these limitations, we propose a novel method, NSF: Neural
Surface Fields, for modeling 3D clothed humans from monocular depth. NSF
defines a neural field solely on the base surface, modeling a continuous and
flexible displacement field. At inference time, NSF can be adapted to base
surfaces of different resolutions and topologies without retraining.
Compared to existing approaches, our method eliminates the expensive per-frame
surface extraction while maintaining mesh coherency, and is capable of
reconstructing meshes with arbitrary resolution without retraining. To foster
research in this direction, we release our code on the project page at
https://yuxuan-xue.com/nsf.
Comment: Accepted to ICCV 2023. Homepage: https://yuxuan-xue.com/nsf
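The core idea of a field defined solely on the base surface can be sketched in PyTorch as follows: a displacement MLP takes a point on the canonical base surface plus a per-frame conditioning code (the form of the conditioning is an assumption here) and returns an offset, so the same learned field can be evaluated on coarse or fine samplings of the base surface without retraining. Names and architecture are illustrative, not the released NSF code.

```python
import torch
import torch.nn as nn

class NeuralSurfaceField(nn.Module):
    """Illustrative displacement field anchored on a base surface: for any point
    lying on the canonical base mesh it predicts an offset.  Because the field
    is continuous in the surface point itself, it can be queried on a base mesh
    of any resolution or topology without retraining."""

    def __init__(self, cond_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, surface_pts: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # surface_pts: (N, 3) points on the canonical base surface
        # cond: (N, cond_dim) per-frame conditioning (e.g. a pose code; an assumption)
        offsets = self.mlp(torch.cat([surface_pts, cond], dim=-1))
        return surface_pts + offsets  # clothed surface in canonical space

# toy usage: the same field evaluated on coarse and fine samplings of the base surface
nsf = NeuralSurfaceField()
code = torch.zeros(1, 32)
coarse = torch.rand(500, 3)
fine = torch.rand(20000, 3)
print(nsf(coarse, code.expand(500, -1)).shape,
      nsf(fine, code.expand(20000, -1)).shape)
```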
Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars
Neural radiance fields are capable of reconstructing high-quality drivable
human avatars but are expensive to train and render. To reduce this cost, we
propose Animatable 3D Gaussian, which learns human avatars from input images
and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of
skinned 3D Gaussians and a corresponding skeleton in canonical space and
deforming 3D Gaussians to posed space according to the input poses. We
introduce hash-encoded shape and appearance to speed up training and propose
time-dependent ambient occlusion to achieve high-quality reconstructions in
scenes containing complex motions and dynamic shadows. On both novel view
synthesis and novel pose synthesis tasks, our method outperforms existing
methods in terms of training time, rendering speed, and reconstruction quality.
Our method can be easily extended to multi-human scenes and achieves comparable
novel view synthesis results on a scene with ten people in only 25 seconds of
training.
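A minimal NumPy sketch of the canonical-to-posed deformation of skinned 3D Gaussians, assuming per-Gaussian skinning weights and standard linear blend skinning: the blended bone transform moves each mean, and its linear part rotates the covariance. The hash-encoded shape/appearance and time-dependent ambient occlusion described above are omitted, and this is not the authors' implementation.

```python
import numpy as np

def pose_gaussians(means, covs, skin_weights, bone_transforms):
    """Deform canonical 3D Gaussians into posed space.

    means:           (N, 3)    canonical Gaussian centers
    covs:            (N, 3, 3) canonical covariance matrices
    skin_weights:    (N, J)    per-Gaussian skinning weights (rows sum to 1)
    bone_transforms: (J, 4, 4) bone transforms for the target pose

    The blended transform is applied to the mean; its linear part is applied
    to the covariance (Sigma' = A Sigma A^T).  A sketch, not the paper's code.
    """
    T = np.einsum('nj,jab->nab', skin_weights, bone_transforms)      # (N, 4, 4)
    A = T[:, :3, :3]
    t = T[:, :3, 3]
    posed_means = np.einsum('nab,nb->na', A, means) + t
    posed_covs = np.einsum('nab,nbc,ndc->nad', A, covs, A)           # A Sigma A^T
    return posed_means, posed_covs

# toy usage
N, J = 1000, 24
means = np.random.randn(N, 3)
covs = np.tile(np.eye(3) * 1e-4, (N, 1, 1))
w = np.random.dirichlet(np.ones(J), size=N)
bones = np.tile(np.eye(4), (J, 1, 1))
m, c = pose_gaussians(means, covs, w, bones)
print(m.shape, c.shape)  # (1000, 3) (1000, 3, 3)
```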
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to its wide range of
applications and the demonstrated potential of recent works, AIGC developments
have recently attracted significant attention, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we discuss the challenges and
potential future research directions.
PERGAMO: Personalized 3D Garments from Monocular Video
Clothing plays a fundamental role in digital humans. Current approaches to
animate 3D garments are mostly based on realistic physics simulation; however,
they typically suffer from two main issues: a high computational run-time cost,
which hinders their development, and a simulation-to-real gap, which impedes the
synthesis of specific real-world cloth samples. To circumvent both issues, we
propose PERGAMO, a data-driven approach to learn a deformable model for 3D
garments from monocular images. To this end, we first introduce a novel method
to reconstruct the 3D geometry of garments from a single image, and use it to
build a dataset of clothing from monocular videos. We use these 3D
reconstructions to train a regression model that accurately predicts how the
garment deforms as a function of the underlying body pose. We show that our
method is capable of producing garment animations that match real-world
behaviour and generalizes to unseen body motions extracted from a motion
capture dataset.
Comment: Published at Computer Graphics Forum (Proc. of ACM/SIGGRAPH SCA), 2022. Project website: http://mslab.es/projects/PERGAMO
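A minimal, hypothetical PyTorch sketch of the kind of pose-to-deformation regressor the abstract describes: an MLP maps the underlying body pose to per-vertex offsets over a garment template and is supervised with per-frame 3D reconstructions. PERGAMO's actual inputs, architecture, and losses differ, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class GarmentRegressor(nn.Module):
    """Hypothetical regressor: body pose -> per-vertex garment deformation."""

    def __init__(self, n_verts: int, pose_dim: int = 72, hidden: int = 512):
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_verts * 3),
        )

    def forward(self, pose: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # pose: (B, pose_dim), e.g. an SMPL body pose; template: (n_verts, 3) rest garment
        offsets = self.mlp(pose).view(-1, self.n_verts, 3)
        return template[None] + offsets   # deformed garment per frame

# toy training step against monocular reconstructions (random stand-ins here)
n_verts = 4000
model = GarmentRegressor(n_verts)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
template = torch.randn(n_verts, 3)
pose = torch.randn(8, 72)
target = torch.randn(8, n_verts, 3)       # per-frame garment reconstructions
pred = model(pose, template)
loss = (pred - target).abs().mean()       # e.g. an L1 vertex loss
loss.backward()
opt.step()
print(float(loss))
```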
Instant Volumetric Head Avatars
We present Instant Volumetric Head Avatars (INSTA), a novel approach for
reconstructing photo-realistic digital avatars instantaneously. INSTA models a
dynamic neural radiance field based on neural graphics primitives embedded
around a parametric face model. Our pipeline is trained on a single monocular
RGB portrait video that observes the subject under different expressions and
views. While state-of-the-art methods take up to several days to train an
avatar, our method can reconstruct a digital avatar in less than 10 minutes on
modern GPU hardware, which is orders of magnitude faster than previous
solutions. In addition, it allows for the interactive rendering of novel poses
and expressions. By leveraging the geometry prior of the underlying parametric
face model, we demonstrate that INSTA extrapolates to unseen poses. In
quantitative and qualitative studies on various subjects, INSTA outperforms
state-of-the-art methods regarding rendering quality and training time.
Comment: Website: https://zielon.github.io/insta/ Video: https://youtu.be/HOgaeWTih7
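As a generic illustration of embedding a volumetric field around a tracked parametric face mesh, the NumPy sketch below maps sample points from deformed (pose/expression) space into canonical space using the affine frame of the nearest triangle. INSTA's actual warping, multiresolution hash encoding, and NeRF rendering are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def warp_to_canonical(x, deformed_tris, canonical_tris):
    """Map points sampled around a tracked (deformed) face mesh into the
    canonical mesh's space using the nearest triangle's affine frame.
    A generic sketch of 'embedding the field around a parametric face model';
    INSTA's warping and hash-grid field are not reproduced here.

    x:              (N, 3)    sample points in deformed space
    deformed_tris:  (T, 3, 3) triangle vertices of the tracked mesh
    canonical_tris: (T, 3, 3) corresponding triangles of the canonical mesh
    """
    def_centroids = deformed_tris.mean(axis=1)                        # (T, 3)
    nearest = np.argmin(np.linalg.norm(
        x[:, None, :] - def_centroids[None, :, :], axis=-1), axis=1)  # (N,)
    out = np.empty_like(x)
    for i, t in enumerate(nearest):
        # affine frame spanned by two edges and the (unnormalized) normal
        d0, d1, d2 = deformed_tris[t]
        c0, c1, c2 = canonical_tris[t]
        Fd = np.stack([d1 - d0, d2 - d0, np.cross(d1 - d0, d2 - d0)], axis=1)
        Fc = np.stack([c1 - c0, c2 - c0, np.cross(c1 - c0, c2 - c0)], axis=1)
        local = np.linalg.solve(Fd, x[i] - d0)     # coordinates in the deformed frame
        out[i] = Fc @ local + c0                   # re-express in the canonical frame
    return out

# toy usage with one triangle
deformed = np.array([[[0., 0, 0], [1, 0, 0], [0, 1, 0]]])
canonical = deformed + 0.1
pts = np.array([[0.2, 0.2, 0.05]])
print(warp_to_canonical(pts, deformed, canonical))  # point near the canonical triangle
```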
Relightable and Animatable Neural Avatar from Sparse-View Video
This paper tackles the challenge of creating relightable and animatable
neural avatars from sparse-view (or even monocular) videos of dynamic humans
under unknown illumination. Compared to studio environments, this setting is
more practical and accessible but poses an extremely challenging ill-posed
problem. Previous neural human reconstruction methods are able to reconstruct
animatable avatars from sparse views using deformed Signed Distance Fields
(SDF) but cannot recover material parameters for relighting. While
differentiable inverse rendering-based methods have succeeded in material
recovery of static objects, it is not straightforward to extend them to dynamic
humans as it is computationally intensive to compute pixel-surface intersection
and light visibility on deformed SDFs for inverse rendering. To solve this
challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to
approximate the world space distances under arbitrary human poses.
Specifically, we estimate coarse distances based on a parametric human model
and compute fine distances by exploiting the local deformation invariance of
SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently
estimate the surface intersection and light visibility. This allows us to
develop the first system to recover animatable and relightable neural avatars
from sparse view (or monocular) inputs. Experiments demonstrate that our
approach is able to produce superior results compared to state-of-the-art
methods. Our code will be released for reproducibility.
Comment: Project page: https://zju3dv.github.io/relightable_avata
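A toy NumPy sketch of the hierarchical idea behind HDQ combined with sphere tracing: far from the surface, a cheap coarse distance (standing in for the parametric-body estimate) drives the march, and inside a narrow band a fine SDF takes over. The paper's use of local deformation invariance to evaluate the fine SDF under arbitrary poses is omitted, and all names are illustrative.

```python
import numpy as np

def hierarchical_distance(x, coarse_sdf, fine_sdf, band=0.1):
    """Sketch of a hierarchical distance query: far from the body, fall back on
    a cheap coarse distance (e.g. from a parametric body model); within a
    narrow band, use the fine (learned, canonical) SDF."""
    d_coarse = coarse_sdf(x)
    near = np.abs(d_coarse) < band
    d = d_coarse.copy()
    if near.any():
        d[near] = fine_sdf(x[near])
    return d

def sphere_trace(origins, dirs, dist_fn, n_steps=64, eps=1e-4):
    """March each ray by the queried distance until the surface is hit."""
    t = np.zeros(len(origins))
    hit = np.zeros(len(origins), dtype=bool)
    for _ in range(n_steps):
        x = origins + t[:, None] * dirs
        d = dist_fn(x)
        hit |= d < eps
        t = np.where(hit, t, t + d)
    return t, hit

# toy usage: coarse = sphere of radius 0.55, fine = sphere of radius 0.5
coarse = lambda x: np.linalg.norm(x, axis=-1) - 0.55
fine = lambda x: np.linalg.norm(x, axis=-1) - 0.5
o = np.array([[0.0, 0.0, -2.0]])
d = np.array([[0.0, 0.0, 1.0]])
t, hit = sphere_trace(o, d, lambda x: hierarchical_distance(x, coarse, fine))
print(t, hit)   # the ray hits near t = 1.5
```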
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans
Despite recent research advancements in reconstructing clothed humans from a
single image, accurately restoring the "unseen regions" with high-level details
remains an unsolved challenge that lacks attention. Existing methods often
generate overly smooth back-side surfaces with a blurry texture. But how can we
effectively capture all visual attributes of an individual from a single image,
such that they suffice to reconstruct unseen areas (e.g., the back view)?
Motivated by the power of foundation models, TeCH reconstructs the 3D human by
leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles)
which are automatically generated via a garment parsing model and Visual
Question Answering (VQA), and 2) a personalized fine-tuned Text-to-Image diffusion
model (T2I) which learns the "indescribable" appearance. To represent
high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D
representation based on DMTet, which consists of an explicit body shape grid
and an implicit distance field. Guided by the descriptive prompts and the
personalized T2I diffusion model, the geometry and texture of the 3D humans are
optimized through multi-view Score Distillation Sampling (SDS) and
reconstruction losses based on the original observation. TeCH produces
high-fidelity 3D clothed humans with consistent and delicate texture, and
detailed full-body geometry. Quantitative and qualitative experiments
demonstrate that TeCH outperforms the state-of-the-art methods in terms of
reconstruction accuracy and rendering quality. The code will be publicly
available for research purposes at https://huangyangyi.github.io/TeCH.
Comment: Project: https://huangyangyi.github.io/TeCH, Code: https://github.com/huangyangyi/TeCH
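For readers unfamiliar with Score Distillation Sampling, here is a compact, hypothetical PyTorch sketch of the SDS gradient used to optimize a differentiable 3D representation against a frozen diffusion model: noise the rendering, let the denoiser predict that noise, and build a surrogate loss whose gradient pushes the rendering toward the model's expectation. The noise predictor below is a placeholder, not the personalized T2I model, and TeCH's multi-view pipeline and reconstruction losses are not shown.

```python
import torch

def sds_surrogate_loss(rendered, noise_predictor, t, alphas_cumprod):
    """Sketch of Score Distillation Sampling (SDS).

    rendered:        (B, C, H, W) differentiably rendered image (requires_grad)
    noise_predictor: callable (noisy_image, t) -> predicted noise, same shape
    alphas_cumprod:  (T,) diffusion schedule
    """
    a = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise     # forward diffusion
    with torch.no_grad():
        eps_pred = noise_predictor(noisy, t)                 # frozen denoiser
    w = 1.0 - a                                              # a common weighting choice
    grad = w * (eps_pred - noise)
    # Surrogate loss whose gradient w.r.t. `rendered` equals `grad`:
    return (grad.detach() * rendered).sum()

# toy usage with a dummy denoiser standing in for the diffusion model
dummy_denoiser = lambda x, t: torch.zeros_like(x)
alphas = torch.linspace(0.999, 0.01, 1000)
img = torch.rand(1, 3, 64, 64, requires_grad=True)
loss = sds_surrogate_loss(img, dummy_denoiser, torch.tensor(500), alphas)
loss.backward()
print(img.grad.abs().mean())
```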