DiffuStereo: High Quality Human Reconstruction via Diffusion-based Stereo Using Sparse Cameras
We propose DiffuStereo, a novel system using only sparse cameras (8 in this
work) for high-quality 3D human reconstruction. At its core is a novel
diffusion-based stereo module, which introduces diffusion models, a powerful
class of generative models, into the iterative stereo matching network. To this
end, we design a new diffusion kernel and additional stereo constraints to
facilitate stereo matching and depth estimation in the network. We further
present a multi-level stereo network architecture to handle high-resolution (up
to 4k) inputs without requiring an unaffordable memory footprint. Given a set of
sparse-view color images of a human, the proposed multi-level diffusion-based
stereo network can produce highly accurate depth maps, which are then converted
into a high-quality 3D human model through an efficient multi-view fusion
strategy. Overall, our method enables automatic reconstruction of human models
with quality on par with high-end dense-view camera rigs, and this is achieved
using a much more lightweight hardware setup. Experiments show that our method
outperforms state-of-the-art methods by a large margin both qualitatively and
quantitatively. Comment: Accepted by ECCV202
Learning Implicit Templates for Point-Based Clothed Human Modeling
We present FITE, a First-Implicit-Then-Explicit framework for modeling human
avatars in clothing. Our framework first learns implicit surface templates
representing the coarse clothing topology, and then employs the templates to
guide the generation of point sets which further capture pose-dependent
clothing deformations such as wrinkles. Our pipeline incorporates the merits of
both implicit and explicit representations, namely, the ability to handle
varying topology and the ability to efficiently capture fine details. We also
propose diffused skinning to facilitate template training especially for loose
clothing, and projection-based pose-encoding to extract pose information from
mesh templates without predefined UV map or connectivity. Our code is publicly
available at https://github.com/jsnln/fite. Comment: Accepted to ECCV 202
Tensor4D: Efficient Neural 4D Decomposition for High-fidelity Dynamic Reconstruction and Rendering
We present Tensor4D, an efficient yet effective approach to dynamic scene
modeling. The key to our solution is an efficient 4D tensor decomposition
method so that the dynamic scene can be directly represented as a 4D
spatio-temporal tensor. To tackle the accompanying memory issue, we decompose
the 4D tensor hierarchically by projecting it first into three time-aware
volumes and then nine compact feature planes. In this way, spatial information
over time can be simultaneously captured in a compact and memory-efficient
manner. When applying Tensor4D to dynamic scene reconstruction and rendering,
we further factorize the 4D fields into different scales so that
structural motions and dynamic detailed changes can be learned from coarse to
fine. The effectiveness of our method is validated on both synthetic and
real-world scenes. Extensive experiments show that our method is able to
achieve high-quality dynamic reconstruction and rendering from sparse-view
camera rigs or even a monocular camera. The code and dataset will be released
at https://liuyebin.com/tensor4d/tensor4d.html
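
The hierarchical decomposition described in this abstract (a 4D tensor projected into three time-aware volumes, then nine feature planes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the axis pairings in `VOLUMES`, the bilinear lookup, the concatenation of plane features, and all function names are assumptions made for illustration only.

```python
import numpy as np

# Index of each coordinate in a 4D sample point (x, y, z, t).
AXES = {"x": 0, "y": 1, "z": 2, "t": 3}

# Assumption: each time-aware volume pairs two spatial axes with time.
VOLUMES = [("x", "y", "t"), ("x", "z", "t"), ("y", "z", "t")]


def plane_pairs(volume):
    """Split one 3D volume into its three 2D axis pairs."""
    a, b, c = volume
    return [(a, b), (a, c), (b, c)]


# Three volumes times three planes each gives nine planes.
PLANES = [pair for volume in VOLUMES for pair in plane_pairs(volume)]


def make_planes(resolution, channels, rng):
    """Allocate one random feature plane per axis pair (hypothetical init)."""
    return [rng.standard_normal((resolution, resolution, channels))
            for _ in PLANES]


def bilinear(plane, u, v):
    """Bilinearly sample a plane at continuous coordinates u, v in [0, 1]."""
    res = plane.shape[0]
    fu, fv = u * (res - 1), v * (res - 1)
    i0, j0 = int(np.floor(fu)), int(np.floor(fv))
    i1, j1 = min(i0 + 1, res - 1), min(j0 + 1, res - 1)
    du, dv = fu - i0, fv - j0
    return ((1 - du) * (1 - dv) * plane[i0, j0]
            + du * (1 - dv) * plane[i1, j0]
            + (1 - du) * dv * plane[i0, j1]
            + du * dv * plane[i1, j1])


def query(planes, point):
    """Concatenate the features sampled from all nine planes at a 4D point."""
    feats = [bilinear(plane, point[AXES[a]], point[AXES[b]])
             for plane, (a, b) in zip(planes, PLANES)]
    return np.concatenate(feats)


rng = np.random.default_rng(0)
planes = make_planes(resolution=8, channels=4, rng=rng)
feature = query(planes, np.array([0.25, 0.5, 0.75, 0.1]))
```

With resolution R and C channels, the nine planes in this sketch store 9·R²·C values, whereas a dense 4D grid would store R⁴·C, which is the kind of memory saving the abstract refers to.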
Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor
Recent years have witnessed considerable achievements in editing images with
text instructions. When applying these editors to dynamic scene editing, the
new-style scene tends to be temporally inconsistent due to the frame-by-frame
nature of these 2D editors. To tackle this issue, we propose Control4D, a novel
approach for high-fidelity and temporally consistent 4D portrait editing.
Control4D is built upon an efficient 4D representation with a 2D
diffusion-based editor. Instead of using direct supervision from the editor,
our method learns a 4D GAN from it and avoids the inconsistent supervision
signals. Specifically, we employ a discriminator to learn the generation
distribution based on the edited images and then update the generator with the
discrimination signals. For more stable training, multi-level information is
extracted from the edited images and used to facilitate the learning of the
generator. Experimental results show that Control4D surpasses previous
approaches and achieves more photo-realistic and consistent 4D editing
performance. The link to our project website is
https://control4darxiv.github.io.
High-fidelity human avatars from a single RGB camera
In this paper, we propose a coarse-to-fine framework to reconstruct a personalized high-fidelity human avatar from a monocular video. To deal with the misalignment problem caused by the changed poses and shapes in different frames, we design a dynamic surface network to recover pose-dependent surface deformations, which help to decouple the shape and texture of the person. To cope with the complexity of textures and generate photo-realistic results, we propose a reference-based neural rendering network and exploit a bottom-up sharpening-guided fine-tuning strategy to obtain detailed textures. Our framework also enables photo-realistic novel view/pose synthesis and shape editing applications. Experimental results on both the public dataset and our collected dataset demonstrate that our method outperforms the state-of-the-art methods. The code and dataset will be available at http://cic.tju.edu.cn/faculty/likun/projects/HF-Avatar