Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6mm. Results on three different datasets demonstrate the efficacy and accuracy of our approach.
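As an illustration of the canonical-space design described in this abstract, the sketch below shows one way per-frame image encodings can be fused into a single pose-invariant code from which body-model parameters and per-vertex clothing displacements are predicted. All module names, dimensions, and the mean-pooling fusion are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' code): fuse pose-invariant latent codes from a
# variable number of frames into one canonical-space shape estimate.
import torch
import torch.nn as nn


class CanonicalShapePredictor(nn.Module):
    def __init__(self, latent_dim=256, num_betas=10, num_vertices=6890):
        super().__init__()
        # Per-frame encoder: image -> pose-invariant latent code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Heads: statistical body-model parameters and per-vertex displacements
        # (clothing/hair offsets) predicted in a canonical T-pose space.
        self.beta_head = nn.Linear(latent_dim, num_betas)
        self.displacement_head = nn.Linear(latent_dim, num_vertices * 3)

    def forward(self, frames):
        # frames: (num_frames, 3, H, W) -- a variable number of input frames.
        codes = self.encoder(frames)                        # (num_frames, latent_dim)
        fused = codes.mean(dim=0)                           # fuse information across frames
        betas = self.beta_head(fused)                       # body-shape parameters
        disps = self.displacement_head(fused).view(-1, 3)   # per-vertex offsets
        return betas, disps


model = CanonicalShapePredictor()
betas, disps = model(torch.randn(4, 3, 128, 128))           # e.g. 4 input frames
print(betas.shape, disps.shape)                              # torch.Size([10]) torch.Size([6890, 3])
```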
Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction
Reconstructing 3D clothed human avatars from single images is a challenging
task, especially when encountering complex poses and loose clothing. Current
methods exhibit limitations in performance, largely attributable to their
dependence on insufficient 2D image features and inconsistent query methods.
To address this, we present the Global-correlated 3D-decoupling Transformer for
clothed Avatar reconstruction (GTA), a novel transformer-based architecture
that reconstructs clothed human avatars from monocular images. Our approach
leverages transformer architectures by utilizing a Vision Transformer model as
an encoder for capturing global-correlated image features. Subsequently, our
innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane
features, using learnable embeddings as queries for cross-plane generation. To
effectively enhance feature fusion with the tri-plane 3D feature and human body
prior, we propose a hybrid prior fusion strategy combining spatial and
prior-enhanced queries, leveraging the benefits of spatial localization and
human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0
datasets illustrate that our method outperforms state-of-the-art approaches in
both geometry and texture reconstruction, exhibiting high robustness to
challenging poses and loose clothing, and producing higher-resolution textures.
Codes will be available at https://github.com/River-Zhang/GTA.
Comment: Accepted by NeurIPS 2023. Project page: https://river-zhang.github.io/GTA-projectpage
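To make the decoupling step described above concrete, the sketch below shows learnable plane embeddings used as cross-attention queries over global image tokens to generate three decoupled plane feature maps. It is a simplified, assumed reading of the abstract, not the code released in the GTA repository; all dimensions and module names are illustrative.

```python
# Minimal sketch (assumptions, not the released GTA code): learnable embeddings
# as queries cross-attend to global ViT tokens to produce three plane feature maps.
import torch
import torch.nn as nn


class TriplaneDecoder(nn.Module):
    def __init__(self, token_dim=384, plane_res=32, channels=32):
        super().__init__()
        self.plane_res = plane_res
        self.channels = channels
        # One learnable query per tri-plane cell, for each of the three planes.
        self.queries = nn.Parameter(torch.randn(3, plane_res * plane_res, token_dim))
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)
        self.to_feat = nn.Linear(token_dim, channels)

    def forward(self, tokens):
        # tokens: (B, num_tokens, token_dim) global-correlated features from a ViT encoder.
        B = tokens.shape[0]
        planes = []
        for p in range(3):
            q = self.queries[p].unsqueeze(0).expand(B, -1, -1)    # (B, R*R, D)
            attended, _ = self.cross_attn(q, tokens, tokens)      # cross-plane generation
            feat = self.to_feat(attended)                         # (B, R*R, C)
            planes.append(feat.transpose(1, 2).reshape(
                B, self.channels, self.plane_res, self.plane_res))
        return planes  # three decoupled feature planes, each (B, C, R, R)


decoder = TriplaneDecoder()
planes = decoder(torch.randn(2, 197, 384))   # e.g. ViT tokens for a batch of 2 images
print([p.shape for p in planes])
```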
Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop
We introduce an automatic, end-to-end method for recovering the 3D pose and
shape of dogs from monocular internet images. The large variation in shape
between dog breeds, significant occlusion, and the low quality of internet images
make this a challenging problem. We learn a richer prior over shapes than
previous work, which helps regularize parameter estimation. We demonstrate
results on the Stanford Dog dataset, an 'in the wild' dataset of 20,580 dog
images for which we have collected 2D joint and silhouette annotations, split
into training and evaluation sets. In order to capture the large shape variety of
dogs, we show that the natural variation in the 2D dataset is enough to learn a
detailed 3D prior through expectation maximization (EM). As a by-product of
training, we generate a new parameterized model, SMBLD (which includes limb
scaling), which we release alongside our new annotation dataset StanfordExtra to
the research community.
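The expectation-maximization loop described above can be pictured with the toy sketch below: the E-step fits per-image shape parameters under the current Gaussian prior, and the M-step re-estimates the prior's mean and covariance from those fits. The quadratic data term and all dimensions are placeholder assumptions standing in for the paper's 2D joint and silhouette losses.

```python
# Illustrative EM loop for learning a Gaussian shape prior (not the paper's code).
import numpy as np

def fit_shape(target, mu, cov_inv, data_weight=1.0):
    """E-step for one image: MAP estimate of shape parameters under a Gaussian prior.
    With a quadratic data term this has a closed form; the real pipeline would
    minimise a 2D reprojection loss instead."""
    d = len(mu)
    A = data_weight * np.eye(d) + cov_inv
    b = data_weight * target + cov_inv @ mu
    return np.linalg.solve(A, b)

def em_shape_prior(targets, dim=20, iters=10):
    mu = np.zeros(dim)
    cov = np.eye(dim)
    for _ in range(iters):
        cov_inv = np.linalg.inv(cov)
        # E-step: per-image shape fits regularised by the current prior.
        fits = np.stack([fit_shape(t, mu, cov_inv) for t in targets])
        # M-step: update the prior from the fitted shapes.
        mu = fits.mean(axis=0)
        cov = np.cov(fits, rowvar=False) + 1e-6 * np.eye(dim)
    return mu, cov

# Toy usage with synthetic "observations" standing in for per-image evidence.
rng = np.random.default_rng(0)
targets = rng.normal(size=(200, 20))
mu, cov = em_shape_prior(targets)
print(mu.shape, cov.shape)  # (20,) (20, 20)
```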
Accuracy of Anthropometric Measurements by a Video-based 3D Modelling Technique
The use of anthropometric measurements to understand an individual’s body shape and size is an increasingly common approach in health assessment, product design, and biomechanical analysis. Non-contact, three-dimensional (3D) scanning, which can obtain individual human models, has been
widely used as a tool for automatic anthropometric measurement. Recently,
Alldieck et al. (2018) developed a video-based 3D modelling technique, enabling
the generation of individualised human models for virtual reality purposes. As
the technique is based on standard video images, hardware requirements are minimal, increasing the flexibility of the technique’s applications. The aim of this
study was to develop an automated method for acquiring anthropometric measurements from models generated using a video-based 3D modelling technique
and to determine the accuracy of the developed method. Each participant’s anthropometry was measured manually by accredited operators to provide reference values. Sequential images for each participant were captured and used as input data
to generate personal 3D models, using the video-based 3D modelling technique.
Bespoke scripts were developed to obtain corresponding anthropometric data
from generated 3D models. Comparison of the manual measurements with those
extracted using the developed method showed that the developed method is a
potential alternative to anthropometry based on existing commercial solutions.
However, further development, aimed at improving modelling accuracy and processing speed, is still warranted.
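One plausible way such measurements can be extracted automatically from a generated 3D model is sketched below: slice the mesh vertices in a thin band at a chosen height and take the perimeter of the convex hull of the horizontal cross-section. This is an illustrative assumption, not the bespoke scripts used in the study; the function name, band width, and axis convention are hypothetical.

```python
# Illustrative girth measurement from a 3D body model (not the study's scripts).
import numpy as np
from scipy.spatial import ConvexHull

def girth_at_height(vertices, height, band=0.01):
    """vertices: (N, 3) array in metres, y-up; returns a circumference in metres."""
    mask = np.abs(vertices[:, 1] - height) < band
    slice_xy = vertices[mask][:, [0, 2]]       # project the band onto the ground plane
    if len(slice_xy) < 3:
        raise ValueError("no mesh vertices found near the requested height")
    hull = ConvexHull(slice_xy)
    return hull.area                           # for 2D hulls, .area is the perimeter

# Toy usage: an elliptical "torso slice" of radii ~0.15 m x 0.10 m at height 1.2 m.
theta = np.linspace(0, 2 * np.pi, 200)
verts = np.stack([0.15 * np.cos(theta),
                  np.full_like(theta, 1.2),
                  0.10 * np.sin(theta)], axis=1)
print(round(girth_at_height(verts, 1.2), 3))   # ~0.79 m for this ellipse
```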
SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing
While models of 3D clothing learned from real data exist, no method can
predict clothing deformation as a function of garment size. In this paper, we
introduce SizerNet to predict 3D clothing conditioned on human body shape and
garment size parameters, and ParserNet to infer garment meshes and shape under
clothing with personal details in a single pass from an input mesh. SizerNet
makes it possible to estimate and visualize the dressing effect of a garment in various
sizes, and ParserNet makes it possible to edit the clothing of an input mesh directly,
removing the need for scan segmentation, which is a challenging problem in
itself. To learn these models, we introduce the SIZER dataset of clothing size
variation which includes different subjects wearing casual clothing items
in various sizes, totaling approximately 2000 scans. This dataset includes
the scans, registrations to the SMPL model, scans segmented into clothing parts,
garment category and size labels. Our experiments show better parsing accuracy
and size prediction than baseline methods trained on SIZER. The code, model and
dataset will be released for research purposes.
Comment: European Conference on Computer Vision 202
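A minimal sketch of size-conditioned garment prediction in the spirit of the description above is given below; it assumes a simple MLP over body shape parameters and a one-hot size code, with all names and dimensions hypothetical rather than taken from the released SizerNet.

```python
# Illustrative size-conditioned garment prediction (not the released SizerNet).
import torch
import torch.nn as nn


class SizeConditionedGarmentNet(nn.Module):
    def __init__(self, num_betas=10, size_dim=4, num_garment_verts=4000, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_betas + size_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_garment_verts * 3),
        )

    def forward(self, betas, size_code):
        # betas: (B, num_betas) body shape; size_code: (B, size_dim), e.g. one-hot S/M/L/XL.
        x = torch.cat([betas, size_code], dim=-1)
        return self.net(x).view(betas.shape[0], -1, 3)   # per-vertex garment displacements


model = SizeConditionedGarmentNet()
betas = torch.randn(2, 10)
size_code = torch.eye(4)[[1, 3]]             # "M" and "XL" as one-hot codes
print(model(betas, size_code).shape)         # torch.Size([2, 4000, 3])
```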