3D Object Reconstruction from Hand-Object Interactions
Recent advances have enabled 3d object reconstruction approaches using a
single off-the-shelf RGB-D camera. Although these approaches are successful for
a wide range of object classes, they rely on stable and distinctive geometric
or texture features. Many objects like mechanical parts, toys, household or
decorative articles, however, are textureless and characterized by minimalistic
shapes that are simple and symmetric. Existing in-hand scanning systems and 3d
reconstruction techniques fail for such symmetric objects in the absence of
highly distinctive features. In this work, we show that extracting 3d hand
motion for in-hand scanning effectively facilitates the reconstruction of even
featureless and highly symmetric objects and we present an approach that fuses
the rich additional information of hands into a 3d reconstruction pipeline,
significantly contributing to the state-of-the-art of in-hand scanning.
Comment: International Conference on Computer Vision (ICCV) 2015,
http://files.is.tue.mpg.de/dtzionas/In-Hand-Scannin
Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5 mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6 mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach.
A comparative study of breast surface reconstruction for aesthetic outcome assessment
Breast cancer is the most prevalent cancer type in women, and while its
survival rate is generally high, the aesthetic outcome is an increasingly
important factor when evaluating different treatment alternatives. 3D scanning
and reconstruction techniques offer a flexible tool for building detailed and
accurate 3D breast models that can be used both pre-operatively for surgical
planning and post-operatively for aesthetic evaluation. This paper aims at
comparing the accuracy of low-cost 3D scanning technologies with the
significantly more expensive state-of-the-art 3D commercial scanners in the
context of breast 3D reconstruction. We present results from 28 synthetic and
clinical RGBD sequences, including 12 unique patients and an anthropomorphic
phantom demonstrating the applicability of low-cost RGBD sensors to real
clinical cases. Body deformation and homogeneous skin texture pose challenges
to the studied reconstruction systems. Although these should be addressed
appropriately if higher model quality is warranted, we observe that low-cost
sensors are able to obtain valuable reconstructions comparable to the
state-of-the-art within an error margin of 3 mm.
Comment: This paper has been accepted to MICCAI201
Panoptic Vision-Language Feature Fields
Recently, methods have been proposed for 3D open-vocabulary semantic
segmentation. Such methods are able to segment scenes into arbitrary classes
given at run-time using their text description. In this paper, we propose, to
our knowledge, the first algorithm for open-vocabulary panoptic segmentation,
simultaneously performing both semantic and instance segmentation. Our
algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a feature
field of the scene, jointly learning vision-language features and hierarchical
instance features through a contrastive loss function from 2D instance segment
proposals on input frames. Our method achieves performance comparable to
the state-of-the-art closed-set 3D panoptic systems on the HyperSim, ScanNet and
Replica datasets and outperforms current 3D open-vocabulary systems in terms of
semantic segmentation. We additionally ablate our method to demonstrate the
effectiveness of our model architecture. Our code will be available at
https://github.com/ethz-asl/autolabel.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments
A real-world application or setting involves interaction between different
modalities (e.g., video, speech, text). In order to process the multimodal
information automatically and use it for an end application, Multimodal
Representation Learning (MRL) has emerged as an active area of research in
recent times. MRL involves learning reliable and robust representations of
information from heterogeneous sources and fusing them. However, in practice,
the data acquired from different sources are typically noisy. In some extreme
cases, noise of large magnitude can completely alter the semantics of the
data, leading to inconsistencies in the parallel multimodal data. In this paper,
we propose a novel method for multimodal representation learning in a noisy
environment via the generalized product of experts technique. In the proposed
method, we train a separate network for each modality to assess the credibility
of information coming from that modality, and subsequently, the contribution
from each modality is dynamically varied while estimating the joint
distribution. We evaluate our method on two challenging benchmarks from two
diverse domains: multimodal 3D hand-pose estimation and multimodal surgical
video segmentation. We attain state-of-the-art performance on both benchmarks.
Our extensive quantitative and qualitative evaluations show the advantages of
our method compared to previous approaches.
Comment: 11 Pages, Accepted at ICMI 2022 Ora
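To make the fusion step in this abstract concrete, here is a minimal sketch of a generalized product of experts. It rests on assumptions the abstract does not state: each modality outputs a 1-D Gaussian, and a non-negative credibility weight per modality is already available. The name `generalized_poe` and the Gaussian form are illustrative and are not taken from the paper.

```python
def generalized_poe(means, variances, weights):
    """Fuse 1-D Gaussian experts, each raised to a credibility weight.

    Assumption (not from the paper): expert i is N(means[i], variances[i])
    and weights[i] >= 0 is the credibility assigned to that modality.
    The weighted product of Gaussians is again Gaussian, with
    precision = sum_i weights[i] / variances[i].
    """
    if not (len(means) == len(variances) == len(weights)):
        raise ValueError("means, variances and weights must have equal length")
    if any(v <= 0 for v in variances) or any(w < 0 for w in weights):
        raise ValueError("variances must be positive and weights non-negative")

    # Weighted precision of each expert; a weight of 0 removes the expert.
    precisions = [w / v for w, v in zip(weights, variances)]
    total_precision = sum(precisions)
    if total_precision == 0:
        raise ValueError("at least one expert needs a positive weight")

    # Precision-weighted average of the expert means.
    fused_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


# Two modalities that disagree; the second one is judged unreliable.
trusted_only = generalized_poe([0.0, 10.0], [1.0, 1.0], [1.0, 0.0])
equal_trust = generalized_poe([0.0, 10.0], [1.0, 1.0], [1.0, 1.0])
```

With a weight of zero the noisy modality drops out of the joint estimate entirely, while equal weights reduce to the standard product of experts.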
SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes
Online reconstruction and rendering of large-scale indoor scenes is a
long-standing challenge. SLAM-based methods can reconstruct 3D scene geometry
progressively in real time but cannot render photorealistic results. While
NeRF-based methods produce promising novel view synthesis results, their long
offline optimization time and lack of geometric constraints pose challenges to
efficiently handling online input. Inspired by the complementary advantages of
classical 3D reconstruction and NeRF, we thus investigate marrying explicit
geometric representation with NeRF rendering to achieve efficient online
reconstruction and high-quality rendering. We introduce SurfelNeRF, a variant
of neural radiance field which employs a flexible and scalable neural surfel
representation to store geometric attributes and extracted appearance features
from input images. We further extend the conventional surfel-based fusion
scheme to progressively integrate incoming input frames into the reconstructed
global neural scene representation. In addition, we propose a highly efficient
differentiable rasterization scheme for rendering neural surfel radiance
fields, which helps SurfelNeRF achieve speedups in both training and
inference time. Experimental results show that our method
achieves the state-of-the-art 23.82 PSNR and 29.58 PSNR on ScanNet in
feedforward inference and per-scene optimization settings, respectively.
Comment: To appear in CVPR 202
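The PSNR figures quoted in this abstract follow the standard peak signal-to-noise ratio definition. A minimal sketch of that metric is given below, assuming images are flattened into lists of intensities in the range [0, `max_value`]. The function name and input layout are illustrative and do not reflect the authors' evaluation code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Assumption: both images are flat sequences of pixel intensities
    in [0, max_value]. Higher values mean the rendering is closer
    to the reference.
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1 on a [0, 1] scale, so MSE is about 0.01.
example_db = psnr([0.0, 1.0], [0.1, 0.9])
```

Because the scale is logarithmic, a gain of a few dB corresponds to a substantial reduction in mean squared error.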