TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
In this paper, we introduce neural texture learning for 6D object pose
estimation from synthetic data and a few unlabelled real images. Our major
contribution is a novel learning scheme that removes the drawbacks of previous
works, namely the strong dependency on co-modalities or additional refinement,
which were previously necessary to provide training signals for convergence. We
formulate this scheme as two sub-optimisation problems on
texture learning and pose learning. We separately learn to predict realistic
texture of objects from real image collections and learn pose estimation from
pixel-perfect synthetic data. Combining these two capabilities then allows us to
synthesise photorealistic novel views that supervise the pose estimator with
accurate geometry. To alleviate pose noise and segmentation imperfection
present during the texture learning phase, we propose a surfel-based
adversarial training loss together with texture regularisation from synthetic
data. We demonstrate that the proposed approach significantly outperforms the
recent state-of-the-art methods without ground-truth pose annotations and
generalises substantially better to unseen scenes. Remarkably, our scheme
improves the adopted pose estimators substantially even when they are
initialised with much inferior performance.
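To make the two sub-optimisation problems concrete, the following is a minimal Python/PyTorch sketch of the alternating scheme: texture learning from real image colours with poses held fixed, and pose learning supervised by views synthesised under known poses. The TextureNet and PoseNet modules, shapes and losses are illustrative placeholders, not the authors' implementation.
    import torch
    import torch.nn as nn

    class TextureNet(nn.Module):      # predicts realistic texture (placeholder)
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
        def forward(self, surface_pts):            # surface samples -> RGB
            return torch.sigmoid(self.net(surface_pts))

    class PoseNet(nn.Module):         # regresses a 6D pose (placeholder)
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(128, 6)           # image features -> axis-angle + translation
        def forward(self, feats):
            return self.net(feats)

    texture_net, pose_net = TextureNet(), PoseNet()
    opt_tex = torch.optim.Adam(texture_net.parameters(), lr=1e-4)
    opt_pose = torch.optim.Adam(pose_net.parameters(), lr=1e-4)

    for step in range(100):
        # (1) Texture learning: fit object texture to real image colours (poses fixed).
        real_rgb = torch.rand(32, 3)               # dummy pixel colours from real images
        loss_tex = (texture_net(torch.rand(32, 3)) - real_rgb).abs().mean()
        opt_tex.zero_grad(); loss_tex.backward(); opt_tex.step()

        # (2) Pose learning: supervise the estimator with views synthesised under known poses.
        feats = torch.rand(32, 128)                # dummy features of photorealistic renderings
        render_poses = torch.rand(32, 6)           # poses used to synthesise those views
        loss_pose = (pose_net(feats) - render_poses).pow(2).mean()
        opt_pose.zero_grad(); loss_pose.backward(); opt_pose.step()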
Denoising Diffusion via Image-Based Rendering
Generating 3D scenes is a challenging open problem, which requires
synthesizing plausible content that is fully consistent in 3D space. While
recent methods such as neural radiance fields excel at view synthesis and 3D
reconstruction, they cannot synthesize plausible details in unobserved regions
since they lack a generative capability. Conversely, existing generative
methods are typically not capable of reconstructing detailed, large-scale
scenes in the wild, as they use limited-capacity 3D scene representations,
require aligned camera poses, or rely on additional regularizers. In this work,
we introduce the first diffusion model able to perform fast, detailed
reconstruction and generation of real-world 3D scenes. To achieve this, we make
three contributions. First, we introduce a new neural scene representation,
IB-planes, that can efficiently and accurately represent large 3D scenes,
dynamically allocating more capacity as needed to capture details visible in
each image. Second, we propose a denoising-diffusion framework to learn a prior
over this novel 3D scene representation, using only 2D images without the need
for any additional supervision signal such as masks or depths. This supports 3D
reconstruction and generation in a unified architecture. Third, we develop a
principled approach to avoid trivial 3D solutions when integrating the
image-based rendering with the diffusion model, by dropping out representations
of some images. We evaluate the model on several challenging datasets of real
and synthetic images, and demonstrate superior results on generation, novel
view synthesis and 3D reconstruction.
Comment: Accepted at ICLR 2024. Project page: https://anciukevicius.github.io/generative-image-based-renderin
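The third contribution, dropping out per-image representations to avoid trivial 3D solutions, can be illustrated with the short Python sketch below. The function name, the (N, C, H, W) plane layout and the dropout policy are assumptions for illustration, not the paper's code.
    import torch

    def drop_image_representations(planes, target_idx, p_drop=0.5):
        # planes: (N, C, H, W) per-image feature planes from an encoder (assumed layout).
        n = planes.shape[0]
        keep = torch.rand(n) > p_drop              # randomly drop some source views
        keep[target_idx] = False                   # the target view never conditions its own rendering
        if not keep.any():                         # always keep at least one source view
            keep[(target_idx + 1) % n] = True
        return planes * keep.view(n, 1, 1, 1).float()

    # usage sketch: planes = encoder(images); planes = drop_image_representations(planes, target_idx)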
SPARF: Neural Radiance Fields from Sparse and Noisy Poses
Neural Radiance Field (NeRF) has recently emerged as a powerful
representation to synthesize photorealistic novel views. While showing
impressive performance, it relies on the availability of dense input views with
highly accurate camera poses, thus limiting its application in real-world
scenarios. In this work, we introduce Sparse Pose Adjusting Radiance Field
(SPARF), to address the challenge of novel-view synthesis given only a few
wide-baseline input images (as few as 3) with noisy camera poses. Our approach
exploits multi-view geometry constraints in order to jointly learn the NeRF and
refine the camera poses. By relying on pixel matches extracted between the
input views, our multi-view correspondence objective enforces the optimized
scene and camera poses to converge to a global and geometrically accurate
solution. Our depth consistency loss further encourages the reconstructed scene
to be consistent from any viewpoint. Our approach sets a new state of the art
in the sparse-view regime on multiple challenging datasets.
Comment: Code will be released upon publication.
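A hedged Python sketch of a multi-view correspondence objective in this spirit: a pixel match between two views is back-projected with the rendered depth and reprojected into the other view, and the reprojection error is penalised. The pinhole camera model, tensor shapes and function names are assumptions, not the released SPARF code.
    import torch

    def reproject(u_i, depth_i, K, T_i_to_j):
        # u_i: (N, 2) pixels in view i, depth_i: (N,) depths rendered by the NeRF,
        # K: (3, 3) intrinsics, T_i_to_j: (4, 4) relative camera pose from view i to view j.
        ones = torch.ones(u_i.shape[0], 1)
        rays = (torch.inverse(K) @ torch.cat([u_i, ones], dim=1).T).T   # back-project pixels
        X_i = rays * depth_i.unsqueeze(1)                               # 3D points in camera i
        X_j = (T_i_to_j @ torch.cat([X_i, ones], dim=1).T).T[:, :3]     # move into camera j
        proj = (K @ X_j.T).T
        return proj[:, :2] / proj[:, 2:3]                               # perspective divide

    def correspondence_loss(u_i, u_j_matched, depth_i, K, T_i_to_j):
        # distance between the reprojected match and the extracted pixel match in view j
        return (reproject(u_i, depth_i, K, T_i_to_j) - u_j_matched).norm(dim=1).mean()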
View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection
For autonomous vehicles, driving safely is highly dependent on the capability
to correctly perceive the environment in 3D space, hence the task of 3D object
detection represents a fundamental aspect of perception. While 3D sensors
deliver accurate metric perception, monocular approaches enjoy cost and
availability advantages that are valuable in a wide range of applications.
Unfortunately, training monocular methods requires a vast amount of annotated
data. Interestingly, self-supervised approaches have recently been successfully
applied to ease the training process and unlock access to widely available
unlabelled data. While related research leverages different priors including
LIDAR scans and stereo images, such priors again limit usability. Therefore, in
this work, we propose a novel approach to self-supervise 3D object detection
purely from RGB sequences, leveraging multi-view constraints and weak
labels. Our experiments on the KITTI 3D dataset demonstrate performance on par
with state-of-the-art self-supervised methods that use LIDAR scans or stereo
images.
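A minimal Python sketch of one such multi-view constraint, assuming a static object and known ego-motion between two frames; the function and variable names are illustrative, not the paper's implementation.
    import torch

    def box_center_consistency(center_t, center_t1, T_t_to_t1):
        # center_t, center_t1: (3,) predicted 3D box centres in the camera frames at t and t+1;
        # T_t_to_t1: (4, 4) relative camera pose (ego-motion) from frame t to frame t+1.
        c_h = torch.cat([center_t, torch.ones(1)])
        c_warped = (T_t_to_t1 @ c_h)[:3]           # express the frame-t prediction in frame t+1
        return (c_warped - center_t1).norm()       # small residual expected for a static object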
DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field
Reconstructing hand-held objects from a single RGB image is an important and
challenging problem. Existing works utilizing Signed Distance Fields (SDF)
reveal limitations in comprehensively capturing the complex hand-object
interactions, since SDF is only reliable within the proximity of the target,
and hence, infeasible to simultaneously encode local hand and object cues. To
address this issue, we propose DDF-HO, a novel approach leveraging Directed
Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in
3D space, consisting of an origin and a direction, to corresponding DDF values,
including a binary visibility signal indicating whether the ray intersects the
object, and a distance value measuring the distance from the origin to the
target surface along the given direction. We randomly sample multiple rays and
collect local-to-global geometric features for them by introducing a novel 2D ray-based feature
aggregation scheme and a 3D intersection-aware hand pose embedding, combining
2D-3D features to model hand-object interactions. Extensive experiments on
synthetic and real-world datasets demonstrate that DDF-HO consistently
outperforms all baseline methods by a large margin, especially under Chamfer
Distance, where it improves by about 80%. Code and trained models will be
released soon.
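To illustrate the (origin, direction) -> (visibility, distance) mapping described above, here is a toy analytic directed distance field for a sphere in Python; it only conveys the interface of a DDF, not the learned DDF-HO network.
    import numpy as np

    def sphere_ddf(origin, direction, center=np.zeros(3), radius=1.0):
        # Returns (visibility, distance) for a ray against a sphere.
        d = direction / np.linalg.norm(direction)
        oc = origin - center
        b = np.dot(oc, d)
        c = np.dot(oc, oc) - radius ** 2
        disc = b * b - c                            # ray-sphere intersection discriminant
        if disc < 0:
            return 0.0, np.inf                      # visibility 0: the ray misses the object
        t = -b - np.sqrt(disc)
        if t < 0:
            t = -b + np.sqrt(disc)                  # origin lies inside the sphere
        if t < 0:
            return 0.0, np.inf                      # intersection is behind the origin
        return 1.0, t                               # visibility 1 and distance along the ray

    # e.g. sphere_ddf(np.array([0., 0., -3.]), np.array([0., 0., 1.])) returns (1.0, 2.0)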