A Light Touch Approach to Teaching Transformers Multi-view Geometry
Transformers are powerful visual learners, in large part due to their
conspicuous lack of manually-specified priors. This flexibility can be
problematic in tasks that involve multiple-view geometry, due to the
near-infinite possible variations in 3D shapes and viewpoints (requiring
flexibility), and the precise nature of projective geometry (obeying rigid
laws). To resolve this conundrum, we propose a "light touch" approach, guiding
visual Transformers to learn multiple-view geometry but allowing them to break
free when needed. We achieve this by using epipolar lines to guide the
Transformer's cross-attention maps, penalizing attention values outside the
epipolar lines and encouraging higher attention along these lines since they
contain geometrically plausible matches. Unlike previous methods, our proposal
does not require any camera pose information at test-time. We focus on
pose-invariant object instance retrieval, where standard Transformer networks
struggle due to the large differences in viewpoint between query and retrieved
images. Experimentally, our method outperforms state-of-the-art approaches at
object retrieval, without needing pose information at test-time.
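To make the guidance mechanism concrete, here is a minimal sketch of how an epipolar attention penalty of this kind can be implemented, assuming a fundamental matrix is known at training time; the function name and the band width are illustrative, not the paper's actual code.

```python
import torch

def epipolar_attention_loss(attn, coords_a, coords_b, F, band=2.0):
    """Penalize cross-attention mass that falls far from the epipolar line.

    attn:     (Na, Nb) cross-attention map from image-A tokens to image-B tokens
    coords_a: (Na, 3) homogeneous pixel coordinates of image-A tokens
    coords_b: (Nb, 3) homogeneous pixel coordinates of image-B tokens
    F:        (3, 3) fundamental matrix mapping A-points to B-lines (assumed known)
    band:     distance in pixels within which a token counts as "on" the line
    """
    lines = coords_a @ F.T                       # (Na, 3) epipolar lines l = F x
    # point-to-line distance |l . x| / sqrt(a^2 + b^2)
    num = torch.abs(lines @ coords_b.T)          # (Na, Nb)
    den = torch.linalg.norm(lines[:, :2], dim=1, keepdim=True)
    dist = num / (den + 1e-8)
    off_line = (dist > band).float()
    # suppressing off-line mass implicitly encourages attention along the line
    return (attn * off_line).sum(dim=1).mean()
```

Because the term is a soft penalty rather than a hard mask, the network can still "break free" of the epipolar prior when the data demands it.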
PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation
Recently, the vision transformer and its variants have played an increasingly
important role in both monocular and multi-view human pose estimation.
Considering image patches as tokens, transformers can model the global
dependencies within the entire image or across images from other views.
However, global attention is computationally expensive. As a consequence, it is
difficult to scale up these transformer-based methods to high-resolution
features and many views.
In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D
human pose estimation, which locates a rough human mask and performs
self-attention only within the selected tokens. Furthermore, we extend our PPT to
multi-view human pose estimation. Building upon PPT, we propose a new cross-view
fusion strategy, called human area fusion, which considers all human foreground
pixels as corresponding candidates. Experimental results on COCO and MPII
demonstrate that our PPT can match the accuracy of previous pose transformer
methods while reducing computation. Moreover, experiments on Human3.6M and
Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from
multiple views and achieve new state-of-the-art results.
Comment: ECCV 2022. Code is available at https://github.com/HowieMa/PP
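For intuition, the following is a simplified sketch of token pruning followed by self-attention on the kept subset. The linear saliency head used to score tokens is a stand-in assumption; the actual PPT derives its keep decisions differently.

```python
import torch
import torch.nn as nn

class TokenPruning(nn.Module):
    """Keep only the tokens most likely to lie on the human body, then
    attend within that subset. A simplified sketch, not PPT's real code."""

    def __init__(self, dim, keep_ratio=0.3):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # hypothetical saliency head
        # dim must be divisible by num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                           # tokens: (B, N, dim)
        B, N, _ = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)          # (B, N)
        idx = scores.topk(k, dim=1).indices              # (B, k) kept tokens
        kept = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        out, _ = self.attn(kept, kept, kept)             # attention on k << N tokens
        return out, idx                                  # caller scatters back if needed
```

Since self-attention cost grows quadratically in the token count, attending over k kept tokens instead of all N reduces that term by roughly (N/k)^2.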
Structured Epipolar Matcher for Local Feature Matching
Local feature matching is challenging due to textureless and repetitive
patterns. Existing methods focus on appearance features and global
interaction and matching, while the importance of geometric priors in local
feature matching has not been fully exploited. Different from these methods, in
this paper we delve into the importance of geometric priors and propose the
Structured Epipolar Matcher (SEM) for local feature matching, which can
leverage geometric information in an iterative matching manner. The proposed
model enjoys several merits. First, our proposed Structured Feature Extractor
can model the relative positional relationship between pixels and
high-confidence anchor points. Second, our proposed Epipolar Attention and
Matching can filter out irrelevant areas by utilizing the epipolar constraint.
Extensive experimental results on five standard benchmarks demonstrate the
superior performance of our SEM compared to state-of-the-art methods.
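The epipolar constraint that SEM leverages to filter out irrelevant areas can be illustrated with the standard first-order (Sampson) approximation of epipolar error, a common way to discard candidate matches that violate the constraint; this is a generic sketch, not SEM's code.

```python
import torch

def sampson_filter(pts_a, pts_b, F, thresh=3.0):
    """Keep only candidate matches consistent with the epipolar geometry.

    pts_a, pts_b: (N, 3) homogeneous candidate correspondences in images A and B
    F:            (3, 3) fundamental matrix
    thresh:       inlier threshold in pixels
    """
    Fx = pts_a @ F.T            # (N, 3) epipolar lines F x_a in image B
    Ftx = pts_b @ F             # (N, 3) epipolar lines F^T x_b in image A
    num = torch.einsum('ni,ni->n', pts_b, Fx) ** 2        # (x_b^T F x_a)^2
    den = Fx[:, 0]**2 + Fx[:, 1]**2 + Ftx[:, 0]**2 + Ftx[:, 1]**2
    sampson = num / (den + 1e-8)                          # approx. squared distance
    return sampson < thresh**2                            # boolean keep-mask
```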
DVGaze: Dual-View Gaze Estimation
Gaze estimation methods estimate gaze from facial appearance with a single
camera. However, due to the limited view of a single camera, the captured
facial appearance cannot provide complete facial information, which
complicates the gaze estimation problem. Camera hardware has advanced rapidly
in recent years; dual cameras are now affordable and have been integrated into
many devices. This development suggests that we can further improve gaze estimation
performance with dual-view gaze estimation. In this paper, we propose a
dual-view gaze estimation network (DV-Gaze). DV-Gaze estimates dual-view gaze
directions from a pair of images. We first propose a dual-view interactive
convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information
during convolution at multiple feature scales. It fuses dual-view features
along epipolar lines and complements the original features with the fused
ones. We further propose a dual-view transformer to estimate gaze from
dual-view features. Camera poses are encoded to indicate the position
information in the transformer. We also consider the geometric relation between
dual-view gaze directions and propose a dual-view gaze consistency loss for
DV-Gaze. DV-Gaze achieves state-of-the-art performance on ETH-XGaze and EVE
datasets. Our experiments also demonstrate the potential of dual-view gaze
estimation. We release code at https://github.com/yihuacheng/DVGaze.
Comment: ICCV 2023
Two-View Geometry Scoring Without Correspondences
Camera pose estimation for two-view geometry traditionally relies on RANSAC.
Normally, a multitude of image correspondences leads to a pool of proposed
hypotheses, which are then scored to find a winning model. The inlier count is
generally regarded as a reliable indicator of "consensus". We examine this
scoring heuristic, and find that it favors disappointing models under certain
circumstances. As a remedy, we propose the Fundamental Scoring Network (FSNet),
which infers a score for a pair of overlapping images and any proposed
fundamental matrix. It does not rely on sparse correspondences, but rather
embodies a two-view geometry model through an epipolar attention mechanism that
predicts the pose error of the two images. FSNet can be incorporated into
traditional RANSAC loops. We evaluate FSNet on fundamental and essential matrix
estimation on indoor and outdoor datasets, and establish that FSNet can
successfully identify good poses for pairs of images with few or unreliable
correspondences. Moreover, we show that naively combining FSNet with the
MAGSAC++ scoring approach achieves state-of-the-art results.
RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Learning-based multi-view stereo (MVS) has so far centered on 3D
convolution on cost volumes. Due to the high computation and memory consumption
of 3D CNN, the resolution of output depth is often considerably limited.
Different from most existing works dedicated to adaptive refinement of cost
volumes, we opt to directly optimize the depth value along each camera ray,
mimicking the range finding of a laser scanner. This reduces the MVS problem to
ray-based depth optimization, which is much more lightweight than full cost
volume optimization. In particular, we propose RayMVSNet which learns
sequential prediction of a 1D implicit field along each camera ray with the
zero-crossing point indicating scene depth. This sequential modeling, conducted
based on transformer features, essentially learns the epipolar line search in
traditional multi-view stereo. We devise a multi-task learning scheme for
better optimization convergence and depth accuracy, and find that the
monotonicity of the SDF along each ray greatly benefits depth estimation. Our method
ranks first among all previous learning-based methods on both the DTU and
Tanks & Temples datasets, achieving an overall reconstruction score of 0.33mm on
DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce
high-quality depth estimation and point cloud reconstruction in challenging
scenarios such as objects/scenes with non-textured surface, severe occlusion,
and highly varying depth range. Further, we propose RayMVSNet++ to enhance
contextual feature aggregation for each ray by designing an attentional
gating unit to select semantically relevant neighboring rays within the local
frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on
the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces
accurate results on the two subsets of textureless regions and large depth
variation.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: substantial text overlap with arXiv:2204.0132
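The zero-crossing readout that turns a 1D implicit field into a depth value can be sketched as follows, assuming (as the monotonicity observation suggests) that the predicted SDF decreases from positive to negative along each ray; tensor shapes and the interpolation are illustrative.

```python
import torch

def depth_from_ray_sdf(sdf_vals, depths):
    """Recover per-ray depth as the zero-crossing of a 1D signed distance
    field, via linear interpolation between the bracketing samples.

    sdf_vals: (R, S) predicted SDF at S samples along each of R rays,
              assumed to cross from positive to negative exactly once
    depths:   (R, S) depth of each sample along its ray
    """
    # first index where the sign flips from + to -
    sign_flip = (sdf_vals[:, :-1] > 0) & (sdf_vals[:, 1:] <= 0)
    idx = sign_flip.float().argmax(dim=1)                # (R,) first flip
    r = torch.arange(sdf_vals.size(0))
    s0, s1 = sdf_vals[r, idx], sdf_vals[r, idx + 1]      # bracketing SDF values
    d0, d1 = depths[r, idx], depths[r, idx + 1]          # bracketing depths
    t = s0 / (s0 - s1 + 1e-8)            # fraction of the interval to the zero
    return d0 + t * (d1 - d0)            # (R,) interpolated scene depth
```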
SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction
We propose SparseFusion, a sparse view 3D reconstruction approach that
unifies recent advances in neural rendering and probabilistic image generation.
Existing approaches typically build on neural rendering with re-projected
features but fail to generate unseen regions or handle uncertainty under large
viewpoint changes. Alternative methods treat this as a (probabilistic) 2D
synthesis task, and while they can generate plausible 2D images, they do not
infer a consistent underlying 3D representation. However, we find that this trade-off between
3D consistency and probabilistic image generation does not need to exist. In
fact, we show that geometric consistency and generative inference can be
complementary in a mode-seeking behavior. By distilling a 3D consistent scene
representation from a view-conditioned latent diffusion model, we are able to
recover a plausible 3D representation whose renderings are both accurate and
realistic. We evaluate our approach across 51 categories in the CO3D dataset
and show that it outperforms existing methods, in both distortion and
perception metrics, for sparse-view novel view synthesis.
Comment: project page: https://sparsefusion.github.io/
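As a rough illustration of distilling a view-conditioned diffusion model into a 3D representation, the sketch below performs one score-distillation-style update; `renderer` and `diffusion` are assumed interfaces, and the paper's actual distillation objective differs in detail.

```python
import torch

def distillation_step(renderer, diffusion, cameras, optimizer):
    """One update pushing a differentiable 3D scene toward what a
    view-conditioned diffusion model considers plausible. Sketch only;
    `renderer` and `diffusion` are hypothetical interfaces.
    """
    cam = cameras[torch.randint(len(cameras), (1,)).item()]
    img = renderer(cam)                          # (1, 3, H, W), differentiable render
    t = torch.randint(1, 1000, (1,))             # random diffusion timestep
    noise = torch.randn_like(img)
    noisy = diffusion.add_noise(img, noise, t)   # forward diffusion q(x_t | x_0)
    eps_pred = diffusion.predict_noise(noisy, t, cam)  # view-conditioned denoiser
    # score-distillation-style gradient: (predicted noise - injected noise)
    grad = (eps_pred - noise).detach()
    loss = (grad * img).sum()                    # d loss / d img == grad
    optimizer.zero_grad()
    loss.backward()                              # flows into the 3D representation
    optimizer.step()
```

Repeating this over many random views is what drives the mode-seeking behavior the abstract describes: the 3D representation settles on a single plausible scene consistent with the diffusion model's predictions.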