2,856 research outputs found
Generic 3D Representation via Pose Estimation and Matching
Though a large body of computer vision research has investigated developing
generic semantic representations, efforts towards developing a similar
representation for 3D have been limited. In this paper, we learn a generic 3D
representation through solving a set of foundational proxy 3D tasks:
object-centric camera pose estimation and wide baseline feature matching. Our
method is based upon the premise that by providing supervision over a set of
carefully selected foundational tasks, generalization to novel tasks and
abstraction capabilities can be achieved. We empirically show that the internal
representation of a multi-task ConvNet trained to solve the above core problems
generalizes to novel 3D tasks (e.g., scene layout estimation, object pose
estimation, surface normal estimation) without the need for fine-tuning and
shows traits of abstraction abilities (e.g., cross-modality pose estimation).
In the context of the core supervised tasks, we demonstrate that our
representation achieves state-of-the-art wide baseline feature matching results
without requiring a priori rectification (unlike SIFT and the majority of
learned features). We also show 6DOF camera pose estimation given a pair of
local image patches. The accuracy on both supervised tasks is comparable to
that of humans.
Finally, we contribute a large-scale dataset composed of object-centric street
view scenes along with point correspondences and camera pose information, and
conclude with a discussion on the learned representation and open research
questions. Comment: Published in ECCV16. See the project website
http://3drepresentation.stanford.edu/ and dataset website
https://github.com/amir32002/3D_Street_Vie
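As an illustration of the kind of architecture the abstract describes, the following is a minimal PyTorch sketch of a siamese multi-task ConvNet that takes a pair of image patches and predicts both a 6DOF relative camera pose and a match score. All layer sizes and module names are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a multi-task siamese ConvNet for patch-pair pose
# estimation and matching (illustrative only; layer sizes are assumptions).
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Shared ConvNet that embeds a single image patch."""
    def __init__(self, dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class PoseAndMatchNet(nn.Module):
    """Two heads on top of concatenated patch embeddings:
    6-DOF relative pose regression and match/no-match classification."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = PatchEncoder(dim)
        self.pose_head = nn.Linear(2 * dim, 6)   # translation + rotation (e.g., axis-angle)
        self.match_head = nn.Linear(2 * dim, 1)  # correspondence logit

    def forward(self, patch_a, patch_b):
        z = torch.cat([self.encoder(patch_a), self.encoder(patch_b)], dim=1)
        return self.pose_head(z), self.match_head(z)

# usage: supervise both heads jointly and probe the shared encoder on novel 3D tasks
pose, match_logit = PoseAndMatchNet()(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```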
Learning Unseen Modality Interaction
Multimodal learning assumes all modality combinations of interest are
available during training to learn cross-modal correspondences. In this paper,
we challenge this modality-complete assumption for multimodal learning and
instead strive for generalization to unseen modality combinations during
inference. We pose the problem of unseen modality interaction and introduce a
first solution. It exploits a feature projection module to project the
multidimensional features of different modalities into a common space while
preserving rich information. This allows the information to be accumulated with a
simple summation operation across available modalities. To reduce overfitting
to unreliable modality combinations during training, we further improve the
model learning with pseudo-supervision indicating the reliability of a
modality's prediction. We demonstrate that our approach is effective for
diverse tasks and modalities by evaluating it for multimodal video
classification, robot state regression, and multimedia retrieval. Comment: Under review
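A minimal sketch of the fusion scheme the abstract describes, assuming per-modality projection networks and a shared classifier: each available modality is projected into a common space and the projections are summed, so any subset of modalities can be fused at inference. Dimensions, names, and the module structure are illustrative assumptions, not the paper's code.

```python
# Sketch of projecting per-modality features into a shared space and fusing
# them by summation over whichever modalities happen to be available.
import torch
import torch.nn as nn

class UnseenModalityFusion(nn.Module):
    def __init__(self, modality_dims, common_dim=512, num_classes=10):
        super().__init__()
        # One projection network per modality into the common space.
        self.projections = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, common_dim), nn.ReLU(),
                                nn.Linear(common_dim, common_dim))
            for name, d in modality_dims.items()
        })
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> feature tensor [B, d_m];
        # missing modalities are simply absent, so fusion is a plain sum.
        fused = sum(self.projections[name](x) for name, x in inputs.items())
        return self.classifier(fused)

# usage: train with some modality combinations, infer with an unseen one
model = UnseenModalityFusion({"video": 1024, "audio": 128, "flow": 256})
logits = model({"video": torch.randn(4, 1024), "audio": torch.randn(4, 128)})
```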
Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation
Existing methods of cross-modal domain adaptation for 3D semantic
segmentation predict results only via 2D-3D complementarity that is obtained by
cross-modal feature matching. However, because supervision is lacking in the
target domain, the complementarity is not always reliable, and the results
degrade when the domain gap is large. To address this lack of supervision, we
introduce masked modeling into this task and propose Mx2M, a method that
utilizes masked cross-modality modeling to reduce the large domain gap. Our
Mx2M contains two components. One is the core solution, cross-modal removal and
prediction (xMRP), which makes the Mx2M adapt to various scenarios and provides
cross-modal self-supervision. The other is a new way of cross-modal feature
matching, the dynamic cross-modal filter (DxMF) that ensures the whole method
dynamically uses more suitable 2D-3D complementarity. Evaluating Mx2M on three
domain adaptation scenarios, namely Day/Night, USA/Singapore, and
A2D2/SemanticKITTI, yields large improvements over previous methods on many metrics.
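The following sketch illustrates the masked removal-and-prediction idea in its simplest form: features of one modality are randomly zeroed out and reconstructed from the other modality, giving a self-supervised loss that needs no target-domain labels. The function and predictor names here are assumptions for illustration and are not the Mx2M components (xMRP, DxMF) themselves.

```python
# Illustrative sketch of masked cross-modal removal-and-prediction:
# randomly drop features in one modality and reconstruct them from the other.
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_cross_modal_loss(feat_2d, feat_3d, predict_2d_from_3d, mask_ratio=0.5):
    """feat_2d, feat_3d: [N, C] point-aligned features from the 2D and 3D branches."""
    n = feat_2d.size(0)
    mask = torch.rand(n, device=feat_2d.device) < mask_ratio  # points whose 2D feature is removed
    masked_2d = feat_2d.clone()
    masked_2d[mask] = 0.0                                      # "removal" step
    pred_2d = predict_2d_from_3d(torch.cat([masked_2d, feat_3d], dim=1))  # "prediction" step
    # Self-supervised reconstruction loss on the masked positions only.
    return F.mse_loss(pred_2d[mask], feat_2d[mask].detach())

# usage with a toy predictor (a hypothetical stand-in for the cross-modal head)
predictor = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU(), nn.Linear(128, 64))
loss = masked_cross_modal_loss(torch.randn(1000, 64), torch.randn(1000, 64), predictor)
```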
Simple to Complex Cross-modal Learning to Rank
The heterogeneity gap between different modalities poses a significant
challenge to multimedia information retrieval. Some studies formalize the
cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal
embedding space to measure the cross-modality similarity. However, previous
methods often establish the shared embedding space based on linear mapping
functions which might not be sophisticated enough to reveal more complicated
inter-modal correspondences. Additionally, current studies assume that the
rankings are of equal importance, and thus all rankings are used
simultaneously, or a small number of rankings are selected randomly to train
the embedding space at each iteration. Such strategies, however, suffer from
outliers as well as reduced generalization capability because they lack an
insightful understanding of the human cognitive process. In this paper, we
incorporate self-paced learning with diversity into cross-modal learning to
rank and learn an optimal multi-modal embedding space based on
non-linear mapping functions. This strategy enhances the model's robustness to
outliers and achieves better generalization via training the model gradually
from easy rankings by diverse queries to more complex ones. An efficient
alternating optimization algorithm is employed to solve the resulting
challenging problem, with fast convergence in practice. Extensive experimental results on several
benchmark datasets indicate that the proposed method achieves significant
improvements over the state of the art in this literature. Comment: 14 pages; Accepted by Computer Vision and Image Understanding
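To make the "simple to complex" schedule concrete, here is a toy sketch of self-paced selection with a diversity term: within each query group, easier rankings are admitted first, and the admission threshold shrinks as more samples are taken from the same query, which spreads selection across diverse queries. The threshold formula and schedule are illustrative assumptions, not the paper's exact regularizer.

```python
# Toy sketch of self-paced sample selection with diversity for a ranking loss.
import torch

def self_paced_weights(losses, query_ids, lam, gamma):
    """losses: [N] per-ranking losses; query_ids: [N] query group per ranking.
    Returns binary inclusion weights v in {0, 1}."""
    v = torch.zeros_like(losses)
    for q in query_ids.unique():
        idx = (query_ids == q).nonzero(as_tuple=True)[0]
        # Order this query's rankings from easy to hard.
        order = idx[torch.argsort(losses[idx])]
        for rank, i in enumerate(order):
            # Diversity-regularized threshold: each extra sample drawn from the
            # same query must be easier to be selected.
            threshold = lam + gamma / (rank + 1) ** 0.5
            v[i] = float(losses[i] < threshold)
    return v

# usage: training minimizes (weights * losses).sum(); lam grows each epoch so
# harder rankings enter gradually ("simple to complex").
losses = torch.rand(8) * 2
weights = self_paced_weights(losses, torch.tensor([0, 0, 0, 1, 1, 1, 2, 2]), lam=0.8, gamma=0.5)
```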
3DG-STFM: 3D Geometric Guided Student-Teacher Feature Matching
We tackle the essential task of finding dense visual correspondences between
a pair of images. This is a challenging problem due to various factors such as
poor texture, repetitive patterns, illumination variation, and motion blur in
practical scenarios. In contrast to methods that use dense correspondence
ground truths as direct supervision for local feature matching training, we
train 3DG-STFM: a multi-modal matching model (Teacher) enforces depth
consistency under 3D dense correspondence supervision and transfers its
knowledge to a 2D unimodal matching model (Student). Both teacher and student
models consist of two transformer-based matching modules that obtain dense
correspondences in a coarse-to-fine manner. The teacher model guides the
student model to learn RGB-induced depth information for the matching purpose
on both coarse and fine branches. We also evaluate 3DG-STFM on a model
compression task. To the best of our knowledge, 3DG-STFM is the first
student-teacher learning method for the local feature matching task. The
experiments show that our method outperforms state-of-the-art methods on indoor
and outdoor camera pose estimation and on homography estimation. Code
is available at: https://github.com/Ryan-prime/3DG-STFM
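A minimal sketch of the student-teacher transfer the abstract describes, assuming the teacher's coarse matching scores are distilled into the RGB-only student with a temperature-scaled KL divergence, as in standard knowledge distillation; the function names and tensor shapes are placeholders rather than the released 3DG-STFM code.

```python
# Sketch of distilling a multi-modal (RGB-D) teacher's matching distribution
# into an RGB-only student via temperature-scaled KL divergence.
import torch
import torch.nn.functional as F

def matching_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """student_logits, teacher_logits: [N, M] match scores over coarse-level
    candidate correspondences between the two images."""
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)    # soft targets from the teacher
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as in standard distillation.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

# usage on dummy coarse matching logits
loss = matching_distillation_loss(torch.randn(64, 256), torch.randn(64, 256))
```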
Audio-visual Self-Supervised Representation Learning in-the-wild
National Technical University of Athens--Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Studies Programme (D.P.M.S.) "Data Science and Machine Learning"