1,461 research outputs found
Unsupervised Domain Adaptation through Inter-Modal Rotation for RGB-D Object Recognition
Unsupervised Domain Adaptation (DA) exploits the supervision of a label-rich source dataset to make predictions on an unlabeled target dataset by aligning the two data distributions. In robotics, DA is used to take advantage of automatically generated synthetic data, that come with 'free' annotation, to make effective predictions on real data. However, existing DA methods are not designed to cope with the multi-modal nature of RGB-D data, which are widely used in robotic vision. We propose a novel RGB-D DA method that reduces the synthetic-to-real domain shift by exploiting the inter-modal relation between the RGB and depth image. Our method consists of training a convolutional neural network to solve, in addition to the main recognition task, the pretext task of predicting the relative rotation between the RGB and depth image. To evaluate our method and encourage further research in this area, we define two benchmark datasets for object categorization and instance recognition. With extensive experiments, we show the benefits of leveraging the inter-modal relations for RGB-D DA. The code is available at: 'https://github.com/MRLoghmani/relative-rotation'
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises the cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. The comprehensive problem-oriented review
of the advances in transfer learning with respect to the problem has not only
revealed the challenges in transfer learning for visual recognition, but also
the problems (e.g. eight of the seventeen problems) that have been scarcely
studied. This survey not only presents an up-to-date technical review for
researchers, but also a systematic approach and a reference for a machine
learning practitioner to categorise a real problem and to look up for a
possible solution accordingly
Multi-Modal RGB-D Scene Recognition Across Domains
Scene recognition is one of the basic problems in computer vision research with extensive applications in robotics. When available, depth images provide helpful geometric cues that complement the RGB texture information and help to identify discriminative scene image features.
Depth sensing technology developed fast in the last years and a great variety of 3D cameras have been introduced, each with different acquisition properties. However, those properties are often neglected when targeting big data collections, so multi-modal images are gathered disregarding their original nature. In this work, we put under the spotlight the existence of a possibly severe domain shift issue within multi-modality scene recognition datasets. As a consequence, a scene classification model trained on one camera may not generalize on data from a different camera, only providing a low recognition performance. Starting from the well-known SUN RGB-D dataset, we designed an experimental testbed to study this problem and we use it to benchmark the performance of existing methods.
Finally, we introduce a novel adaptive scene recognition approach that leverages self-supervised translation between modalities. Indeed, learning to go from RGB to depth and vice-versa is an unsupervised procedure that can be trained jointly on data of multiple cameras and may help to bridge the gap among the extracted feature distributions. Our experimental results confirm the effectiveness of the proposed approach
Multi-Modal Domain Adaptation for Fine-Grained Action Recognition
Fine-grained action recognition datasets exhibit environmental bias, where
multiple video sequences are captured from a limited number of environments.
Training a model in one environment and deploying in another results in a drop
in performance due to an unavoidable domain shift. Unsupervised Domain
Adaptation (UDA) approaches have frequently utilised adversarial training
between the source and target domains. However, these approaches have not
explored the multi-modal nature of video within each domain. In this work we
exploit the correspondence of modalities as a self-supervised alignment
approach for UDA in addition to adversarial alignment.
We test our approach on three kitchens from our large-scale dataset,
EPIC-Kitchens, using two modalities commonly employed for action recognition:
RGB and Optical Flow. We show that multi-modal self-supervision alone improves
the performance over source-only training by 2.4% on average. We then combine
adversarial training with multi-modal self-supervision, showing that our
approach outperforms other UDA methods by 3%.Comment: Accepted to CVPR 2020 for an oral presentatio
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research
- …