Self-Supervised Deep Visual Odometry with Online Adaptation
Self-supervised VO methods have shown great success in jointly estimating
camera pose and depth from videos. However, like most data-driven methods,
existing VO networks suffer from a notable decrease in performance when
confronted with scenes different from the training data, which makes them
unsuitable for practical applications. In this paper, we propose an online
meta-learning algorithm to enable VO networks to continuously adapt to new
environments in a self-supervised manner. The proposed method utilizes
convolutional long short-term memory (convLSTM) to aggregate rich
spatial-temporal information in the past. The network is able to memorize and
learn from its past experience for better estimation and fast adaptation to the
current frame. When running VO in the open world, to deal with the changing
environment, we propose an online feature alignment method that aligns feature
distributions across time, allowing our VO network to adapt seamlessly to
different environments. Extensive experiments on unseen outdoor scenes,
virtual-to-real-world transfer, and outdoor-to-indoor environments demonstrate
that our method consistently outperforms state-of-the-art self-supervised VO
baselines by a considerable margin.
Comment: Accepted by CVPR 2020 (oral).
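As a rough illustration of the feature-alignment idea in this abstract, the following minimal NumPy sketch aligns feature distributions across time by penalizing the gap between their first- and second-order statistics. The function name and the moment-matching loss form are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def feature_alignment_loss(past_feats, current_feats):
    """Moment-matching stand-in for online feature alignment: penalize the
    gap between the mean and variance of feature maps extracted at
    different times."""
    mu_gap = np.mean(past_feats, axis=0) - np.mean(current_feats, axis=0)
    var_gap = np.var(past_feats, axis=0) - np.var(current_feats, axis=0)
    return float(np.mean(mu_gap ** 2) + np.mean(var_gap ** 2))

rng = np.random.default_rng(0)
past = rng.normal(0.0, 1.0, size=(64, 128))     # features from earlier frames
shifted = rng.normal(0.5, 1.5, size=(64, 128))  # features after a domain shift
aligned = rng.normal(0.0, 1.0, size=(64, 128))  # features matching the past

# The loss is larger under a distribution shift than without one.
assert feature_alignment_loss(past, shifted) > feature_alignment_loss(past, aligned)
```

Minimizing such a loss with respect to the current features would pull their statistics toward those of past frames, which is the intuition behind aligning distributions "at different times".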
PoliTO-IIT Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
In this report, we describe the technical details of our submission to the
EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action
Recognition. To tackle the domain shift that arises under the UDA setting, we
first exploited a recent Domain Generalization (DG) technique called Relative
Norm Alignment (RNA). It consists of designing a model that generalizes well
to any unseen domain, regardless of whether target data can be accessed at
training time. Then, in a second phase, we extended the approach to work on
unlabelled target data, allowing the model to adapt to the target distribution
in an unsupervised fashion. For this purpose, we included in our framework
existing UDA algorithms, such as Temporal Attentive Adversarial Adaptation
Network (TA3N), jointly with new multi-stream consistency losses, namely
Temporal Hard Norm Alignment (T-HNA) and Min-Entropy Consistency (MEC). Our
submission (entry 'plnet') is visible on the leaderboard and it achieved the
1st position for 'verb', and the 3rd position for both 'noun' and 'action'.
Comment: 3rd place in the 2021 EPIC-KITCHENS-100 Unsupervised Domain
Adaptation Challenge for Action Recognition.
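The norm-balancing idea behind RNA can be sketched with NumPy. This is a simplified illustration, not the report's implementation: the loss form (squared deviation of the mean-norm ratio from 1) and the stream names are assumptions:

```python
import numpy as np

def relative_norm_alignment_loss(stream_a, stream_b):
    """Sketch of a relative-norm-alignment-style loss: drive the ratio of the
    mean L2 feature norms of two modality streams toward 1, so neither
    stream dominates by scale alone."""
    norm_a = np.linalg.norm(stream_a, axis=1).mean()
    norm_b = np.linalg.norm(stream_b, axis=1).mean()
    return float((norm_a / norm_b - 1.0) ** 2)

rng = np.random.default_rng(1)
rgb = rng.normal(size=(32, 256))          # e.g. RGB-stream features
flow = 3.0 * rng.normal(size=(32, 256))   # a second stream at a larger scale

# Unbalanced norms give a large loss; rescaling the stream removes it.
assert relative_norm_alignment_loss(rgb, flow) > relative_norm_alignment_loss(rgb, flow / 3.0)
```

Because the loss depends only on feature norms, it needs no target labels, which is why a technique of this kind fits the generalization phase before any unsupervised adaptation.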
Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting
For person re-identification, existing deep networks often focus on
representation learning. However, without transfer learning, the learned model
remains fixed as is and cannot adapt to various unseen scenarios.
In this paper, beyond representation learning, we consider how to formulate
person image matching directly in deep feature maps. We treat image matching as
finding local correspondences in feature maps, and construct query-adaptive
convolution kernels on the fly to achieve local matching. In this way, the
matching process and results are interpretable, and this explicit matching is
more generalizable than representation features to unseen scenarios, such as
unknown misalignments, pose or viewpoint changes. To facilitate end-to-end
training of this architecture, we further build a class memory module to cache
feature maps of the most recent samples of each class, so as to compute image
matching losses for metric learning. Through direct cross-dataset evaluation,
the proposed Query-Adaptive Convolution (QAConv) method gains large
improvements over popular learning methods (about 10%+ mAP), and achieves
comparable results to many transfer learning methods. In addition, a
model-free temporal co-occurrence based score weighting method called TLift is
proposed, which further improves performance, achieving state-of-the-art
results in cross-dataset person re-identification. Code is available at
https://github.com/ShengcaiLiao/QAConv.
Comment: This is the ECCV 2020 version, including the appendix.
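The query-adaptive local-matching idea can be illustrated with a toy NumPy sketch: each query feature-map location acts as a 1x1 kernel correlated against every gallery location, and the best response per query location is kept. This is a simplified stand-in for QAConv's kernel construction (the scoring function and toy feature maps below are assumptions):

```python
import numpy as np

def qaconv_score(query_map, gallery_map):
    """Sketch of query-adaptive local matching: treat each query location as
    a 1x1 convolution kernel over the gallery feature map, keep the best
    local response per query location, and average into a match score."""
    # query_map, gallery_map: (num_locations, channels)
    q = query_map / np.linalg.norm(query_map, axis=1, keepdims=True)
    g = gallery_map / np.linalg.norm(gallery_map, axis=1, keepdims=True)
    responses = q @ g.T                          # cosine similarity, all location pairs
    return float(responses.max(axis=1).mean())   # best gallery match per query location

rng = np.random.default_rng(2)
person_a = rng.normal(size=(24, 64))                                # toy feature map
person_a_shifted = np.roll(person_a, 5, axis=0) + 0.05 * rng.normal(size=(24, 64))
person_b = rng.normal(size=(24, 64))

# Local matching tolerates spatial misalignment: a shifted view of the
# same person scores higher than a different person.
assert qaconv_score(person_a, person_a_shifted) > qaconv_score(person_a, person_b)
```

Taking the maximum over gallery locations is what makes the matching robust to the unknown misalignments and pose changes the abstract mentions: a body part can match wherever it appears in the other image.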
Leveraging Endo- and Exo-Temporal Regularization for Black-box Video Domain Adaptation
To enable video models to be applied seamlessly across video tasks in
different environments, various Video Unsupervised Domain Adaptation (VUDA)
methods have been proposed to improve the robustness and transferability of
video models. Despite improvements made in model robustness, these VUDA methods
require access to both source data and source model parameters for adaptation,
raising serious data privacy and model portability issues. To cope with these
concerns, this paper first formulates Black-box Video Domain Adaptation (BVDA)
as a more realistic yet challenging scenario in which the source video model
is provided only as a black-box predictor. While a few methods for Black-box
Domain Adaptation (BDA) have been proposed in the image domain, they cannot be
applied directly to video, since the video modality carries more complicated
temporal features that are harder to align. To address BVDA, we propose a novel Endo and
eXo-TEmporal Regularized Network (EXTERN) by applying mask-to-mix strategies
and video-tailored regularizations: endo-temporal regularization and
exo-temporal regularization, performed across both clip and temporal features,
while distilling knowledge from the predictions obtained from the black-box
predictor. Empirical results demonstrate the state-of-the-art performance of
EXTERN across various cross-domain closed-set and partial-set action
recognition benchmarks, even surpassing most existing video domain adaptation
methods that have access to source data.
Comment: 9 pages, 4 figures, and 4 tables.
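The one supervision signal available in the black-box setting is the predictor's soft outputs, which are typically distilled into the target model. A generic NumPy sketch of such a distillation loss (not EXTERN's exact objective; the function names and toy values are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, blackbox_probs):
    """KL divergence from the black-box predictor's soft labels to the
    student's predictions -- the only supervision available when the source
    model is exposed purely as a black-box predictor."""
    p = blackbox_probs
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1)))

teacher = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # black-box soft predictions
good_student = np.log(teacher)        # logits that reproduce the teacher
bad_student = np.zeros_like(teacher)  # uniform predictions

assert distillation_loss(good_student, teacher) < distillation_loss(bad_student, teacher)
```

EXTERN's contribution, per the abstract, is to regularize this distillation with video-tailored endo- and exo-temporal terms so that the temporal structure, not just the per-clip label distribution, transfers to the target model.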