Multi-Modal Domain Adaptation for Fine-Grained Action Recognition
Fine-grained action recognition datasets exhibit environmental bias, where
multiple video sequences are captured from a limited number of environments.
Training a model in one environment and deploying in another results in a drop
in performance due to an unavoidable domain shift. Unsupervised Domain
Adaptation (UDA) approaches have frequently utilised adversarial training
between the source and target domains. However, these approaches have not
explored the multi-modal nature of video within each domain. In this work, we
exploit the correspondence of modalities as a self-supervised alignment
approach for UDA, in addition to adversarial alignment.
We test our approach on three kitchens from our large-scale dataset,
EPIC-Kitchens, using two modalities commonly employed for action recognition:
RGB and Optical Flow. We show that multi-modal self-supervision alone improves
performance over source-only training by 2.4% on average. We then combine
adversarial training with multi-modal self-supervision, showing that our
approach outperforms other UDA methods by 3%.
Comment: Accepted to CVPR 2020 for an oral presentation
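
The modality-correspondence cue can be cast as a compact binary pretext task: predict whether an RGB clip and a Flow clip come from the same video. Below is a minimal sketch, assuming pre-extracted per-clip RGB and Flow feature vectors; the module and loss names are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class CorrespondenceHead(nn.Module):
    # Binary classifier over a concatenated (RGB, Flow) feature pair.
    def __init__(self, rgb_dim, flow_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rgb_dim + flow_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # matched vs. mismatched pair
        )

    def forward(self, rgb_feat, flow_feat):
        return self.net(torch.cat([rgb_feat, flow_feat], dim=1))

def correspondence_loss(head, rgb_feat, flow_feat):
    # Positives: temporally aligned RGB/Flow pairs from the same clip.
    # Negatives: roll the Flow features so each RGB clip is paired
    # with Flow from a different clip in the batch.
    b = rgb_feat.size(0)
    logits = torch.cat([
        head(rgb_feat, flow_feat),
        head(rgb_feat, flow_feat.roll(shifts=1, dims=0)),
    ])
    labels = torch.cat([torch.ones(b), torch.zeros(b)]).long()
    return nn.functional.cross_entropy(logits, labels)

Because the task needs no labels, this loss can be applied to both source and target batches, which is what makes it usable as an alignment signal alongside adversarial training.
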
Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition
Wearable cameras are becoming increasingly popular in several applications, increasing the research community's interest in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. The amount of information about the action itself is therefore limited, making it crucial to understand the manipulated objects and their context. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial to better capture the spatio-temporal correlations between them. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.
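
The joint-encoding idea can be sketched as a shared appearance backbone with an auxiliary motion-prediction head trained alongside the classifier. The following is a minimal sketch under those assumptions; all names and the simple MSE pretext target are illustrative, not the authors' exact design.

import torch
import torch.nn as nn

class JointMotionAppearanceNet(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes, motion_dim):
        super().__init__()
        self.backbone = backbone                            # shared appearance encoder
        self.classifier = nn.Linear(feat_dim, num_classes)  # action head
        self.motion_head = nn.Linear(feat_dim, motion_dim)  # self-supervised pretext head

    def forward(self, frames):
        feats = self.backbone(frames)                       # (B, feat_dim)
        return self.classifier(feats), self.motion_head(feats)

def joint_loss(logits, motion_pred, labels, motion_target, alpha=1.0):
    # Supervised action loss plus self-supervised motion prediction,
    # forcing the single appearance stream to also encode motion cues.
    cls_loss = nn.functional.cross_entropy(logits, labels)
    ssl_loss = nn.functional.mse_loss(motion_pred, motion_target)
    return cls_loss + alpha * ssl_loss

At test time only the classifier branch is needed, so the motion head adds no inference cost, which is the usual payoff of a pretext-task design over a two-stream one.
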
PoliTO-IIT Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
In this report, we describe the technical details of our submission to the
EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action
Recognition. To tackle the domain shift that exists under the UDA setting, we
first exploited a recent Domain Generalization (DG) technique, called Relative
Norm Alignment (RNA). It consists of designing a model able to generalize well
to any unseen domain, regardless of whether target data can be accessed at
training time. Then, in a second phase, we extended the approach to work on
unlabelled target data, allowing the model to adapt to the target distribution
in an unsupervised fashion. For this purpose, we included in our framework
existing UDA algorithms, such as Temporal Attentive Adversarial Adaptation
Network (TA3N), jointly with new multi-stream consistency losses, namely
Temporal Hard Norm Alignment (T-HNA) and Min-Entropy Consistency (MEC). Our
submission (entry 'plnet') is visible on the leaderboard, where it achieved
1st position for 'verb' and 3rd position for both 'noun' and 'action'.
Comment: 3rd place in the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
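
The RNA idea referenced above rebalances the feature norms of the different modality streams. A minimal sketch of a relative-norm-style loss follows, assuming two batches of per-modality feature vectors; this reflects the stated idea of norm alignment, not necessarily the authors' exact formulation.

import torch

def rna_loss(feat_a, feat_b, eps=1e-8):
    # Mean L2 feature norm per modality over the batch.
    norm_a = feat_a.norm(p=2, dim=1).mean()
    norm_b = feat_b.norm(p=2, dim=1).mean()
    # Penalize deviation of the norm ratio from 1 so that no single
    # modality dominates the shared representation.
    return (norm_a / (norm_b + eps) - 1.0) ** 2

Because the loss uses no labels, it can be applied to source batches for generalization and to unlabelled target batches during the adaptation phase.
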
Contrastive Learning for Unsupervised Domain Adaptation of Time Series
Unsupervised domain adaptation (UDA) aims to learn, from a labeled source
domain, a machine learning model that performs well on a similar yet
different, unlabeled target domain. UDA is important in many applications such
as medicine, where it is used to adapt risk scores across different patient
cohorts. In this paper, we develop a novel framework for UDA of time series
data, called CLUDA. Specifically, we propose a contrastive learning framework
to learn contextual representations in multivariate time series, so that these
preserve label information for the prediction task. In our framework, we
further capture the variation in the contextual representations between source
and target domains via a custom nearest-neighbor contrastive learning scheme. To the
best of our knowledge, ours is the first framework to learn domain-invariant,
contextual representation for UDA of time series data. We evaluate our
framework using a wide range of time series datasets to demonstrate its
effectiveness and show that it achieves state-of-the-art performance for time
series UDA.
Comment: Published as a conference paper at ICLR 2023
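
The cross-domain alignment component described above can be viewed as a nearest-neighbor variant of InfoNCE: each target embedding treats its nearest source embedding as the positive and all other source embeddings as negatives. A minimal sketch under that reading follows; the function name and temperature are illustrative, and this is not the authors' exact CLUDA objective.

import torch
import torch.nn.functional as F

def nn_contrastive_loss(src_emb, tgt_emb, temperature=0.1):
    # Cosine similarities between every target and source embedding.
    src = F.normalize(src_emb, dim=1)   # (Ns, D)
    tgt = F.normalize(tgt_emb, dim=1)   # (Nt, D)
    sim = tgt @ src.t() / temperature   # (Nt, Ns)
    # Each target's nearest source neighbor serves as its positive;
    # the remaining source embeddings act as negatives (InfoNCE form).
    positives = sim.argmax(dim=1)
    return F.cross_entropy(sim, positives)

Pulling each target point toward its closest source point, rather than toward an arbitrary one, is what lets the alignment respect the local structure of the representation space.
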
Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective
Unsupervised video domain adaptation is a practical yet challenging task. In
this work, for the first time, we tackle it from a disentanglement perspective. Our
key idea is to handle the spatial and temporal domain divergence separately
through disentanglement. Specifically, we consider the generation of
cross-domain videos from two sets of latent factors, one encoding the static
information and another encoding the dynamic information. A Transfer Sequential
VAE (TranSVAE) framework is then developed to model such generation. To better
serve adaptation, we propose several objectives to constrain the latent
factors. With these constraints, the spatial divergence can be readily removed
by disentangling out the static, domain-specific information, and the temporal
divergence is further reduced at both the frame and video levels through
adversarial learning. Extensive experiments on the UCF-HMDB, Jester, and
Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE
compared with several state-of-the-art methods. The code with reproducible
results is publicly accessible.
Comment: 18 pages, 9 figures, 7 tables. Code at https://github.com/ldkong1205/TranSVAE
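
The static/dynamic factorization at the heart of the approach can be sketched as an encoder that emits one time-pooled latent per video and one latent per frame. Below is a minimal sketch assuming pre-extracted frame features; it illustrates the latent split and reparameterization only, not the full TranSVAE objective set.

import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    def __init__(self, feat_dim, z_static, z_dynamic, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.static_head = nn.Linear(hidden, 2 * z_static)    # mu, logvar per video
        self.dynamic_head = nn.Linear(hidden, 2 * z_dynamic)  # mu, logvar per frame

    def forward(self, frame_feats):             # frame_feats: (B, T, feat_dim)
        h, _ = self.rnn(frame_feats)            # (B, T, hidden)
        # One static code per video (time-pooled) and one dynamic code
        # per frame; reparameterization draws the latent samples.
        mu_s, logvar_s = self.static_head(h.mean(dim=1)).chunk(2, dim=-1)
        mu_d, logvar_d = self.dynamic_head(h).chunk(2, dim=-1)
        z_static = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()
        z_dynamic = mu_d + torch.randn_like(mu_d) * (0.5 * logvar_d).exp()
        return z_static, z_dynamic

Keeping appearance-like, domain-specific content in the per-video static code leaves the per-frame dynamic codes as the transferable part, which is what the adversarial objectives then align across domains.
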