In this work, we tackle the challenging problem of unsupervised video domain
adaptation (UVDA) for action recognition. We specifically focus on scenarios
with a substantial domain gap, in contrast to existing works, which primarily deal
with small domain gaps between labeled source domains and unlabeled target
domains. To establish a more realistic setting, we introduce a novel UVDA
scenario, denoted as Kinetics->BABEL, with a larger domain gap in
terms of both temporal dynamics and background shifts. To tackle the temporal
shift, i.e., action duration difference between the source and target domains,
we propose a global-local view alignment approach. To mitigate the background
shift, we propose to learn temporal-order-sensitive representations via temporal
order learning and background-invariant representations via background
augmentation. We empirically validate that the proposed method significantly
outperforms existing methods on the Kinetics->BABEL
dataset with a large domain gap. The code is available at
https://github.com/KHUVLL/GLAD.
Comment: This is an accepted WACV 2024 paper.