To enable video models to be applied seamlessly across video tasks in
different environments, various Video Unsupervised Domain Adaptation (VUDA)
methods have been proposed to improve the robustness and transferability of
video models. Despite these improvements in robustness, existing VUDA methods
require access to both the source data and the source model parameters during
adaptation, which raises serious data privacy and model portability concerns.
To address these concerns, this paper first formulates Black-box Video Domain
Adaptation (BVDA), a more realistic yet challenging scenario in which the
source video model is provided only as a black-box predictor. Although a few
Black-box Domain Adaptation (BDA) methods have been proposed in the image
domain, they cannot be applied directly to the video domain, since the video
modality involves more complicated temporal features that are harder to align.
To address BVDA, we propose a novel Endo and
eXo-TEmporal Regularized Network (EXTERN). EXTERN applies mask-to-mix
strategies together with two video-tailored regularizations, endo-temporal
regularization and exo-temporal regularization, performed across both clip and
temporal features, while distilling knowledge from the predictions of the
black-box predictor. Empirical results demonstrate the state-of-the-art
performance of
EXTERN across various cross-domain closed-set and partial-set action
recognition benchmarks, where it even surpasses most existing video domain
adaptation methods that have access to the source data.
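To make the black-box constraint concrete, the following minimal PyTorch sketch shows knowledge distillation from a black-box predictor on unlabeled target clips: the adapting model may only query the predictor's output probabilities, never its weights or gradients. This is an illustration of the general BVDA setting, not EXTERN's actual method; the names (BlackBoxPredictor, distillation_step), the KL-divergence objective, and the toy linear models standing in for video backbones are all assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlackBoxPredictor:
    """Hypothetical black-box source model: callers can query output
    probabilities only; parameters and gradients are inaccessible."""

    def __init__(self, model: nn.Module):
        self._model = model.eval()

    @torch.no_grad()
    def __call__(self, clips: torch.Tensor) -> torch.Tensor:
        # Return soft predictions; no gradients flow to the source model.
        return F.softmax(self._model(clips), dim=-1)


def distillation_step(student: nn.Module,
                      black_box: BlackBoxPredictor,
                      clips: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One adaptation step: align the student's predictions on unlabeled
    target clips with the black-box predictor's soft predictions."""
    teacher_probs = black_box(clips)  # queried output only
    student_log_probs = F.log_softmax(student(clips), dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    num_classes = 12
    # Toy stand-ins: flattened clip features instead of real video backbones.
    source_model = nn.Linear(256, num_classes)
    student = nn.Linear(256, num_classes)
    black_box = BlackBoxPredictor(source_model)
    optimizer = torch.optim.SGD(student.parameters(), lr=1e-2)
    clips = torch.randn(8, 256)  # a batch of 8 unlabeled target "clips"
    print(distillation_step(student, black_box, clips, optimizer))
```

EXTERN builds on this basic distillation setup by adding its mask-to-mix strategies and the endo- and exo-temporal regularizations over clip and temporal features, which are not depicted in this sketch.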