Video-efficient foundation models
The thesis strives to endow video understanding with video-efficiency by addressing the research question "What enables video-efficient video foundation models?" Video-efficiency encompasses developing video foundation models that are not only accurate but also exhibit label-efficiency, i.e., require fewer labels; domain-efficiency, i.e., apply to a variety of video learning scenarios; and data-efficiency, i.e., reduce the amount of video data needed for learning. The research question is addressed for RGB and non-RGB video modalities. In Chapter 2, we focus on improving the label- and domain-efficiency of non-RGB action recognition and detection. Chapter 3 introduces a new self-supervised approach for learning feature representations of 3D-skeleton video sequences. In Chapter 4, we conduct a large-scale study of existing RGB-based self-supervised video models to assess their performance across different facets of video-efficiency. Chapter 5 presents a new method for video self-supervision that explicitly aims to learn motion-focused video representations. To summarize, this thesis presents several novel approaches to improve the video-efficiency of video foundation models. Our research highlights the importance of transferring knowledge between RGB and non-RGB video modalities, exploring self-supervision for non-RGB video modeling, analyzing self-supervised models beyond canonical setups, and carefully designing new self-supervised tasks to develop video foundation models that exhibit different facets of video-efficiency. We hope that our work will inspire further research and development in this area, leading to even more video-efficient foundation models.
Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts
Video self-supervised learning (VSSL) has made significant progress in recent
years. However, the exact behavior and dynamics of these models under different
forms of distribution shift are not yet known. In this paper, we
comprehensively study the behavior of six popular self-supervised methods
(v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various
forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint
shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown
classes (zero-shot), and (vi) open-set recognition. To perform this extensive
study, we carefully craft a test bed consisting of 17 in-distribution and
out-of-distribution benchmark pairs using available public datasets and a
series of evaluation protocols to stress-test the different methods under the
intended shifts. Our study uncovers a series of intriguing findings and
interesting behaviors of VSSL methods. For instance, we observe that while
video models generally struggle with context shifts, v-MAE and supervised
learning exhibit more robustness. Moreover, our study shows that v-MAE is a
strong temporal learner, whereas contrastive methods, v-SimCLR and v-MoCo,
exhibit strong performance against viewpoint shifts. When studying the notion
of open-set recognition, we notice a trade-off between closed-set and open-set
recognition performance if the pretrained VSSL encoders are used without
finetuning. We hope that our work will contribute to the development of robust
video representation learning frameworks for various real-world scenarios. The
project page and code are available at: https://pritamqu.github.io/OOD-VSSL.
Comment: NeurIPS 2023 Spotlight
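To make the evaluation setup concrete, below is a minimal, self-contained sketch (not the authors' released code) of one common protocol for this kind of study: a frozen, pretrained video encoder is probed with a linear classifier fit on the in-distribution training split of a benchmark pair, then scored on both the in-distribution and out-of-distribution test splits; the accuracy gap quantifies robustness to the shift. The encoder, tensor shapes, and random data here are stand-in assumptions only; a real experiment would load a pretrained VSSL backbone and the actual benchmark pairs.

```python
# Hedged sketch: frozen-encoder linear probe on an ID/OOD benchmark pair.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def extract_features(encoder: nn.Module, loader: DataLoader):
    """Run the frozen encoder over a loader and collect (features, labels)."""
    encoder.eval()
    feats, labels = [], []
    with torch.no_grad():
        for clips, y in loader:
            feats.append(encoder(clips))
            labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(train_feats, train_labels, num_classes, epochs=20, lr=1e-2):
    """Fit a single linear layer on frozen features (standard linear evaluation)."""
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_feats), train_labels)
        loss.backward()
        opt.step()
    return probe

@torch.no_grad()
def accuracy(probe, feats, labels):
    return (probe(feats).argmax(dim=1) == labels).float().mean().item()

if __name__ == "__main__":
    feat_dim, num_classes = 512, 10
    # Stand-in encoder; a real study would load a pretrained VSSL backbone here.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, feat_dim))
    make_loader = lambda n: DataLoader(
        TensorDataset(torch.randn(n, 3, 8, 32, 32),          # fake video clips (B, C, T, H, W)
                      torch.randint(0, num_classes, (n,))),  # fake action labels
        batch_size=32,
    )
    id_train, id_test, ood_test = make_loader(256), make_loader(128), make_loader(128)

    f_tr, y_tr = extract_features(encoder, id_train)
    probe = linear_probe(f_tr, y_tr, num_classes)
    f_id, y_id = extract_features(encoder, id_test)
    f_ood, y_ood = extract_features(encoder, ood_test)
    print(f"ID acc:  {accuracy(probe, f_id, y_id):.3f}")
    print(f"OOD acc: {accuracy(probe, f_ood, y_ood):.3f}")  # ID-OOD gap measures robustness
```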
Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning
Contrastive learning, which relies on effective positive and negative sample
pairs, is beneficial for learning informative skeleton representations in
unsupervised skeleton-based action recognition. To obtain these positive and
negative pairs, existing weak/strong data augmentation methods randomly change
the appearance of skeletons, thereby only indirectly inducing semantic
perturbations. Such approaches have two limitations: 1) perturbing appearance
alone cannot adequately capture the intrinsic semantic information of
skeletons, and 2) random perturbation may turn the original positive/negative
pairs into soft positive/negative ones. To address this dilemma, we make the
first attempt to explore an attack-based augmentation scheme that additionally
introduces direct semantic perturbation, for constructing hard positive pairs
and further assisting in constructing hard negative pairs. In particular, we
propose a novel Attack-Augmentation Mixing-Contrastive learning (AMC) framework
that contrasts hard positive features and hard negative features to learn more
robust skeleton representations. In AMC, Attack-Augmentation (Att-Aug) is
designed to collaboratively perform targeted and untargeted perturbations of
skeletons via attack and augmentation, respectively, to generate high-quality
hard positive features. Meanwhile, the Positive-Negative Mixer (PNM) mixes hard
positive features and negative features to generate hard negative features,
which are used to update the mixed memory banks. Extensive experiments on three
public datasets
demonstrate that AMC is competitive with the state-of-the-art methods
- …
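The two ingredients described in the AMC abstract above can be illustrated with a short sketch (not the authors' implementation): an FGSM-style adversarial step that lowers a sample's similarity to its anchor embedding acts as a targeted perturbation producing a hard positive, while interpolating positive features with stored negative features yields harder negatives for an InfoNCE-style loss. The encoder, epsilon, mixing coefficient, and memory bank below are illustrative assumptions only.

```python
# Hedged sketch: attack-based hard positives and mixed hard negatives for
# skeleton-contrastive learning (illustrative stand-in, not the AMC code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonEncoder(nn.Module):
    """Stand-in encoder mapping a (T, J, 3) skeleton sequence to a unit embedding."""
    def __init__(self, num_joints=25, seq_len=64, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * num_joints * 3, dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def attack_augment(encoder, skeletons, anchors, epsilon=0.01):
    """One FGSM-style step that lowers similarity to the anchor -> a harder positive."""
    x = skeletons.clone().requires_grad_(True)
    sim = F.cosine_similarity(encoder(x), anchors.detach(), dim=-1).mean()
    grad, = torch.autograd.grad(sim, x)
    # Step against the similarity gradient (small epsilon keeps the action semantics).
    return (x - epsilon * grad.sign()).detach()

def mix_hard_negatives(pos_feats, neg_feats, lam=0.5):
    """Interpolate positive and stored negative features to obtain harder negatives."""
    return F.normalize(lam * pos_feats + (1 - lam) * neg_feats, dim=-1)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE over one positive and a bank of negatives per anchor."""
    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature
    neg_logits = anchor @ negatives.T / temperature
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    return F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))

if __name__ == "__main__":
    enc = SkeletonEncoder()
    skel = torch.randn(8, 64, 25, 3)                           # fake skeleton sequences
    anchors = enc(skel)                                        # anchor-view embeddings
    hard_pos = enc(attack_augment(enc, skel, anchors))         # attack-augmented hard positives
    memory_bank = F.normalize(torch.randn(256, 128), dim=-1)   # stand-in negative memory bank
    hard_negs = mix_hard_negatives(hard_pos.mean(0, keepdim=True), memory_bank)
    print("contrastive loss:", info_nce(anchors, hard_pos, hard_negs).item())
```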