4,506 research outputs found

    View-Independent Action Recognition from Temporal Self-Similarities


    Video-efficient foundation models

    The thesis strives to endow video understanding with video-efficiency by addressing the research question "What enables video-efficient video foundation models?" Video-efficiency encompasses developing video foundation models that are not only accurate but also exhibit label-efficiency (requiring fewer labels), domain-efficiency (applicability to a variety of video learning scenarios), and data-efficiency (reducing the amount of video data needed for learning). The research question is addressed for both RGB and non-RGB video modalities. In Chapter 2, we focus on improving the label- and domain-efficiency of non-RGB action recognition and detection. Chapter 3 introduces a new self-supervised approach for learning feature representations of 3D-skeleton video sequences. In Chapter 4, we conduct a large-scale study of existing RGB-based self-supervised video models to assess their performance across different facets of video-efficiency. Chapter 5 presents a new method for video self-supervision that explicitly aims to learn motion-focused video representations. To summarize, this thesis presents several novel approaches to improve the video-efficiency of video foundation models. Our research highlights the importance of transferring knowledge between RGB and non-RGB video modalities, exploring self-supervision for non-RGB video modeling, analyzing self-supervised models beyond canonical setups, and carefully designing new self-supervised tasks to develop video foundation models that exhibit different facets of video-efficiency. We hope that our work will inspire further research and development in this area, leading to even more video-efficient foundation models.
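    The label-efficiency facet mentioned above is commonly measured by freezing a pretrained video encoder and training only a linear classifier on a reduced label budget. The following is a minimal sketch of such a probe in PyTorch; the encoder, dataset, and feature_dim attribute are hypothetical placeholders, not code from the thesis.

    # Hypothetical label-efficiency probe: freeze a pretrained video encoder and
    # train only a linear head on a small fraction of the labelled data.
    # `encoder`, `dataset`, and `encoder.feature_dim` are placeholders, not thesis code.
    import random

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, Subset

    def linear_probe(encoder, dataset, num_classes, label_fraction=0.1, epochs=10):
        """Train a linear classifier on `label_fraction` of the data with a frozen encoder."""
        subset_size = int(label_fraction * len(dataset))
        indices = random.sample(range(len(dataset)), subset_size)
        loader = DataLoader(Subset(dataset, indices), batch_size=32, shuffle=True)

        encoder.eval()  # frozen backbone: only the head receives gradients
        head = nn.Linear(encoder.feature_dim, num_classes)
        optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

        for _ in range(epochs):
            for clips, labels in loader:
                with torch.no_grad():
                    features = encoder(clips)  # (batch, feature_dim) clip embeddings
                loss = nn.functional.cross_entropy(head(features), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return head

    Sweeping label_fraction (for example 1%, 10%, and 100%) and comparing downstream accuracy gives one concrete reading of how label-efficient a pretrained model is.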

    Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts

    Video self-supervised learning (VSSL) has made significant progress in recent years. However, the exact behavior and dynamics of these models under different forms of distribution shift are not yet known. In this paper, we comprehensively study the behavior of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown classes (zero-shot), and (vi) open-set recognition. To perform this extensive study, we carefully craft a test bed consisting of 17 in-distribution and out-of-distribution benchmark pairs using available public datasets, along with a series of evaluation protocols to stress-test the different methods under the intended shifts. Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods. For instance, we observe that while video models generally struggle with context shifts, v-MAE and supervised learning exhibit more robustness. Moreover, our study shows that v-MAE is a strong temporal learner, whereas the contrastive methods v-SimCLR and v-MoCo exhibit strong performance against viewpoint shifts. When studying the notion of open-set recognition, we notice a trade-off between closed-set and open-set recognition performance if the pretrained VSSL encoders are used without finetuning. We hope that our work will contribute to the development of robust video representation learning frameworks for various real-world scenarios. The project page and code are available at: https://pritamqu.github.io/OOD-VSSL
    Comment: NeurIPS 2023 Spotlight
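    The stress test described above reduces to a simple protocol: extract features with a frozen pretrained encoder, fit a linear probe on the in-distribution training split, and compare accuracy on the in-distribution and out-of-distribution test splits of a benchmark pair. The sketch below illustrates that protocol under assumed placeholder loaders and encoder; it is not the authors' released OOD-VSSL code.

    # Illustrative sketch of evaluating one InD/OOD benchmark pair with a frozen encoder.
    # Data loaders and the encoder are assumed placeholders.
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def extract_features(encoder, loader):
        """Run the frozen encoder over a loader and collect (features, labels)."""
        encoder.eval()
        feats, labels = [], []
        for clips, y in loader:
            feats.append(encoder(clips))
            labels.append(y)
        return torch.cat(feats), torch.cat(labels)

    def evaluate_pair(encoder, ind_train, ind_test, ood_test, num_classes, epochs=100):
        """Fit a linear probe on in-distribution features, then compare InD vs. OOD accuracy."""
        x_tr, y_tr = extract_features(encoder, ind_train)
        probe = nn.Linear(x_tr.shape[1], num_classes)
        optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
        for _ in range(epochs):
            loss = nn.functional.cross_entropy(probe(x_tr), y_tr)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        def accuracy(loader):
            x, y = extract_features(encoder, loader)
            return (probe(x).argmax(dim=1) == y).float().mean().item()

        return {"in_distribution": accuracy(ind_test), "out_of_distribution": accuracy(ood_test)}

    The gap between the in-distribution and out-of-distribution accuracies then serves as the robustness measure for each type of shift.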

    Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning

    Contrastive learning, which relies on effective positive and negative sample pairs, is beneficial for learning informative skeleton representations in unsupervised skeleton-based action recognition. To obtain these positive and negative pairs, existing weak/strong data augmentation methods randomly change the appearance of skeletons to indirectly pursue semantic perturbations. However, such approaches have two limitations: 1) perturbing appearance alone cannot fully capture the intrinsic semantic information of skeletons, and 2) random perturbations may turn the original positive/negative pairs into soft positive/negative ones. To address this dilemma, we make the first attempt to explore an attack-based augmentation scheme that additionally introduces direct semantic perturbation, for constructing hard positive pairs and further assisting in constructing hard negative pairs. In particular, we propose a novel Attack-Augmentation Mixing-Contrastive learning (A²MC) framework to contrast hard positive features and hard negative features for learning more robust skeleton representations. In A²MC, Attack-Augmentation (Att-Aug) is designed to collaboratively perform targeted and untargeted perturbations of skeletons via attack and augmentation, respectively, to generate high-quality hard positive features. Meanwhile, a Positive-Negative Mixer (PNM) mixes hard positive features and negative features to generate hard negative features, which are used to update the mixed memory banks. Extensive experiments on three public datasets demonstrate that A²MC is competitive with state-of-the-art methods.
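    To make the mechanism concrete, the sketch below shows one plausible, simplified reading of the two components: an adversarial perturbation of the input skeleton that pushes its embedding away from the anchor (a hard positive), and a mixing of that hard positive with memory-bank negatives to form hard negatives for an InfoNCE-style loss. The encoder, augment function, memory bank, and the uniform mixing coefficient are placeholders; this is not the authors' A²MC implementation.

    # Simplified sketch of attack-based hard positives and mixed hard negatives.
    # `encoder`, `augment`, and `memory_bank` are placeholders, not the A²MC release.
    import torch
    import torch.nn.functional as F

    def attack_view(encoder, skeleton, anchor_feat, epsilon=0.01):
        """FGSM-style perturbation of the skeleton that pushes its embedding away
        from the anchor embedding, producing a hard positive view."""
        x = skeleton.clone().detach().requires_grad_(True)
        feat = F.normalize(encoder(x), dim=-1)
        similarity = F.cosine_similarity(feat, anchor_feat.detach(), dim=-1).mean()
        grad = torch.autograd.grad(similarity, x)[0]
        return (x - epsilon * grad.sign()).detach()  # step that decreases similarity

    def contrastive_step(encoder, skeleton, augment, memory_bank, temperature=0.1):
        """One training step: anchor vs. attacked hard positive, with memory-bank
        negatives mixed toward the positive to make them harder."""
        anchor = F.normalize(encoder(augment(skeleton)), dim=-1)                           # (B, D)
        hard_pos = F.normalize(encoder(attack_view(encoder, skeleton, anchor)), dim=-1)    # (B, D)

        # mix stored negatives toward the mean hard positive to obtain hard negatives (K, D)
        lam = torch.rand(memory_bank.shape[0], 1, device=memory_bank.device)
        hard_neg = F.normalize(lam * hard_pos.mean(0, keepdim=True) + (1 - lam) * memory_bank, dim=-1)

        pos_logits = (anchor * hard_pos).sum(-1, keepdim=True) / temperature               # (B, 1)
        neg_logits = anchor @ hard_neg.t() / temperature                                   # (B, K)
        logits = torch.cat([pos_logits, neg_logits], dim=1)
        labels = torch.zeros(anchor.shape[0], dtype=torch.long, device=anchor.device)      # positive at index 0
        return F.cross_entropy(logits, labels)

    In the actual method the memory banks are updated with the mixed features after each step; the sketch only shows how hard positives and hard negatives could enter the contrastive objective.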