
    Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

    Full text link
    We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analysis tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both the spatial and temporal domains. Unlike prior pretext puzzles that are hard even for humans to solve, the proposed task is consistent with inherent human visual habits and is therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of the proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas. Comment: CVPR 2019.
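    The pretext labels here are statistics of low-level motion and appearance rather than class annotations. Below is a minimal, illustrative sketch (not the released code) of how such motion-statistics labels could be derived from a clip with OpenCV optical flow, assuming a simple grid-based spatial partition; a 3D CNN such as C3D would then be trained to regress labels of this kind.

        # Illustrative sketch (not the authors' code): derive simple motion-statistics
        # labels from a clip, which a 3D CNN could then be trained to regress.
        import cv2
        import numpy as np

        def motion_statistics_labels(frames, grid=4):
            """frames: list of HxWx3 uint8 frames. Returns (block_index, dominant_angle_bin)."""
            h, w = frames[0].shape[:2]
            mag_sum = np.zeros((h, w), dtype=np.float32)
            ang_hist = np.zeros(8, dtype=np.float32)          # 8 orientation bins
            prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
            for f in frames[1:]:
                cur = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
                flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
                mag_sum += mag
                bins = (ang / (2 * np.pi) * 8).astype(int) % 8
                np.add.at(ang_hist, bins.ravel(), mag.ravel())  # magnitude-weighted histogram
                prev = cur
            # Which of the grid x grid spatial blocks accumulates the most motion?
            blocks = mag_sum[:h // grid * grid, :w // grid * grid]
            blocks = blocks.reshape(grid, h // grid, grid, w // grid).sum(axis=(1, 3))
            largest_motion_block = int(np.argmax(blocks))
            dominant_direction = int(np.argmax(ang_hist))
            return largest_motion_block, dominant_direction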

    Self-supervised Video Representation Learning by Pace Prediction

    Get PDF
    This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each clip. The assumption is that the network can only succeed at this pace-reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we introduce contrastive learning to push the model towards discriminating different paces by maximizing agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace. Comment: Correct some typos; update some concurrent works accepted by ECCV 2020.
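    As a concrete illustration, the pretext task can be implemented by subsampling frames with a random stride and classifying that stride. The following is a minimal sketch under assumed PyTorch conventions; the pace set, clip length, and function names are illustrative and not taken from the released implementation.

        # Minimal sketch (assumed PyTorch setup, not the released code): sample
        # clips at random paces and train a 3D backbone to classify the pace.
        import random
        import torch
        import torch.nn as nn

        PACES = [1, 2, 3, 4]                 # sampling strides; 1 = natural pace

        def sample_clip_with_pace(video, clip_len=16):
            """video: tensor (C, T, H, W). Returns (clip, pace_label)."""
            label = random.randrange(len(PACES))
            stride = PACES[label]
            max_start = video.shape[1] - clip_len * stride
            start = random.randint(0, max(max_start, 0))
            idx = torch.arange(start, start + clip_len * stride, stride)
            idx = idx.clamp(max=video.shape[1] - 1)   # guard against short videos
            return video[:, idx], label

        def training_step(backbone, head, videos, optimizer):
            """videos: tensor (B, C, T, H, W); backbone maps a clip to a feature vector."""
            clips, labels = zip(*[sample_clip_with_pace(v) for v in videos])
            clips = torch.stack(clips)
            labels = torch.tensor(labels)
            logits = head(backbone(clips))            # (B, len(PACES))
            loss = nn.functional.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()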

    Advancing Vision Transformers with Group-Mix Attention

    Full text link
    Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as a Query-Key-Value computation. However, the attention map generated from the Query and Key captures only token-to-token correlations at a single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism that captures correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity. We therefore propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes. To this end, GMA uniformly splits the Query, Key, and Value into segments and performs different group aggregations to generate group proxies. The attention map is computed based on the mixtures of tokens and group proxies and is used to re-combine the tokens and groups in the Value. Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which achieves state-of-the-art performance in image classification, object detection, and semantic segmentation with fewer parameters than existing models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
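    The core idea is that some Query/Key/Value segments attend as plain tokens while others attend as pooled group proxies. Below is a highly simplified sketch of that idea, assuming average-pooling aggregators and a stock attention layer; it is not the GroupMixFormer implementation, which uses its own aggregators and head layout.

        # Highly simplified sketch of the group-proxy idea behind Group-Mix Attention.
        import torch
        import torch.nn as nn

        class GroupMixAttentionSketch(nn.Module):
            def __init__(self, dim, num_heads=8, group_sizes=(1, 3, 5)):
                super().__init__()  # dim must be divisible by num_heads
                self.qkv = nn.Linear(dim, dim * 3)
                self.proj = nn.Linear(dim, dim)
                self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
                # One aggregator per segment: kernel size 1 keeps plain tokens,
                # larger kernels pool adjacent tokens into group proxies.
                self.aggregators = nn.ModuleList(
                    [nn.AvgPool1d(k, stride=1, padding=k // 2) for k in group_sizes]
                )
                self.group_sizes = group_sizes

            def _mix(self, x):
                # x: (B, N, C). Split channels into segments and aggregate each one
                # along the token dimension with a different group size.
                segs = torch.chunk(x, len(self.group_sizes), dim=-1)
                mixed = [agg(s.transpose(1, 2)).transpose(1, 2)
                         for s, agg in zip(segs, self.aggregators)]
                return torch.cat(mixed, dim=-1)

            def forward(self, x):
                q, k, v = self.qkv(x).chunk(3, dim=-1)
                q, k, v = self._mix(q), self._mix(k), self._mix(v)
                out, _ = self.attn(q, k, v, need_weights=False)
                return self.proj(out)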

    Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

    Get PDF
    This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. A neural network is then built and trained to yield these statistical summaries given the video frames as input. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only impressions of rough spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts. Comment: Accepted by TPAMI. An extension of our previous work at arXiv:1904.0359
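    The key design choice is to regress coarse partition indices rather than exact coordinates. A minimal sketch of two assumed partitioning schemes (a regular grid and concentric rings; the paper's exact patterns may differ) is given below.

        # Minimal sketch (assumed partitioning schemes, not the paper's exact patterns):
        # encode an (x, y) location as a coarse block index so the network regresses a
        # rough spatial position rather than exact Cartesian coordinates.
        import numpy as np

        def grid_partition_label(x, y, w, h, grid=4):
            """Index of the grid cell containing (x, y) in a grid x grid partition."""
            col = min(int(x / w * grid), grid - 1)
            row = min(int(y / h * grid), grid - 1)
            return row * grid + col

        def ring_partition_label(x, y, w, h, rings=4):
            """Index of the concentric ring (0 = centre) containing (x, y)."""
            cx, cy = w / 2.0, h / 2.0
            r = np.hypot((x - cx) / cx, (y - cy) / cy)   # normalised radius
            return min(int(r * rings), rings - 1)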

    Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training

    Full text link
    This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both the audio and video data. Despite its simplicity, the speed co-augmentation method has two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the number of negative pairs, resulting in a significant enhancement of the learned representations, and (2) it relaxes the strict correlation between audio-visual pairs into a partial relationship between the augmented pairs, which is modeled by our proposed SoftInfoNCE loss to further boost performance. Experimental results show that the proposed method significantly improves the learned representations compared to vanilla audio-visual contrastive learning. Comment: Published at the CVPR 2023 Sight and Sound workshop.
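    A soft-weighted contrastive objective of this kind can be written as a cross-entropy against a soft target distribution over pairs. The sketch below is a generic soft-weighted InfoNCE under assumed PyTorch conventions; it is not necessarily the paper's exact SoftInfoNCE formulation, and the pair-weighting matrix is left to the caller.

        # Generic sketch (assumptions: PyTorch, a simple soft-weighted InfoNCE).
        import torch
        import torch.nn.functional as F

        def soft_info_nce(audio_emb, video_emb, soft_weights, temperature=0.07):
            """audio_emb, video_emb: (B, D) L2-normalised embeddings.
            soft_weights: (B, B) matrix down-weighting pairs whose playback speeds differ."""
            logits = audio_emb @ video_emb.t() / temperature     # (B, B) similarities
            log_prob = F.log_softmax(logits, dim=1)
            targets = soft_weights / soft_weights.sum(dim=1, keepdim=True)
            return -(targets * log_prob).sum(dim=1).mean()

        # Speed co-augmentation itself could be approximated by resampling both
        # modalities by the same random factor, e.g. torchaudio.functional.resample
        # for audio and frame-index striding for video.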

    Polar-facing slopes showed stronger greening trend than equatorial-facing slopes in Tibetan plateau grasslands

    Get PDF
    The orientation of slopes in alpine zones creates microclimates, e.g. equatorial-facing slopes (EFSs) are generally drier and warmer than polar-facing slopes (PFSs). The vegetation growing in these microhabitats responds divergently to climatic warming depending on the slope orientation. We propose a spatial metric, the greenness asymmetric index (GAI), defined as the ratio of the average normalized difference vegetation index (NDVI) on PFSs to that on EFSs within a given spatial window, to quantify the asymmetry of greenness across aspects. We calculated GAI for each non-overlapping 3 × 3 km² (100 × 100 Landsat pixels) grid cell, and seamlessly mapped it over Tibetan Plateau (TP) grasslands using NDVI time series from the Landsat-5, -7 and -8 satellites. PFSs were greener than EFSs (GAI > 1) in warm and dry areas, and EFSs were greener than PFSs (GAI < 1) in cold and wet areas. We also detected a stronger greening trend (0.0040 vs 0.0034 y⁻¹) and a higher sensitivity of NDVI to temperature (0.031 vs 0.026 °C⁻¹) on PFSs than on EFSs, leading to a significant positive trend in GAI (0.00065 y⁻¹, P < 0.01) in the TP from 1991 to 2020. Our results suggest that global warming has exacerbated the greenness asymmetry associated with slope orientation: PFSs are more sensitive to warming and have been greening at a faster rate than EFSs. The gradient from EFSs to PFSs provides a "natural laboratory" to study the interaction of water and temperature limitations on vegetation growth. Our study is the first to detect the effect of aspect on the greening trend in the TP. Future research needs to clarify the full set of biotic and abiotic determinants of this spatial and temporal asymmetry of greenness across aspects, with the support of extensive field measurements and refined high-resolution NDVI products. This study was funded by the National Natural Science Foundation of China (42271323 and 41971282), the Sichuan Science and Technology Program (2021JDJQ0007), the Spanish Government project TED2021-132627B-I00 funded by the Spanish MCIN, AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR, the Fundación Ramón Areces project CIVP20A6621, and the Catalan government project SGR2021-1333.
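    As a worked illustration of the metric, GAI can be computed per window as the ratio of mean NDVI on polar-facing pixels to mean NDVI on equatorial-facing pixels, with aspect classes derived from a DEM. The sketch below uses assumed aspect thresholds and plain NumPy arrays; it is not the authors' processing chain.

        # Illustrative sketch (assumed aspect thresholds and array inputs): compute the
        # greenness asymmetry index (GAI) as the ratio of mean NDVI on polar-facing to
        # equatorial-facing slopes within each window.
        import numpy as np

        def gai_map(ndvi, aspect_deg, window=100):
            """ndvi, aspect_deg: 2D arrays on the same grid (e.g. 30 m Landsat pixels).
            window=100 pixels is roughly 3 km. Northern hemisphere: polar-facing = north-facing."""
            pfs = (aspect_deg >= 315) | (aspect_deg < 45)     # polar-facing (north)
            efs = (aspect_deg >= 135) & (aspect_deg < 225)    # equatorial-facing (south)
            h, w = ndvi.shape
            out = np.full((h // window, w // window), np.nan)
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    sl = np.s_[i * window:(i + 1) * window, j * window:(j + 1) * window]
                    n_pfs, n_efs = ndvi[sl][pfs[sl]], ndvi[sl][efs[sl]]
                    if n_pfs.size and n_efs.size and np.nanmean(n_efs) != 0:
                        out[i, j] = np.nanmean(n_pfs) / np.nanmean(n_efs)
            return out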

    View-Invariant Human Action Recognition Based on a 3D Bio-Constrained Skeleton Model

    No full text