Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the features they learn are
extracted merely on a frame-by-frame basis and are not applicable to many video
analysis tasks where spatio-temporal features prevail. In this paper, we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzle-style pretext tasks that are hard even
for humans to solve, the proposed task is consistent with inherent human visual
habits and is therefore easy to answer. We conduct extensive experiments with
C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas.
Comment: CVPR 2019.
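As a rough illustration of how such statistical targets can be derived from raw frames, the sketch below computes a coarse largest-motion block and a dominant-color label for the most temporally diverse block. It uses frame differences as a stand-in for the paper's motion cues; the block layout, color quantization, and function name are illustrative, not the authors' exact pipeline.

```python
import numpy as np

def motion_appearance_labels(clip, grid=3):
    """Derive coarse pretext labels from a clip of shape (T, H, W, 3), uint8 RGB.

    Returns the index of the grid block with the largest motion, the index of
    the block with the largest temporal color diversity, and that block's
    dominant color bin. Frame differences are used here as a simple proxy
    for the motion statistics described in the paper.
    """
    clip = clip.astype(np.float32) / 255.0
    T, H, W, _ = clip.shape
    bh, bw = H // grid, W // grid

    motion = np.abs(np.diff(clip, axis=0)).mean(axis=(0, 3))   # (H, W) motion proxy
    diversity = clip.std(axis=0).mean(axis=2)                  # (H, W) temporal color std

    def block_scores(m):
        return np.array([[m[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean()
                          for j in range(grid)] for i in range(grid)])

    motion_block = int(block_scores(motion).argmax())
    color_block = int(block_scores(diversity).argmax())

    # Dominant color of the most diverse block: quantize mean RGB into 8 bins.
    i, j = divmod(color_block, grid)
    mean_rgb = clip[:, i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean(axis=(0, 1, 2))
    dominant_color = int(np.dot((mean_rgb > 0.5).astype(int), [4, 2, 1]))

    return motion_block, color_block, dominant_color
```

These block indices and color bins would then serve as regression or classification targets for a 3D backbone such as C3D.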
Self-supervised Video Representation Learning by Pace Prediction
This paper addresses the problem of self-supervised video representation
learning from a new perspective -- by video pace prediction. It stems from the
observation that the human visual system is sensitive to video pace, e.g., slow
motion, a technique widely used in film making. Specifically, given a video
played at its natural pace, we randomly sample training clips at different paces
and ask a neural network to identify the pace for each video clip. The
assumption here is that the network can only succeed in such a pace reasoning
task when it understands the underlying video content and learns representative
spatio-temporal features. In addition, we introduce contrastive learning to
push the model towards discriminating different paces by maximizing
the agreement on similar video content. To validate the effectiveness of the
proposed method, we conduct extensive experiments on action recognition and
video retrieval tasks with several alternative network architectures.
Experimental evaluations show that our approach achieves state-of-the-art
performance for self-supervised video representation learning across different
network architectures and different benchmarks. The code and pre-trained models
are available at https://github.com/laura-wang/video-pace.
Comment: Corrected some typos; updated some concurrent works accepted by ECCV 2020.
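A minimal sketch of the pace-prediction setup described above, assuming a generic 3D backbone and an illustrative set of sampling strides; the exact pace candidates, clip length, and classifier head in the released code may differ.

```python
import random
import torch
import torch.nn as nn

PACES = [1, 2, 4, 8]  # sampling strides; illustrative, not necessarily the paper's exact set

def sample_paced_clip(video, clip_len=16):
    """video: tensor (T, C, H, W). Returns (clip, pace_label)."""
    pace_label = random.randrange(len(PACES))
    stride = PACES[pace_label]
    max_start = video.shape[0] - clip_len * stride
    start = random.randint(0, max(max_start, 0))
    idx = torch.arange(start, start + clip_len * stride, stride)
    idx = idx.clamp(max=video.shape[0] - 1)      # guard against short videos
    return video[idx], pace_label

def pace_loss(backbone, classifier, video):
    """Training step: any 3D backbone producing (1, D) features,
    followed by a linear pace classifier with len(PACES) outputs."""
    clip, label = sample_paced_clip(video)
    feat = backbone(clip.unsqueeze(0))           # (1, D)
    logits = classifier(feat)                    # (1, len(PACES))
    return nn.functional.cross_entropy(logits, torch.tensor([label]))
```

The contrastive term mentioned in the abstract would be added on top of this classification loss, pulling together clips sampled from the same video content.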
Advancing Vision Transformers with Group-Mix Attention
Vision Transformers (ViTs) have been shown to enhance visual recognition
through modeling long-range dependencies with multi-head self-attention (MHSA),
which is typically formulated as Query-Key-Value computation. However, the
attention map generated from the Query and Key captures only token-to-token
correlations at a single granularity. In this paper, we argue that
self-attention should have a more comprehensive mechanism to capture
correlations among tokens and groups (i.e., multiple adjacent tokens) for
higher representational capacity. We therefore propose Group-Mix Attention (GMA)
as an advanced replacement for traditional self-attention, which can
simultaneously capture token-to-token, token-to-group, and group-to-group
correlations with various group sizes. To this end, GMA splits the Query, Key,
and Value into segments uniformly and performs different group aggregations to
generate group proxies. The attention map is computed based on the mixtures of
tokens and group proxies and used to re-combine the tokens and groups in Value.
Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which
achieves state-of-the-art performance in image classification, object
detection, and semantic segmentation with fewer parameters than existing
models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input)
attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while
GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
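The description of GMA above can be made concrete with a simplified sketch: Query, Key, and Value are split channel-wise into segments, each segment is aggregated over a different neighbourhood size to form group proxies, and standard scaled-dot-product attention mixes the result. The use of average pooling as the aggregator and the specific kernel sizes are assumptions for illustration, not the GroupMixFormer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedGroupMixAttention(nn.Module):
    """A simplified, illustrative sketch of Group-Mix Attention."""
    def __init__(self, dim, num_heads=8, kernels=(1, 3, 5, 7)):
        super().__init__()
        assert dim % len(kernels) == 0 and dim % num_heads == 0
        self.num_heads, self.kernels = num_heads, kernels
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _aggregate(self, x, h, w):
        # x: (B, N, C) with N = h * w; pool each channel segment over a
        # different k x k token neighbourhood to form group proxies.
        B, N, C = x.shape
        seg = C // len(self.kernels)
        x = x.transpose(1, 2).reshape(B, C, h, w)
        outs = []
        for i, k in enumerate(self.kernels):
            part = x[:, i * seg:(i + 1) * seg]
            if k > 1:
                part = F.avg_pool2d(part, k, stride=1, padding=k // 2)
            outs.append(part)
        return torch.cat(outs, dim=1).reshape(B, C, N).transpose(1, 2)

    def forward(self, x, h, w):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (self._aggregate(t, h, w) for t in (q, k, v))
        q, k, v = (t.reshape(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In this sketch, the segment pooled with kernel size 1 preserves token-to-token correlations, while larger kernels contribute token-to-group and group-to-group correlations at coarser granularities.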
Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
This paper proposes a novel pretext task to address the self-supervised video
representation learning problem. Specifically, given an unlabeled video clip,
we compute a series of spatio-temporal statistical summaries, such as the
spatial location and dominant direction of the largest motion, the spatial
location and dominant color of the largest color diversity along the temporal
axis, etc. Then a neural network is built and trained to yield the statistical
summaries given the video frames as inputs. In order to alleviate the learning
difficulty, we employ several spatial partitioning patterns to encode rough
spatial locations instead of exact spatial Cartesian coordinates. Our approach
is inspired by the observation that the human visual system is sensitive to
rapidly changing content in the visual field and needs only rough impressions
of spatial locations to understand the visual content. To validate the
effectiveness of the proposed approach, we conduct extensive experiments with
four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results
show that our approach outperforms the existing approaches across these
backbone networks on four downstream video analysis tasks including action
recognition, video retrieval, dynamic scene recognition, and action similarity
labeling. The source code is publicly available at:
https://github.com/laura-wang/video_repres_sts.
Comment: Accepted by TPAMI. An extension of our previous work at arXiv:1904.0359
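To illustrate how exact coordinates can be reduced to the rough spatial locations mentioned above, the sketch below maps an argmax location to class labels under a few illustrative partitioning patterns (regular grid, quadrants, concentric rings); the actual patterns used in the paper may differ.

```python
import numpy as np

def rough_location_labels(y, x, H, W):
    """Map an exact location (y, x) in an H x W frame to rough-location
    labels under a few illustrative spatial partitioning patterns."""
    labels = {}

    # Pattern 1: regular 4 x 4 grid -> 16 classes.
    labels["grid4x4"] = (y * 4 // H) * 4 + (x * 4 // W)

    # Pattern 2: quadrants -> 4 classes.
    labels["quadrant"] = int(y >= H / 2) * 2 + int(x >= W / 2)

    # Pattern 3: concentric rings around the frame centre -> 3 classes.
    r = np.hypot(y - H / 2, x - W / 2) / np.hypot(H / 2, W / 2)
    labels["ring"] = int(min(r * 3, 2))

    return labels

# Example: the largest-motion location (30, 100) in a 112 x 112 frame.
print(rough_location_labels(30, 100, 112, 112))
```

Predicting such coarse class labels instead of exact Cartesian coordinates is what keeps the pretext task easy enough for the network to learn.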
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
This work aims to improve unsupervised audio-visual pre-training. Inspired by
the efficacy of data augmentation in visual contrastive learning, we propose a
novel speed co-augmentation method that randomly changes the playback speeds of
both audio and video data. Despite its simplicity, the speed co-augmentation
method possesses two compelling attributes: (1) it increases the diversity of
audio-visual pairs and doubles the number of negative pairs, resulting in a
significant enhancement of the learned representations, and (2) it relaxes the
strict correlation between audio-visual pairs and instead introduces a partial
relationship between the augmented pairs, which is modeled by our proposed
SoftInfoNCE loss to further boost the performance. Experimental results show
that the proposed method significantly improves the learned representations
when compared to vanilla audio-visual contrastive learning.
Comment: Published at the CVPR 2023 Sight and Sound workshop.
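The abstract does not spell out the form of the SoftInfoNCE loss, but a generic soft-target contrastive loss conveys the idea: each row of a target matrix distributes probability mass between the true audio-visual pair and its speed-augmented variants. The weighting scheme below is purely illustrative.

```python
import torch
import torch.nn.functional as F

def soft_info_nce(audio_emb, video_emb, soft_targets, temperature=0.07):
    """A generic soft-target contrastive loss in the spirit of SoftInfoNCE.

    audio_emb, video_emb: (B, D) L2-normalised embeddings.
    soft_targets: (B, B) rows summing to 1; the diagonal holds the weight of
    the true pair, off-diagonal entries hold partial weights for pairs that
    share content but differ in playback speed. The exact weighting used in
    the paper is not reproduced here; this is only the loss skeleton.
    """
    logits = audio_emb @ video_emb.t() / temperature   # (B, B) similarities
    log_prob = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_prob).sum(dim=1).mean()

# Example targets: 90% weight on the true pair, the rest spread uniformly.
B = 8
targets = torch.full((B, B), 0.1 / (B - 1))
targets.fill_diagonal_(0.9)
```

With hard one-hot targets this reduces to the standard InfoNCE objective used in vanilla audio-visual contrastive learning.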
Polar-facing slopes showed stronger greening trend than equatorial-facing slopes in Tibetan plateau grasslands
The orientation of slopes in alpine zones creates microclimates, e.g. equatorial-facing slopes (EFSs) are generally drier and warmer than polar-facing slopes (PFSs). The vegetation growing in these microhabitats responds divergently to climatic warming depending on the slope orientation. We proposed a spatial metric, the greenness asymmetry index (GAI), defined as the ratio between the average normalized difference vegetation index (NDVI) on PFSs and EFSs within a given spatial window, to quantify the asymmetry of greenness across aspects. We calculated GAI for each non-overlapping 3 × 3 km² (100 × 100 Landsat pixels) grid cell and seamlessly mapped it across Tibetan Plateau (TP) grasslands using NDVI time series from the Landsat-5, -7 and -8 satellites. PFSs were greener than EFSs (GAI > 1) in warm and dry areas, and EFSs were greener than PFSs (GAI < 1) in cold and wet areas. We also detected a stronger greening trend (0.0040 vs 0.0034 y⁻¹) and a higher sensitivity of NDVI to temperature (0.031 vs 0.026 °C⁻¹) on PFSs than on EFSs, leading to a significant positive trend in GAI (0.00065 y⁻¹, P < 0.01) in the TP from 1991 to 2020. Our results suggest that global warming has exacerbated the greenness asymmetry associated with slope orientation: PFSs are more sensitive to warming and have been greening at a faster rate than EFSs. The gradient between EFSs and PFSs provides a “natural laboratory” for studying the interaction of water and temperature limitations on vegetation growth. Our study is the first to detect the effect of aspect on the greening trend in the TP. Future research needs to clarify the full set of biotic and abiotic determinants of this spatial and temporal asymmetry of greenness across aspects, with the support of extensive field measurements and refined high-resolution NDVI products.
This study was funded by the National Natural Science Foundation of China 42271323 and 41971282, the Sichuan Science and Technology Program 2021JDJQ0007, the Spanish Government project TED2021-132627B-I00 funded by the Spanish MCIN, AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR, the Fundación Ramón Areces project CIVP20A6621 and the Catalan government project SGR2021-1333.
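As a sketch of the GAI computation described above: for each non-overlapping 100 × 100-pixel window, average NDVI over polar-facing and equatorial-facing pixels (classified here from an aspect raster with illustrative thresholds) and take their ratio. The slope-orientation thresholds and the handling of flat terrain are assumptions, not the paper's exact processing chain.

```python
import numpy as np

def greenness_asymmetry_index(ndvi, aspect, window=100):
    """Compute GAI per non-overlapping window (default 100 x 100 pixels,
    i.e. 3 x 3 km at 30 m Landsat resolution).

    ndvi, aspect: 2-D arrays of equal shape; aspect in degrees from north.
    Polar-facing (north-facing, here 315-45 deg) and equatorial-facing
    (south-facing, 135-225 deg) thresholds assume the Northern Hemisphere
    and are illustrative only.
    """
    pfs = (aspect >= 315) | (aspect < 45)
    efs = (aspect >= 135) & (aspect < 225)

    H, W = ndvi.shape
    gai = np.full((H // window, W // window), np.nan)
    for i in range(H // window):
        for j in range(W // window):
            sl = np.s_[i*window:(i+1)*window, j*window:(j+1)*window]
            pfs_ndvi = ndvi[sl][pfs[sl]]
            efs_ndvi = ndvi[sl][efs[sl]]
            if pfs_ndvi.size and efs_ndvi.size and efs_ndvi.mean() != 0:
                gai[i, j] = pfs_ndvi.mean() / efs_ndvi.mean()
    return gai  # GAI > 1: polar-facing greener; GAI < 1: equatorial-facing greener
```

The temporal trend in GAI reported in the abstract would then follow from regressing each grid cell's annual GAI values against year over 1991-2020.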