Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video
We propose a self-supervised visual learning method based on predicting the
variable playback speeds of a video. Without semantic labels, we learn the
spatio-temporal visual representation of the video by leveraging the variations
in the visual appearance according to different playback speeds under the
assumption of temporal coherence. To learn the spatio-temporal visual
variations across the entire video, we not only predict a single playback
speed but also generate clips with various playback speeds and directions and
randomized starting points. Hence, the visual representation can be
learned from the meta information (playback speeds and directions) of the
video. We also propose a new layer-dependable temporal group normalization
method that can be applied to 3D convolutional networks to improve
representation learning performance: we divide the temporal features into
several groups and normalize each group with its own parameters. We validate
the effectiveness of our method by fine-tuning it for action recognition and
video retrieval on UCF-101 and HMDB-51.
Comment: Accepted by IEEE Access on May 19, 202
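A rough sketch of the playback-speed pretext task, under my own assumptions about clip length, speed set, and tensor layout (an illustration, not the authors' released code): clips are drawn at random speeds and directions from a frame sequence, and the (speed, direction) pair becomes the self-supervised classification target.

```python
import torch

SPEEDS = [1, 2, 4]   # assumed playback-speed factors (frame-sampling strides)
CLIP_LEN = 16        # assumed clip length in frames

def sample_speed_clip(video):
    """video: (T, C, H, W) frame tensor.
    Returns (clip, speed_idx, direction_idx); the speed/direction indices are
    the labels a 3D CNN would be trained to predict."""
    T = video.shape[0]
    speed_idx = torch.randint(len(SPEEDS), (1,)).item()
    direction_idx = torch.randint(2, (1,)).item()      # 0: forward, 1: backward
    stride = SPEEDS[speed_idx]
    start = torch.randint(max(T - stride * CLIP_LEN, 1), (1,)).item()
    idx = torch.arange(start, start + stride * CLIP_LEN, stride)
    idx = idx.clamp(max=T - 1)                         # guard against short videos
    clip = video[idx]
    if direction_idx == 1:
        clip = clip.flip(0)                            # reversed playback direction
    return clip, speed_idx, direction_idx
```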
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
We study unsupervised video representation learning that seeks to learn both
motion and appearance features from unlabeled video only, which can be reused
for downstream tasks such as action recognition. This task, however, is
extremely challenging due to 1) the highly complex spatial-temporal information
in videos; and 2) the lack of labeled data for training. Unlike the
representation learning for static images, it is difficult to construct a
suitable self-supervised task to well model both motion and appearance
features. More recently, several attempts have been made to learn video
representations through video playback speed prediction. However, it is
non-trivial to obtain precise speed labels for the videos. More critically, the
learnt models may tend to focus on motion patterns and thus may not learn
appearance features well. In this paper, we observe that the relative playback
speed is more consistent with the motion pattern and thus provides more
effective and stable supervision for representation learning. Therefore, we
propose a new way to perceive the playback speed and exploit the relative
speed between two video clips as labels. In this way, we are able to perceive
speed well and
learn better motion features. Moreover, to ensure the learning of appearance
features, we further propose an appearance-focused task, where we require the
model to perceive the appearance difference between two video clips. We show
that optimizing the two tasks jointly consistently improves the performance on
two downstream tasks, namely action recognition and video retrieval.
Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy
without the use of labeled data for pre-training, which outperforms the
ImageNet supervised pre-trained model. Code and pre-trained models can be found
at https://github.com/PeihaoChen/RSPNet.
Comment: Accepted by AAAI-2021. Code and pre-trained models can be found at https://github.com/PeihaoChen/RSPNet
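The relative-speed idea can be sketched as follows (an illustration with assumed speeds and clip shapes, not the released RSPNet code): two clips are sampled from the same video at random strides, and the label records which clip plays faster rather than the absolute speed of either.

```python
import random
import torch

SPEEDS = [1, 2, 4]  # assumed frame-sampling strides

def sample_clip(video, stride, clip_len=16):
    """video: (T, C, H, W). Return a clip subsampled at the given stride."""
    T = video.shape[0]
    start = random.randrange(max(T - stride * clip_len, 1))
    return video[start:start + stride * clip_len:stride]

def relative_speed_pair(video, clip_len=16):
    """Return two clips and a 3-way relative-speed label:
    0 = first clip slower, 1 = same speed, 2 = first clip faster."""
    s1, s2 = random.choice(SPEEDS), random.choice(SPEEDS)
    c1 = sample_clip(video, s1, clip_len)
    c2 = sample_clip(video, s2, clip_len)
    label = 1 if s1 == s2 else (0 if s1 < s2 else 2)
    return c1, c2, label
```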
Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
As the most essential property in a video, motion information is critical to
a robust and generalized video representation. To inject motion dynamics,
recent works have adopted frame difference as the source of motion information
in video contrastive learning, considering the trade-off between quality and
cost. However, existing works align motion features at the instance level,
which suffers from weak spatial and temporal alignment across modalities. In
this paper, we present a Fine-grained Motion Alignment (FIMA) framework,
capable of introducing well-aligned and
significant motion information. Specifically, we first develop a dense
contrastive learning framework in the spatiotemporal domain to generate
pixel-level motion supervision. Then, we design a motion decoder and a
foreground sampling strategy to eliminate the weak alignments in terms of time
and space. Moreover, a frame-level motion contrastive loss is presented to
improve the temporal diversity of the motion features. Extensive experiments
demonstrate that the representations learned by FIMA possess great
motion-awareness capabilities and achieve state-of-the-art or competitive
results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code
is available at https://github.com/ZMHH-H/FIMA.
Comment: ACM MM 2023 Camera Ready
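A simplified illustration of the two ingredients described above, with all details (feature shapes, loss form) assumed by me rather than taken from the FIMA release: frame differences act as a cheap motion signal, and RGB and motion feature maps are contrasted densely per spatio-temporal location instead of per instance.

```python
import torch
import torch.nn.functional as F

def frame_difference(clip):
    """clip: (T, C, H, W) -> (T-1, C, H, W) motion proxy from adjacent-frame differences."""
    return clip[1:] - clip[:-1]

def dense_contrastive_loss(rgb_feat, motion_feat, temperature=0.1):
    """rgb_feat, motion_feat: (N, D, T, H, W) spatially aligned feature maps.
    Each location in rgb_feat is pulled toward the same location in motion_feat;
    every other location in the batch serves as a negative."""
    N, D = rgb_feat.shape[:2]
    q = F.normalize(rgb_feat.reshape(N, D, -1).permute(0, 2, 1).reshape(-1, D), dim=1)
    k = F.normalize(motion_feat.reshape(N, D, -1).permute(0, 2, 1).reshape(-1, D), dim=1)
    logits = q @ k.t() / temperature                     # (N*L, N*L) similarities
    targets = torch.arange(q.shape[0], device=q.device)  # positive = matching location
    return F.cross_entropy(logits, targets)
```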
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Video question answering (VideoQA) is a complex task that requires diverse
multi-modal data for training. Manual annotation of questions and answers for
videos, however, is tedious and prohibits scalability. To tackle this problem,
recent methods consider zero-shot settings with no manual annotation of visual
question-answer pairs. In particular, a promising approach adapts frozen
autoregressive language models pretrained on Web-scale text-only data to
multi-modal inputs. In contrast, we here build on frozen bidirectional language
models (BiLM) and show that such an approach provides a stronger and cheaper
alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs
with the frozen BiLM using light trainable modules, (ii) we train such modules
using Web-scraped multi-modal data, and finally (iii) we perform zero-shot
VideoQA inference through masked language modeling, where the masked text is
the answer to a given question. Our proposed approach, FrozenBiLM, outperforms
the state of the art in zero-shot VideoQA by a significant margin on a variety
of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA,
TGIF-FrameQA, How2QA, and TVQA. It also demonstrates competitive performance in
the few-shot and fully-supervised settings. Our code and models are publicly
available at https://github.com/antoyang/FrozenBiLM.
Comment: NeurIPS 2022 Camera-Ready; Project Webpage: https://antoyang.github.io/frozenbilm.html; 25 pages; 5 figures
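A text-only sketch of the masked-language-modeling inference step: the answer slot is a mask token and candidate answers are ranked by the frozen model's score for that position. FrozenBiLM additionally conditions on visual tokens through light trainable modules, which this illustration omits; the prompt format and the use of bert-base-uncased here are my own assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def score_answers(question, candidates):
    """Rank single-token candidate answers by their masked-LM probability."""
    text = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]   # vocabulary scores at the mask
    ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
    probs = logits[ids].softmax(dim=0)
    return sorted(zip(candidates, probs.tolist()), key=lambda x: -x[1])

# e.g. score_answers("what is the man holding?", ["guitar", "ball", "phone"])
```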
Dual Contrastive Learning for Spatio-temporal Representation
Contrastive learning has shown promising potential in self-supervised
spatio-temporal representation learning. Most works naively sample different
clips to construct positive and negative pairs. However, we observe that this
formulation inclines the model towards background scene bias. The
underlying reasons are twofold. First, the scene difference is usually more
noticeable and easier to discriminate than the motion difference. Second, the
clips sampled from the same video often share similar backgrounds but have
distinct motions. Simply regarding them as positive pairs will draw the model
to the static background rather than the motion pattern. To tackle this
challenge, this paper presents a novel dual contrastive formulation.
Concretely, we decouple the input RGB video sequence into two complementary
modes, static scene and dynamic motion. Then, the original RGB features are
pulled closer to the static features and the aligned dynamic features. In this
way, the static scene and the dynamic motion are
simultaneously encoded into the compact RGB representation. We further conduct
the feature space decoupling via activation maps to distill static- and
dynamic-related features. We term our method Dual Contrastive Learning for
spatio-temporal Representation (DCLR). Extensive experiments demonstrate that
DCLR learns effective spatio-temporal representations and obtains
state-of-the-art or comparable performance on the UCF-101, HMDB-51, and
Diving-48 datasets.
Comment: ACM MM 2022 camera ready
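The decoupled, dual objective can be sketched roughly as follows (the static/dynamic view constructions and the loss form are my simplifications, not the DCLR release): a static view repeats a single frame to suppress motion, a dynamic view uses frame differences to suppress the scene, and the RGB embedding is pulled toward both with separate contrastive terms.

```python
import torch
import torch.nn.functional as F

def static_view(clip):
    """clip: (T, C, H, W) -> the middle frame repeated T times (motion removed)."""
    return clip[clip.shape[0] // 2].unsqueeze(0).expand_as(clip).contiguous()

def dynamic_view(clip):
    """Frame differences as a scene-suppressed, motion-dominant view (kept at T frames)."""
    return torch.cat([clip[1:] - clip[:-1], clip[-1:] - clip[-2:-1]], dim=0)

def info_nce(q, k, temperature=0.1):
    """q, k: (N, D) embeddings; row i of q matches row i of k."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / temperature
    return F.cross_entropy(logits, torch.arange(q.shape[0], device=q.device))

def dual_contrastive_loss(rgb_emb, static_emb, dynamic_emb):
    # Pull the RGB embedding toward both the static and the dynamic view embeddings.
    return info_nce(rgb_emb, static_emb) + info_nce(rgb_emb, dynamic_emb)
```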
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
We propose a self-supervised method to learn feature representations from
videos. A standard approach in traditional self-supervised methods uses
positive-negative data pairs to train with a contrastive learning strategy. In
such a case, different modalities of the same video are treated as positives
and video clips from a different video are treated as negatives. Because the
spatio-temporal information is important for video representation, we extend
the negative samples by introducing intra-negative samples, which are
transformed from the same anchor video by breaking temporal relations in video
clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train
spatio-temporal convolutional networks to learn video representations. There
are many flexible options in our IIC framework and we conduct experiments by
using several different configurations. Evaluations are conducted on video
retrieval and video recognition tasks using the learned video representation.
Our proposed IIC outperforms current state-of-the-art results by a large
margin, with improvements of 16.7 and 9.5 percentage points in top-1 accuracy
for video retrieval on the UCF101 and HMDB51 datasets, respectively. For video
recognition, improvements can also be obtained on these two benchmark
datasets. Code is available at
https://github.com/BestJuly/Inter-intra-video-contrastive-learning.
Comment: Accepted by ACMMM 2020. Our project page is at https://bestjuly.github.io/Inter-intra-video-contrastive-learning
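The intra-negative construction can be illustrated as follows (a sketch under my own assumptions, not the released IIC code): breaking the temporal order of the anchor clip yields a hard negative that shares appearance but not motion, and it is appended to the usual inter-video negatives.

```python
import torch
import torch.nn.functional as F

def intra_negative(clip):
    """clip: (T, C, H, W) -> same frames in a random temporal order (motion broken)."""
    return clip[torch.randperm(clip.shape[0])]

def iic_loss(anchor, positive, inter_neg, intra_neg, temperature=0.07):
    """All inputs: (N, D) embeddings. The positive is another modality/view of the
    same video; negatives are other videos plus the temporally shuffled anchors."""
    a = F.normalize(anchor, dim=1)
    pos = (a * F.normalize(positive, dim=1)).sum(1, keepdim=True)         # (N, 1)
    neg_inter = a @ F.normalize(inter_neg, dim=1).t()                     # (N, N)
    neg_intra = (a * F.normalize(intra_neg, dim=1)).sum(1, keepdim=True)  # (N, 1)
    logits = torch.cat([pos, neg_inter, neg_intra], dim=1) / temperature
    targets = torch.zeros(a.shape[0], dtype=torch.long, device=a.device)  # positive at index 0
    return F.cross_entropy(logits, targets)
```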
Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
This paper proposes a novel pretext task to address the self-supervised video
representation learning problem. Specifically, given an unlabeled video clip,
we compute a series of spatio-temporal statistical summaries, such as the
spatial location and dominant direction of the largest motion, the spatial
location and dominant color of the largest color diversity along the temporal
axis, etc. Then a neural network is built and trained to yield the statistical
summaries given the video frames as inputs. In order to alleviate the learning
difficulty, we employ several spatial partitioning patterns to encode rough
spatial locations instead of exact spatial Cartesian coordinates. Our approach
is inspired by the observation that the human visual system is sensitive to
rapidly changing contents in the visual field and only needs impressions about
rough
spatial locations to understand the visual contents. To validate the
effectiveness of the proposed approach, we conduct extensive experiments with
four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results
show that our approach outperforms the existing approaches across these
backbone networks on four downstream video analysis tasks including action
recognition, video retrieval, dynamic scene recognition, and action similarity
labeling. The source code is publicly available at https://github.com/laura-wang/video_repres_sts.
Comment: Accepted by TPAMI. An extension of our previous work at arXiv:1904.0359
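One such statistical summary can be sketched as follows (a toy simplification with an assumed grid size, not the authors' code): frame differences give a motion-energy map, and the coarse grid cell with the largest accumulated motion becomes the prediction target instead of exact coordinates.

```python
import torch
import torch.nn.functional as F

def largest_motion_block(clip, grid=3):
    """clip: (T, C, H, W). Return the index (0..grid*grid-1) of the coarse grid
    cell with the largest accumulated motion magnitude."""
    energy = (clip[1:] - clip[:-1]).abs().sum(dim=(0, 1))          # (H, W) motion energy
    pooled = F.adaptive_avg_pool2d(energy[None, None], (grid, grid))[0, 0]
    return int(pooled.flatten().argmax())
```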