Learning Spatio-Temporal Representation with Local and Global Diffusion
Convolutional Neural Networks (CNNs) have been regarded as a powerful class of
models for visual recognition problems. Nevertheless, the convolutional filters
in these networks are local operations that ignore long-range dependencies.
This drawback becomes even more pronounced for video recognition, since video
is an information-intensive medium with complex
temporal variations. In this paper, we present a novel framework to boost the
spatio-temporal representation learning by Local and Global Diffusion (LGD).
Specifically, we construct a novel neural network architecture that learns the
local and global representations in parallel. The architecture is composed of
LGD blocks, where each block updates local and global features by modeling the
diffusions between these two representations. The diffusions effectively
exchange the localized and holistic aspects of information, enabling a more
powerful way of representation learning. Furthermore, a kernelized classifier
is introduced to combine the representations from the two aspects for video
recognition. Our LGD networks achieve clear improvements on the large-scale
Kinetics-400 and Kinetics-600 video classification datasets, outperforming the
best competitors by 3.5% and 0.7%, respectively. We further examine the
generalization of both the global and local
representations produced by our pre-trained LGD networks on four different
benchmarks for video action recognition and spatio-temporal action detection
tasks. Superior performances over several state-of-the-art techniques on these
benchmarks are reported. Code is available at:
https://github.com/ZhaofanQiu/local-and-global-diffusion-networks.
Comment: CVPR 2019
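As an illustration of the diffusion idea, the following is a minimal PyTorch-style
sketch of a block that keeps a local feature map and a global vector in parallel
and lets each update the other. The layer choices and the exact coupling are
assumptions made for demonstration, not the authors' published formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LGDBlockSketch(nn.Module):
        """Toy local-global diffusion block (illustrative, not the paper's design)."""
        def __init__(self, channels):
            super().__init__()
            self.local_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.g2l = nn.Linear(channels, channels)  # diffuse global state into the local path
            self.l2g = nn.Linear(channels, channels)  # diffuse pooled local evidence into the global path
            self.g2g = nn.Linear(channels, channels)

        def forward(self, local_feat, global_feat):
            # local_feat: (B, C, T, H, W) feature map; global_feat: (B, C) holistic vector.
            g_inject = self.g2l(global_feat)[:, :, None, None, None]
            local_out = F.relu(local_feat + self.local_conv(local_feat) + g_inject)
            pooled = local_out.mean(dim=(2, 3, 4))                 # (B, C) summary of the local path
            global_out = F.relu(self.g2g(global_feat) + self.l2g(pooled))
            return local_out, global_out

    # Usage: initialize the global state from pooled local features and stack blocks.
    block = LGDBlockSketch(channels=64)
    local0 = torch.randn(2, 64, 8, 14, 14)
    global0 = local0.mean(dim=(2, 3, 4))
    local1, global1 = block(local0, global0)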
ConViViT -- A Deep Neural Network Combining Convolutions and Factorized Self-Attention for Human Activity Recognition
The Transformer architecture has gained significant popularity in computer
vision tasks due to its capacity to generalize and capture long-range
dependencies. This characteristic makes it well-suited for generating
spatiotemporal tokens from videos. On the other hand, convolutions serve as the
fundamental backbone for processing images and videos, as they efficiently
aggregate information within small local neighborhoods to create spatial tokens
that describe the spatial dimension of a video. While both CNN-based
architectures and pure transformer architectures are extensively studied and
utilized by researchers, the effective combination of these two backbones has
not received comparable attention in the field of activity recognition. In this
research, we propose a novel approach that leverages the strengths of both CNNs
and Transformers in a hybrid architecture for performing activity recognition
using RGB videos. Specifically, we suggest employing a CNN to enhance
the video representation by generating a 128-channel video that effectively
separates the human performing the activity from the background. Subsequently,
the output of the CNN module is fed into a transformer to extract
spatiotemporal tokens, which are then used for classification purposes. Our
architecture has achieved new SOTA results with 90.05%, 99.6%, and 95.09% on
HMDB51, UCF101, and ETRI-Activity3D, respectively.
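The described pipeline, a frame-level CNN followed by a transformer over
spatiotemporal tokens, can be sketched roughly as follows in PyTorch. The small
CNN, token layout, and head sizes here are illustrative assumptions rather than
the paper's configuration.

    import torch
    import torch.nn as nn

    class ConvThenTransformerSketch(nn.Module):
        """Toy CNN-then-Transformer classifier over RGB video clips."""
        def __init__(self, num_classes, embed_dim=128, num_heads=4, depth=2):
            super().__init__()
            # Small frame-level CNN producing an embed_dim-channel representation.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                               batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
            self.cls_head = nn.Linear(embed_dim, num_classes)

        def forward(self, video):
            # video: (B, T, 3, H, W) RGB clip.
            b, t, c, h, w = video.shape
            feat = self.cnn(video.reshape(b * t, c, h, w))         # (B*T, D, H', W')
            d, hp, wp = feat.shape[1:]
            # Flatten space and time into one sequence of spatiotemporal tokens.
            tokens = feat.reshape(b, t, d, hp, wp).permute(0, 1, 3, 4, 2).reshape(b, -1, d)
            encoded = self.transformer(tokens)                     # self-attention over all tokens
            return self.cls_head(encoded.mean(dim=1))              # average-pool, then classify

    model = ConvThenTransformerSketch(num_classes=51)              # e.g. HMDB51 has 51 classes
    logits = model(torch.randn(2, 8, 3, 64, 64))                   # (2, 51)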
Human Action Recognition Using Multi-Stream Fusion and Hybrid Deep Neural Networks
Action Recognition in videos is a topic of interest in the area of computer vision, due to potential applications such as multimedia indexing and surveillance in public areas. In this research, we first propose spatial and temporal Convolutional Neural Networks (CNNs), based on transfer learning using ResNet101, GoogleNet and VGG16, for undertaking human action recognition. Besides that, hybrid networks such as CNN-Recurrent Neural Network (RNN) models are also exploited as encoder-decoder architectures for video action classification. In particular, different types of RNNs, such as Long Short-Term Memory (LSTM), Bidirectional-LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Bidirectional-GRU (BiGRU), are exploited as the decoders for action recognition. To further enhance performance, diverse aggregation networks of CNN and CNN-RNN models are implemented. Specifically, an Average Fusion method is used to integrate spatial and temporal CNNs trained on images, as well as CNN-RNNs trained on videos, where the final classification is formed by combining the Softmax scores of these models via late fusion. A total of 22 models (1 motion CNN, 3 spatial CNNs, 12 CNN-RNNs and 6 fusion networks) are implemented and evaluated on the UCF11, UCF50, and UCF101 datasets for performance comparison. The empirical results indicate that the Average Fusion of multiple spatial CNNs with one motion CNN, as well as ResNet101-BiGRU, are the most effective among all the networks for realistic video action recognition.
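The late-fusion step can be made concrete with a short PyTorch sketch that
averages per-stream softmax scores to produce the final prediction. The random
logits below simply stand in for the trained spatial CNN, motion CNN, and
CNN-RNN streams and are an assumption for illustration only.

    import torch
    import torch.nn.functional as F

    def average_fusion(logits_per_stream):
        """Average per-stream softmax scores and return the fused class scores."""
        probs = [F.softmax(logits, dim=-1) for logits in logits_per_stream]
        return torch.stack(probs, dim=0).mean(dim=0)

    # Example with three streams producing logits over 101 classes (e.g. UCF101).
    spatial_logits = torch.randn(4, 101)
    motion_logits = torch.randn(4, 101)
    cnn_rnn_logits = torch.randn(4, 101)
    fused = average_fusion([spatial_logits, motion_logits, cnn_rnn_logits])
    prediction = fused.argmax(dim=-1)   # final class per video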
Learn to cycle: Time-consistent feature discovery for action recognition
Generalizing over temporal variations is a prerequisite for effective action
recognition in videos. Despite significant advances in deep neural networks, it
remains a challenge to focus on short-term discriminative motions in relation
to the overall performance of an action. We address this challenge by allowing
some flexibility in discovering relevant spatio-temporal features. We introduce
Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs
whose activations remain similar under potential temporal variations. We
implement this
idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics,
in conjunction with a temporal gate that is responsible for evaluating the
consistency of the discovered dynamics and the modeled features. We show
consistent improvement when using SRTG blocks, with only a minimal increase in
the number of GFLOPs. On Kinetics-700, we perform on par with current
state-of-the-art models, and outperform them on HACS, Moments in Time, UCF-101,
and HMDB-51.
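The following is a rough PyTorch sketch of the idea behind such a block:
spatially squeezed features are passed through an LSTM to model their dynamics,
and a temporal gate decides, based on how consistent those dynamics are with the
features, whether to modulate the block output. The cosine-agreement threshold
and the modulation form are simplifying assumptions, not the published design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SRTGSketch(nn.Module):
        """Toy squeeze-and-recursion temporal gate over a 3D-CNN feature map."""
        def __init__(self, channels, threshold=0.5):
            super().__init__()
            self.lstm = nn.LSTM(channels, channels, batch_first=True)
            self.threshold = threshold

        def forward(self, x):
            # x: (B, C, T, H, W) output of a 3D convolution block.
            squeezed = x.mean(dim=(3, 4)).transpose(1, 2)          # (B, T, C) per-frame descriptors
            dynamics, _ = self.lstm(squeezed)                      # recursion over time
            # Temporal gate: only modulate when the discovered dynamics agree
            # with the squeezed features (a simple consistency check).
            agreement = F.cosine_similarity(dynamics, squeezed, dim=-1).mean(dim=1)  # (B,)
            gate = (agreement > self.threshold).float()[:, None, None, None, None]
            excitation = torch.sigmoid(dynamics).transpose(1, 2)[:, :, :, None, None]
            return x * (1 - gate) + x * excitation * gate

    srtg = SRTGSketch(channels=64)
    out = srtg(torch.randn(2, 64, 8, 14, 14))                      # same shape as the input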
SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning
A steady momentum of innovations and breakthroughs has convincingly pushed
the limits of unsupervised image representation learning. Compared to static 2D
images, video has one more dimension (time). The inherent supervision existing
in such sequential structure offers a fertile ground for building unsupervised
learning models. In this paper, we explore a trilogy of basic and generic
supervision in the sequence from the spatial, spatiotemporal, and sequential
perspectives. We materialize the supervisory signals through determining
whether a pair of samples is from one frame or from one video, and whether a
triplet of samples is in the correct temporal order. We uniquely regard the
signals as the foundation in contrastive learning and derive a particular form
named Sequence Contrastive Learning (SeCo). SeCo shows superior results under
the linear protocol on action recognition (Kinetics), untrimmed activity
recognition (ActivityNet) and object tracking (OTB-100). More remarkably, SeCo
demonstrates considerable improvements over recent unsupervised pre-training
techniques, and improves accuracy by 2.96% and 6.47% over fully-supervised
ImageNet pre-training on the action recognition task for UCF101 and HMDB51,
respectively. Source code is available at
https://github.com/YihengZhang-CV/SeCo-Sequence-Contrastive-Learning.
Comment: AAAI 2021
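A minimal sketch of how such sequence-level signals can be turned into training
objectives is shown below: an InfoNCE-style contrastive term for frame/video
pairs, plus a binary temporal-order term for sampled triplets. The embedding
size and the way negatives are drawn are illustrative assumptions, not SeCo's
precise setup.

    import torch
    import torch.nn.functional as F

    def info_nce(query, positive, negatives, temperature=0.1):
        """Contrastive loss: pull each query toward its positive, away from negatives."""
        query, positive = F.normalize(query, dim=-1), F.normalize(positive, dim=-1)
        negatives = F.normalize(negatives, dim=-1)
        pos = (query * positive).sum(-1, keepdim=True) / temperature   # (B, 1)
        neg = query @ negatives.t() / temperature                      # (B, K)
        logits = torch.cat([pos, neg], dim=1)
        return F.cross_entropy(logits, torch.zeros(len(query), dtype=torch.long))

    def temporal_order_loss(order_logits, is_correct_order):
        """Binary task: does a sampled frame triplet follow the correct temporal order?"""
        return F.binary_cross_entropy_with_logits(order_logits, is_correct_order)

    # Toy usage with random embeddings standing in for encoded frame crops.
    q = torch.randn(8, 128)        # anchor crops
    p = torch.randn(8, 128)        # crops from the same frame (or same video)
    n = torch.randn(64, 128)       # crops from other videos (negatives)
    loss = info_nce(q, p, n) + temporal_order_loss(
        torch.randn(8), torch.randint(0, 2, (8,)).float())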
Multi-Temporal Convolutions for Human Action Recognition in Videos
Effective extraction of temporal patterns is crucial for the recognition of
temporally varying actions in video. We argue that the fixed-sized
spatio-temporal convolution kernels used in convolutional neural networks
(CNNs) can be improved to extract informative motions that are executed at
different time scales. To address this challenge, we present a novel
spatio-temporal convolution block that is capable of extracting spatio-temporal
patterns at multiple temporal resolutions. Our proposed multi-temporal
convolution (MTConv) blocks utilize two branches that focus on brief and
prolonged spatio-temporal patterns, respectively. The extracted time-varying
features are aligned in a third branch with respect to global motion patterns
through recurrent cells. The proposed blocks are lightweight and can be
integrated into any 3D-CNN architecture, yielding a substantial reduction in
computational cost. Extensive experiments on Kinetics, Moments in
Time and HACS action recognition benchmark datasets demonstrate competitive
performance of MTConvs compared to the state-of-the-art with a significantly
lower computational footprint.
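The two-branch-plus-recurrent-alignment idea can be illustrated with a small
PyTorch sketch. The temporal kernel sizes, the GRU over pooled features, and the
gating fusion below are assumptions chosen for demonstration, not the published
MTConv configuration.

    import torch
    import torch.nn as nn

    class MTConvSketch(nn.Module):
        """Toy multi-temporal block with brief, prolonged, and recurrent branches."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # Branch for brief motions: short temporal extent.
            self.brief = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=(1, 1, 1))
            # Branch for prolonged motions: longer temporal extent.
            self.prolonged = nn.Conv3d(in_ch, out_ch, kernel_size=(7, 3, 3), padding=(3, 1, 1))
            # Recurrent branch aligning both with global motion patterns over time.
            self.gru = nn.GRU(2 * out_ch, out_ch, batch_first=True)

        def forward(self, x):
            # x: (B, C_in, T, H, W)
            short_feat = self.brief(x)                              # (B, C_out, T, H, W)
            long_feat = self.prolonged(x)                           # (B, C_out, T, H, W)
            pooled = torch.cat([short_feat, long_feat], dim=1).mean(dim=(3, 4))  # (B, 2*C_out, T)
            context, _ = self.gru(pooled.transpose(1, 2))           # (B, T, C_out)
            gate = torch.sigmoid(context).transpose(1, 2)[:, :, :, None, None]
            return (short_feat + long_feat) * gate                  # time-varying reweighting

    block = MTConvSketch(in_ch=32, out_ch=64)
    out = block(torch.randn(2, 32, 8, 14, 14))                      # (2, 64, 8, 14, 14)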