Learning Video Representations from Correspondence Proposals
Correspondences between frames encode rich information about dynamic content
in videos. However, it is challenging to capture and learn them effectively
due to their irregular structure and complex dynamics. In this paper, we
propose a novel neural network that learns video representations by aggregating
information from potential correspondences. This network, named CPNet, can
learn evolving 2D fields with temporal consistency. In particular, it can
effectively learn representations for videos by mixing appearance and
long-range motion with an RGB-only input. We provide extensive ablation
experiments to validate our model. CPNet shows stronger performance than
existing methods on Kinetics and achieves the state-of-the-art performance on
Something-Something and Jester. We analyze the behavior of our model and show
its robustness to errors in proposals.
Comment: CVPR 2019 (Oral)
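As a rough illustration of the aggregation idea above (a sketch only, not the authors' CP module; picking proposals as nearest neighbours in feature space across other frames is an assumption made here):

    import torch

    def aggregate_correspondences(feat: torch.Tensor, k: int = 8) -> torch.Tensor:
        # feat: (T, N, C) -- N positions per frame, T frames, C feature channels.
        # Each position proposes k correspondences among positions of OTHER frames
        # and mixes their features into its own representation, combining
        # appearance with a long-range motion cue from RGB features alone.
        t, n, c = feat.shape
        flat = feat.reshape(t * n, c)
        dist = torch.cdist(flat, flat)                               # (TN, TN)
        frame_id = torch.arange(t).repeat_interleave(n)
        dist[frame_id[:, None] == frame_id[None, :]] = float("inf")  # ban same-frame pairs
        idx = dist.topk(k, largest=False).indices                    # (TN, k) proposals
        return (flat + flat[idx].mean(dim=1)).reshape(t, n, c)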
Adaptive Interaction Modeling via Graph Operations Search
Interaction modeling is important for video action analysis. Recently,
several works design specific structures to model interactions in videos.
However, these structures are manually designed and non-adaptive, which
requires structure design effort and, more importantly, cannot model
interactions adaptively. In this paper, we automate structure design to
learn adaptive structures for interaction modeling. We propose to search the
network structures with a differentiable architecture search mechanism, which
learns to construct adaptive structures for different videos to facilitate
adaptive interaction modeling. To this end, we first design the search space
with several basic graph operations that explicitly capture different relations
in videos. We experimentally demonstrate that our architecture search framework
learns to construct adaptive interaction modeling structures, which offers more
insight into the relation between structures and interaction characteristics and
removes the need for manual structure design. Additionally, we show that the
designed basic graph operations
in the search space are able to model different interactions in videos. The
experiments on two interaction datasets show that our method achieves
performance competitive with state-of-the-art methods.
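To make the differentiable search mechanism concrete, below is a minimal DARTS-style sketch of a softmax-weighted mixture over candidate graph operations; the candidate list and feature sizes are placeholders, not the operations defined in the paper.

    import torch
    import torch.nn as nn

    class MixedGraphOp(nn.Module):
        # Continuous relaxation over a set of candidate operations: architecture
        # weights `alpha` are learned jointly with the network, and after the
        # search only the highest-weighted operation is kept.
        def __init__(self, candidate_ops):
            super().__init__()
            self.ops = nn.ModuleList(candidate_ops)
            self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

        def forward(self, x):
            w = torch.softmax(self.alpha, dim=0)
            return sum(wi * op(x) for wi, op in zip(w, self.ops))

    # hypothetical usage with stand-in operations on 256-D node features
    mixed = MixedGraphOp([nn.Identity(), nn.Linear(256, 256), nn.Linear(256, 256)])
    out = mixed(torch.randn(8, 256))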
V4D: 4D Convolutional Neural Networks for Video-level Representation Learning
Most existing 3D CNNs for video representation learning are clip-based
methods, and thus do not consider video-level temporal evolution of
spatio-temporal features. In this paper, we propose Video-level 4D
Convolutional Neural Networks, referred to as V4D, to model the evolution of
long-range spatio-temporal representation with 4D convolutions, and at the same
time, to preserve strong 3D spatio-temporal representation with residual
connections. Specifically, we design a new 4D residual block able to capture
inter-clip interactions, which could enhance the representation power of the
original clip-level 3D CNNs. The 4D residual blocks can be easily integrated
into the existing 3D CNNs to perform long-range modeling hierarchically. We
further introduce the training and inference methods for the proposed V4D.
Extensive experiments are conducted on three video recognition benchmarks,
where V4D achieves excellent results, surpassing recent 3D CNNs by a large
margin.
Comment: To appear in ICLR 2020
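A minimal sketch of the inter-clip idea, assuming clip-level features of shape (N, C, U, T, H, W) and a 4D kernel with 1x1 spatial extent, so the (clip, time) convolution can be realised as a 2D convolution with H and W folded into the batch; this is an illustration, not the paper's exact block.

    import torch
    import torch.nn as nn

    class Clip4DResidual(nn.Module):
        # Residual block over clip features: a convolution across the (clip, time)
        # axes models long-range, video-level evolution, while the residual
        # connection preserves the original clip-level 3D representation.
        def __init__(self, channels: int, k_clip: int = 3, k_time: int = 3):
            super().__init__()
            self.conv_ut = nn.Conv2d(channels, channels,
                                     kernel_size=(k_clip, k_time),
                                     padding=(k_clip // 2, k_time // 2), bias=False)
            self.bn = nn.BatchNorm2d(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            n, c, u, t, h, w = x.shape
            y = x.permute(0, 4, 5, 1, 2, 3).reshape(n * h * w, c, u, t)
            y = self.bn(self.conv_ut(y))                 # convolve over (clip, time)
            y = y.reshape(n, h, w, c, u, t).permute(0, 3, 4, 5, 1, 2)
            return torch.relu(x + y)                     # keep the clip-level features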
Learning Efficient Video Representation with Video Shuffle Networks
3D CNNs have shown strong ability in learning spatiotemporal representations in
recent video recognition tasks. However, inflating 2D convolution to 3D
inevitably introduces additional computational costs, making it cumbersome in
practical deployment. We consider whether there is a way to equip the
conventional 2D convolution with temporal vision without expanding its
kernel. To this end, we propose the video shuffle, a parameter-free plug-in
component that efficiently reallocates the inputs of 2D convolution so that its
receptive field can be extended to the temporal dimension. In practice, video
shuffle first divides each frame's features into multiple groups and then
aggregates the grouped features via a temporal shuffle operation. This allows the
following 2D convolution to aggregate global spatiotemporal features. The
proposed video shuffle can be flexibly inserted into popular 2D CNNs, forming
the Video Shuffle Networks (VSN). With a simple yet efficient implementation,
VSN performs surprisingly well on temporal modeling benchmarks. In experiments,
VSN not only gains non-trivial improvements on Kinetics and Moments in Time,
but also achieves state-of-the-art performance on the Something-Something-V1
and Something-Something-V2 datasets.
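One plausible reading of the parameter-free shuffle, sketched below; the exact grouping rule used in the paper may differ.

    import torch

    def video_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
        # x: (N, T, C, H, W). Split each frame's C channels into `groups` chunks,
        # then swap the frame and group axes, so every output frame ends up with
        # channel chunks drawn from several input frames. A plain 2D convolution
        # applied afterwards therefore sees temporal information at no extra cost.
        n, t, c, h, w = x.shape
        assert c % groups == 0, "channel count must be divisible by groups"
        x = x.view(n, t, groups, c // groups, h, w)
        x = x.transpose(1, 2).contiguous()      # (N, groups, T, C // groups, H, W)
        return x.view(n, t, c, h, w)            # regroup into per-frame features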
SmallBigNet: Integrating Core and Contextual Views for Video Classification
Temporal convolution has been widely used for video classification. However,
it is performed on spatio-temporal contexts in a limited view, which often
weakens its capacity of learning video representation. To alleviate this
problem, we propose a concise and novel SmallBig network, with the cooperation
of small and big views. For the current time step, the small view branch is
used to learn the core semantics, while the big view branch is used to capture
the contextual semantics. Unlike traditional temporal convolution, the big view
branch can provide the small view branch with the most activated video features
from a broader 3D receptive field. Via aggregating such big-view contexts, the
small view branch can learn more robust and discriminative spatio-temporal
representations for video classification. Furthermore, we propose to share
convolution between the small and big view branches, which improves model
compactness and alleviates overfitting. As a result, our SmallBigNet achieves a
model size comparable to that of 2D CNNs, while boosting accuracy like 3D CNNs. We
conduct extensive experiments on the large-scale video benchmarks, e.g.,
Kinetics400, Something-Something V1 and V2. Our SmallBig network outperforms a
number of recent state-of-the-art approaches, in terms of accuracy and/or
efficiency. The code and models will be available at
https://github.com/xhl-video/SmallBigNet.
Comment: CVPR 2020
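An illustrative two-view unit along these lines (a sketch, not the released code): the big view is realised here as a 3x3x3 max pooling that feeds the shared convolution, and the two views are fused by addition.

    import torch
    import torch.nn as nn

    class SmallBigUnit(nn.Module):
        # Small view: spatial convolution at the current time step (core semantics).
        # Big view: max pooling over a broader 3D neighbourhood picks the most
        # activated context, then reuses the SAME convolution weights, which keeps
        # the unit compact and helps against overfitting.
        def __init__(self, channels: int):
            super().__init__()
            self.shared_conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                         padding=(0, 1, 1), bias=False)
            self.bn = nn.BatchNorm3d(channels)
            self.big_pool = nn.MaxPool3d(kernel_size=3, stride=1, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (N, C, T, H, W)
            small = self.shared_conv(x)
            big = self.shared_conv(self.big_pool(x))
            return torch.relu(self.bn(small + big))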
G-TAD: Sub-Graph Localization for Temporal Action Detection
Temporal action detection is a fundamental yet challenging task in video
understanding. Video context is a critical cue to effectively detect actions,
but current works mainly focus on temporal context, while neglecting semantic
context as well as other important context properties. In this work, we propose
a graph convolutional network (GCN) model to adaptively incorporate multi-level
semantic context into video features and cast temporal action detection as a
sub-graph localization problem. Specifically, we formulate video snippets as
graph nodes, snippet-snippet correlations as edges, and actions associated with
context as target sub-graphs. With graph convolution as the basic operation, we
design a GCN block called GCNeXt, which learns the features of each node by
aggregating its context and dynamically updates the edges in the graph. To
localize each sub-graph, we also design an SGAlign layer to embed each
sub-graph into the Euclidean space. Extensive experiments show that G-TAD is
capable of finding effective video context without extra supervision and
achieves state-of-the-art performance on two detection benchmarks. On
ActivityNet-1.3, it obtains an average mAP of 34.09%; on THUMOS14, it reaches
51.6% at IoU@0.5 when combined with a proposal processing method. G-TAD code is
publicly available at https://github.com/frostinassiky/gtad.
Comment: Accepted by CVPR 2020. 8 pages, 9 figures, 2-page appendix
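A toy graph-convolution step in this spirit, with edges rebuilt dynamically from the current snippet features; only the semantic (k-nearest-neighbour) edges are sketched here, and GCNeXt and SGAlign themselves are considerably richer.

    import torch

    def dynamic_graph_conv(x, w_self, w_neigh, k=4):
        # x: (T, C) snippet features; w_self, w_neigh: (C, C_out) weight matrices.
        # Each snippet connects to its k nearest neighbours in feature space
        # (semantic context); node features are updated from the node itself plus
        # the mean of its neighbours, and the edges change as the features evolve.
        dist = torch.cdist(x, x)                                  # (T, T) distances
        knn = dist.topk(k + 1, largest=False).indices[:, 1:]      # drop the self edge
        neigh = x[knn].mean(dim=1)                                # aggregated context
        return torch.relu(x @ w_self + neigh @ w_neigh)

    # hypothetical usage: 100 snippets with 256-D features
    feats = torch.randn(100, 256)
    out = dynamic_graph_conv(feats, torch.randn(256, 256), torch.randn(256, 256))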
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
Convolutional operations have two limitations: (1) they do not explicitly model
where to focus, since the same filter is applied to all positions, and (2) they
are unsuitable for modeling long-range dependencies, since they only operate on a
small neighborhood. While both limitations can be alleviated by attention operations,
many design choices remain to be determined to use attention, especially when
applying attention to videos. Towards a principled way of applying attention to
videos, we address the task of spatiotemporal attention cell search. We propose
a novel search space for spatiotemporal attention cells, which allows the
search algorithm to flexibly explore various design choices in the cell. The
discovered attention cells can be seamlessly inserted into existing backbone
networks, e.g., I3D or S3D, and improve video classification accuracy by more
than 2% on both Kinetics-600 and MiT datasets. The discovered attention cells
outperform non-local blocks on both datasets, and demonstrate strong
generalization across different modalities, backbones, and datasets. Inserting
our attention cells into I3D-R50 yields state-of-the-art performance on both
datasets.
Comment: ECCV 2020
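For reference, a barebones spatiotemporal self-attention operation of the kind such a search space is built from (illustrative only, not one of the discovered cells).

    import torch

    def spacetime_attention(x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W). Every position attends to every other position across
        # space and time, providing exactly the long-range modelling and dynamic
        # focus that a fixed, local convolution kernel cannot.
        n, c, t, h, w = x.shape
        q = x.reshape(n, c, -1)                                         # (N, C, THW)
        attn = torch.softmax(q.transpose(1, 2) @ q / c ** 0.5, dim=-1)  # (N, THW, THW)
        out = (q @ attn.transpose(1, 2)).reshape(n, c, t, h, w)
        return x + out          # residual form, so it can be inserted into a backbone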
Causal Contextual Prediction for Learned Image Compression
Over the past several years, we have witnessed impressive progress in the
field of learned image compression. Recent learned image codecs are commonly
based on autoencoders that first encode an image into low-dimensional latent
representations and then decode them for reconstruction purposes. To capture
spatial dependencies in the latent space, prior works exploit a hyperprior and a
spatial context model to build an entropy model, which estimates the bit-rate
for end-to-end rate-distortion optimization. However, such an entropy model is
suboptimal from two aspects: (1) It fails to capture spatially global
correlations among the latents. (2) Cross-channel relationships of the latents
are still underexplored. In this paper, we propose the concept of separate
entropy coding to leverage a serial decoding process for causal contextual
entropy prediction in the latent space. A causal context model is proposed that
separates the latents across channels and makes use of cross-channel
relationships to generate highly informative contexts. Furthermore, we propose
a causal global prediction model, which is able to find global reference points
for accurate predictions of unknown points. Both models facilitate entropy
estimation without transmitting any overhead. In addition, we
further adopt a new separate attention module to build more powerful transform
networks. Experimental results demonstrate that our full image compression
model outperforms the standard VVC/H.266 codec on the Kodak dataset in terms of
both PSNR and MS-SSIM, yielding state-of-the-art rate-distortion performance.
Comment: We add some descriptions of the improved quantization in the latest
arXiv version.
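A rough sketch of channel-wise causal prediction under simplifying assumptions (a Gaussian entropy model and equal channel chunks); the paper's causal context and global prediction models are richer than this.

    import torch
    import torch.nn as nn

    class ChannelCausalContext(nn.Module):
        # Latents y (N, C, H, W) are split into channel chunks coded serially:
        # the entropy parameters (mu, sigma) of chunk i are predicted from the
        # hyperprior features plus all previously decoded chunks, so no extra
        # side information needs to be transmitted.
        def __init__(self, channels: int, hyper_channels: int, groups: int = 4):
            super().__init__()
            self.groups, self.chunk = groups, channels // groups
            self.predictors = nn.ModuleList([
                nn.Conv2d(hyper_channels + i * self.chunk, 2 * self.chunk, 1)
                for i in range(groups)
            ])

        def forward(self, y: torch.Tensor, hyper: torch.Tensor):
            params, decoded = [], hyper
            for i, chunk in enumerate(y.chunk(self.groups, dim=1)):
                mu, sigma = self.predictors[i](decoded).chunk(2, dim=1)
                params.append((mu, nn.functional.softplus(sigma)))
                decoded = torch.cat([decoded, chunk], dim=1)   # causal: past chunks only
            return params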
Recent Progress in Appearance-based Action Recognition
Action recognition, which is formulated as a task to identify various human
actions in a video, has attracted increasing interest from computer vision
researchers due to its importance in various applications. Recently,
appearance-based methods have achieved promising progress towards accurate
action recognition. In general, these methods mainly fulfill the task by
applying various schemes to model spatial and temporal visual information
effectively. To better understand the current progress of appearance-based
action recognition, we provide a comprehensive review of recent achievements in
this area. In particular, we summarise and discuss several dozen related
research papers, which can be roughly divided into four categories according to
different appearance modelling strategies. The obtained categories include 2D
convolutional methods, 3D convolutional methods, motion representation-based
methods, and context representation-based methods. We comprehensively analyse
and discuss representative methods from each category. Empirical results
are also summarised to better illustrate cutting-edge algorithms. We conclude
by identifying important areas for future research gleaned from our
categorisation.