VideoGraph: Recognizing Minutes-Long Human Activities in Videos
Many human activities take minutes to unfold. To represent them, related
works opt for statistical pooling, which neglects the temporal structure.
Others opt for convolutional methods, such as CNN and Non-Local. While successful in
learning temporal concepts, they fall short of modeling minutes-long temporal
dependencies. We propose VideoGraph, a method to achieve the best of two
worlds: represent minutes-long human activities and learn their underlying
temporal structure. VideoGraph learns a graph-based representation for human
activities. The graph, its nodes and edges are learned entirely from video
datasets, making VideoGraph applicable to problems without node-level
annotation. This yields improvements over related works on two benchmarks:
Epic-Kitchen and Breakfast. Besides, we demonstrate that VideoGraph is able to
learn the temporal structure of human activities in minutes-long videos.
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Despite the steady progress in video analysis led by the adoption of
convolutional neural networks (CNNs), the relative improvement has been less
drastic than that in 2D static image classification. Three main challenges
exist: spatial (image) feature representation, temporal information
representation, and model/computation complexity. It was recently shown by
Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained
on ImageNet, could be a promising way for spatial and temporal representation
learning. However, as for model/computation complexity, 3D CNNs are much more
expensive than 2D CNNs and prone to overfit. We seek a balance between speed
and accuracy by building an effective and efficient video classification system
through systematic exploration of critical network design choices. In
particular, we show that it is possible to replace many of the 3D convolutions
by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed
and accuracy) is achieved when replacing the 3D convolutions at the bottom of
the network, suggesting that temporal representation learning on high-level
semantic features is more useful. Our conclusion generalizes to datasets with
very different properties. When combined with several other cost-effective
designs including separable spatial/temporal convolution and feature gating,
our system results in an effective video classification system that
produces very competitive results on several action classification benchmarks
(Kinetics, Something-something, UCF101 and HMDB), as well as two action
detection (localization) benchmarks (JHMDB and UCF101-24).
Comment: ECCV 2018 camera read
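The cost argument in the abstract above (3D convolutions being much more
expensive than 2D ones, and separable spatial/temporal convolution being a
cost-effective alternative) can be made concrete with a weight count. The
following is a minimal sketch, not the authors' code: the function names, the
64-channel 3x3x3 layer, and the choice of an intermediate width equal to the
output width are all illustrative assumptions.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a dense 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def conv2d_params(c_in, c_out, k_h, k_w):
    """Weight count of a 2D convolution applied frame by frame (bias ignored)."""
    return c_in * c_out * k_h * k_w


def separable_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a spatial (1 x k_h x k_w) convolution followed by a
    temporal (k_t x 1 x 1) convolution.

    Assumption: the intermediate channel width equals c_out.
    """
    spatial = c_in * c_out * k_h * k_w
    temporal = c_out * c_out * k_t
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, 3x3x3 kernel.
full_3d = conv3d_params(64, 64, 3, 3, 3)
plain_2d = conv2d_params(64, 64, 3, 3)
separable = separable_params(64, 64, 3, 3, 3)

print(full_3d, plain_2d, separable)  # 110592 36864 49152
```

For this hypothetical layer the 2D kernel carries one third of the weights of
the 3D kernel, and the separable pair carries 4/9 of them.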
Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video based person Re-Identification (ReID). A temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN network. The resulting M3D convolution network
introduces a fraction of parameters into the 2D CNN, but gains the ability of
multi-scale temporal feature learning. With this compact architecture, M3D
convolution network is also more efficient and easier to optimize than existing
3D convolution networks. The temporal stream further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from two streams are finally fused for the video
based person ReID. Evaluations on three widely used benchmark datasets, i.e.,
MARS, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of our
method over existing 3D convolution networks and state-of-the-art methods.
Comment: AAAI, 201
A Novel Apex-Time Network for Cross-Dataset Micro-Expression Recognition
The automatic recognition of micro-expression has been boosted ever since the
successful introduction of deep learning approaches. As researchers working on
such topics are moving to learn from the nature of micro-expression, the
practice of using deep learning techniques has evolved from processing the
entire video clip of micro-expression to the recognition on apex frame. Using
only the apex frame removes redundant video frames, but the relevant
temporal evidence of micro-expression is thereby left out. This paper
proposes a novel Apex-Time Network (ATNet) to recognize micro-expression based
on spatial information from the apex frame as well as on temporal information
from its adjacent frames. Through extensive experiments on three
benchmarks, we demonstrate the improvement achieved by learning such temporal
information. Specifically, the model with such temporal information is more robust
in cross-dataset validations.
Comment: 6 pages, 3 figures, 3 tables, code available, accepted in ACII 201
Collaborative Spatio-temporal Feature Learning for Video Action Recognition
Spatio-temporal feature learning is of central importance for action
recognition in videos. Existing deep neural network models either learn spatial
and temporal features independently (C2D) or jointly with unconstrained
parameters (C3D). In this paper, we propose a novel neural operation which
encodes spatio-temporal features collaboratively by imposing a weight-sharing
constraint on the learnable parameters. In particular, we perform 2D
convolution along three orthogonal views of volumetric video data, which learns
spatial appearance and temporal motion cues respectively. By sharing the
convolution kernels of different views, spatial and temporal features are
collaboratively learned and thus benefit from each other. The complementary
features are subsequently fused by a weighted summation whose coefficients are
learned end-to-end. Our approach achieves state-of-the-art performance on
large-scale benchmarks and won 1st place in the Moments in Time Challenge
2018. Moreover, based on the learned coefficients of different views, we are
able to quantify the contributions of spatial and temporal features. This
analysis sheds light on the interpretability of the model and may also guide the
future design of algorithms for video recognition.
Comment: CVPR 201
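The operation described in the abstract above (one 2D kernel shared across
three orthogonal views of the video volume, with the view responses fused by
a learned weighted sum) can be sketched in plain Python. This is an
illustrative toy, not the authors' implementation: it handles a single
channel, uses zero padding, and normalizes the fusion coefficients with a
softmax, all of which are assumptions. The names `conv_view`,
`collaborative_conv` and `VIEWS` are hypothetical.

```python
import math

# For each view, the (t, h, w) unit steps taken by the kernel's two axes.
VIEWS = {
    "H-W": ((0, 1, 0), (0, 0, 1)),  # spatial appearance
    "T-W": ((1, 0, 0), (0, 0, 1)),  # temporal view over time and width
    "T-H": ((1, 0, 0), (0, 1, 0)),  # temporal view over time and height
}


def get(vol, t, h, w):
    """Read a voxel, treating out-of-range indices as zero padding."""
    if 0 <= t < len(vol) and 0 <= h < len(vol[0]) and 0 <= w < len(vol[0][0]):
        return vol[t][h][w]
    return 0.0


def conv_view(vol, kernel, axes):
    """Apply one 3x3 kernel on the plane spanned by the two given axes."""
    a, b = axes
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for h in range(H):
            for w in range(W):
                acc = 0.0
                for i in (-1, 0, 1):
                    for j in (-1, 0, 1):
                        acc += kernel[i + 1][j + 1] * get(
                            vol,
                            t + i * a[0] + j * b[0],
                            h + i * a[1] + j * b[1],
                            w + i * a[2] + j * b[2],
                        )
                out[t][h][w] = acc
    return out


def softmax(logits):
    """Normalize coefficients so they are positive and sum to one."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def collaborative_conv(vol, kernel, logits):
    """Fuse the three view responses, all computed with the SAME kernel."""
    coeffs = softmax(logits)
    outs = [conv_view(vol, kernel, axes) for axes in VIEWS.values()]
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    return [[[sum(c * o[t][h][w] for c, o in zip(coeffs, outs))
              for w in range(W)] for h in range(H)] for t in range(T)]


# Toy input: a 3x3x3 volume of ones, an all-ones kernel, equal coefficients.
volume = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
kernel = [[1.0] * 3 for _ in range(3)]
fused = collaborative_conv(volume, kernel, [0.0, 0.0, 0.0])
print(fused[1][1][1], fused[0][0][0])
```

In this sketch the three views share nine kernel weights, whereas a dense
3x3x3 kernel would need 27. After training, the softmax coefficients could be
read off as the relative contribution of each view, which is the kind of
analysis the abstract mentions.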