Relational Network for Skeleton-Based Action Recognition
With the fast development of effective and low-cost human skeleton capture
systems, skeleton-based action recognition has attracted much attention
recently. Most existing methods use Convolutional Neural Network (CNN) and
Recurrent Neural Network (RNN) to extract spatio-temporal information embedded
in the skeleton sequences for action recognition. However, these approaches are
limited in their ability to model relations within a single skeleton, because
important structural information is lost when the raw skeleton data are
converted to fit the input format of a CNN or RNN. In this paper, we propose an
Attentional Recurrent Relational Network-LSTM (ARRN-LSTM) to simultaneously
model spatial configurations and temporal dynamics in skeletons for action
recognition. We introduce the Recurrent Relational Network to learn the spatial
features in a single skeleton, followed by a multi-layer LSTM to learn the
temporal features in the skeleton sequences. Between the two modules, we design
an adaptive attentional module to focus attention on the most discriminative
parts in the single skeleton. To exploit the complementarity from different
geometries in the skeleton for sufficient relational modeling, we design a
two-stream architecture to learn the structural features among joints and lines
simultaneously. Extensive experiments are conducted on several popular skeleton
datasets, and the results show that the proposed approach outperforms most
mainstream methods.
Comment: Accepted by the International Conference on Multimedia and Expo (ICME) 2019 as Oral
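To make the pattern above concrete, here is a minimal PyTorch sketch: a relation module over joint pairs produces a per-frame spatial feature, which a multi-layer LSTM then aggregates over time. All names and sizes (SpatialRelationModule, joint_dim, the 25-joint / 60-class defaults) are illustrative assumptions; the paper's recurrent relational network, adaptive attention module, and two-stream joint/line design are not reproduced here.

import torch
import torch.nn as nn

class SpatialRelationModule(nn.Module):
    """Aggregates pairwise joint relations within a single skeleton frame."""
    def __init__(self, joint_dim=3, hidden=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * joint_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, joints):                      # joints: (B, J, joint_dim)
        B, J, D = joints.shape
        a = joints.unsqueeze(2).expand(B, J, J, D)  # joint i in each (i, j) pair
        b = joints.unsqueeze(1).expand(B, J, J, D)  # joint j in each (i, j) pair
        pairs = torch.cat([a, b], dim=-1)
        return self.g(pairs).sum(dim=(1, 2))        # (B, hidden) per-frame feature

class SkeletonActionNet(nn.Module):
    """Per-frame relational features -> multi-layer LSTM over time -> class logits."""
    def __init__(self, num_classes=60, joint_dim=3, hidden=64, lstm_hidden=128):
        super().__init__()
        self.spatial = SpatialRelationModule(joint_dim, hidden)
        self.lstm = nn.LSTM(hidden, lstm_hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(lstm_hidden, num_classes)

    def forward(self, seq):                         # seq: (B, T, J, joint_dim)
        B, T, J, D = seq.shape
        frame_feats = self.spatial(seq.reshape(B * T, J, D)).reshape(B, T, -1)
        out, _ = self.lstm(frame_feats)
        return self.fc(out[:, -1])                  # classify from the last time step

# Example: 8 sequences, 30 frames, 25 joints with 3D coordinates.
# logits = SkeletonActionNet()(torch.randn(8, 30, 25, 3))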
Learning Discriminative Motion Features Through Detection
Despite huge success in the image domain, modern detection models such as
Faster R-CNN have not been used nearly as much for video analysis. This is
arguably due to the fact that detection models are designed to operate on
single frames and as a result do not have a mechanism for learning motion
representations directly from video. We propose a learning procedure that
allows detection models such as Faster R-CNN to learn motion features directly
from the RGB video data while being optimized with respect to a pose estimation
task. Given a pair of video frames, Frame A and Frame B, we force our model
to predict human pose in Frame A using the features from Frame B. We do so by
leveraging deformable convolutions across space and time. Our network learns to
spatially sample features from Frame B in order to maximize pose detection
accuracy in Frame A. This naturally encourages our network to learn motion
offsets encoding the spatial correspondences between the two frames. We refer
to these motion offsets as DiMoFs (Discriminative Motion Features).
In our experiments we show that our training scheme helps learn effective
motion cues, which can be used to estimate and localize salient human motion.
Furthermore, we demonstrate that as a byproduct, our model also learns features
that lead to improved pose detection in still images and better keypoint
tracking. Finally, we show how to leverage our learned model for the tasks of
spatiotemporal action localization and fine-grained action recognition.
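A rough sketch of the mechanism described above, assuming PyTorch and torchvision's deformable convolution: offsets are predicted from the pair of frame features, Frame B's features are deformably sampled with those offsets, and a pose head supervises the result on Frame A. The module name, channel sizes, and keypoint head are placeholders, not the released DiMoFs architecture.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MotionOffsetSampler(nn.Module):
    def __init__(self, channels=256, kernel_size=3, num_keypoints=17):
        super().__init__()
        # 2 (x, y) offset values per kernel tap, predicted per output location.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.sample_b = DeformConv2d(channels, channels, kernel_size,
                                     padding=kernel_size // 2)
        self.pose_head = nn.Conv2d(channels, num_keypoints, 1)  # keypoint heatmaps

    def forward(self, feat_a, feat_b):              # backbone features, (B, C, H, W)
        offsets = self.offset_pred(torch.cat([feat_a, feat_b], dim=1))
        warped_b = self.sample_b(feat_b, offsets)   # spatially resample Frame B
        heatmaps_a = self.pose_head(warped_b)       # predict pose in Frame A
        return heatmaps_a, offsets                  # offsets act as the motion cues

# Example: heatmaps, offsets = MotionOffsetSampler()(torch.randn(2, 256, 56, 56),
#                                                    torch.randn(2, 256, 56, 56))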
Space-Time Representation of People Based on 3D Skeletal Data: A Review
Spatiotemporal human representation based on 3D visual perception data is a
rapidly growing research area. Based on the information source, these
representations can be broadly categorized into two groups, using either RGB-D
information or 3D skeleton data. Recently, skeleton-based human representations
have been intensively studied and keep attracting increasing attention, due
to their robustness to variations in viewpoint, human body scale, and motion
speed, as well as their real-time, online performance. This paper presents a
comprehensive survey of existing space-time representations of people based on
3D skeletal data, and provides an informative categorization and analysis of
these methods from several perspectives, including information modality,
representation encoding, structure and transition, and feature engineering. We
also provide a brief overview of skeleton acquisition devices and construction
methods, list a number of public benchmark datasets with skeleton data, and
discuss potential future research directions.
Comment: Accepted by the journal Computer Vision and Image Understanding; see
http://www.sciencedirect.com/science/article/pii/S1077314217300279
A Simple Baseline for Audio-Visual Scene-Aware Dialog
The recently proposed audio-visual scene-aware dialog task paves the way to a
more data-driven way of learning virtual assistants, smart speakers and car
navigation systems. However, very little is known to date about how to
effectively extract meaningful information from a plethora of sensors that
pound the computational engine of those devices. Therefore, in this paper, we
provide and carefully analyze a simple baseline for audio-visual scene-aware
dialog which is trained end-to-end. Using an attention mechanism, our method
differentiates useful signals from distracting ones in a data-driven manner. We
evaluate the proposed approach on the recently introduced and challenging
audio-visual scene-aware dataset, and demonstrate the key features that permit
it to outperform the current state of the art by more than 20% on CIDEr.
Comment: Accepted to CVPR 2019
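As a hedged illustration of the kind of attention such a baseline might use (not the paper's exact model), the sketch below scores each modality's encoding against the question encoding and forms a weighted sum; the ModalityAttention name and 512-dimensional features are assumptions.

import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores one (query, modality) pair

    def forward(self, query, modal_feats):
        # query: (B, dim) question encoding; modal_feats: (B, M, dim), one
        # vector per input stream (audio, video, dialog history, ...).
        B, M, D = modal_feats.shape
        q = query.unsqueeze(1).expand(B, M, D)
        weights = torch.softmax(self.score(torch.cat([q, modal_feats], -1)), dim=1)
        return (weights * modal_feats).sum(dim=1)   # fused context for the answer decoder

# Example: fused = ModalityAttention()(torch.randn(4, 512), torch.randn(4, 3, 512))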
Convolutional Relational Machine for Group Activity Recognition
We present an end-to-end deep Convolutional Neural Network, called the
Convolutional Relational Machine (CRM), for recognizing group activities, which
utilizes the information in spatial relations between individual persons in an
image or video. It learns to produce an intermediate spatial representation
(activity map) based on individual and group activities. A multi-stage
refinement component reduces incorrect predictions in
the activity map. Finally, an aggregation component uses the refined
information to recognize group activities. Experimental results demonstrate the
constructive contribution of the information extracted and represented in the
form of the activity map. CRM shows advantages over state-of-the-art models on
the Volleyball and Collective Activity datasets.
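The following PyTorch sketch shows one plausible reading of that pipeline: an initial activity map with one channel per (individual or group) activity class, a few refinement stages that re-read the backbone features, and a pooling-based aggregation into group-activity logits. The class counts, stage count, and layer choices are assumptions, not the published CRM.

import torch
import torch.nn as nn

class ActivityMapRefiner(nn.Module):
    def __init__(self, feat_ch=256, num_individual=9, num_group=8, stages=2):
        super().__init__()
        map_ch = num_individual + num_group           # one channel per activity class
        self.initial = nn.Conv2d(feat_ch, map_ch, 1)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_ch + map_ch, feat_ch, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(feat_ch, map_ch, 1))
            for _ in range(stages)])
        self.classifier = nn.Linear(map_ch, num_group)

    def forward(self, feats):                         # feats: (B, feat_ch, H, W)
        act_map = self.initial(feats)                 # initial activity map
        for stage in self.stages:                     # multi-stage refinement
            act_map = stage(torch.cat([feats, act_map], dim=1))
        pooled = act_map.mean(dim=(2, 3))             # aggregate the refined map
        return self.classifier(pooled)                # group-activity logits

# Example: logits = ActivityMapRefiner()(torch.randn(2, 256, 28, 28))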
Evolving Space-Time Neural Architectures for Videos
We present a new method for finding video CNN architectures that capture rich
spatio-temporal information in videos. Previous work, taking advantage of 3D
convolutions, obtained promising results by manually designing video CNN
architectures. We here develop a novel evolutionary search algorithm that
automatically explores models with different types and combinations of layers
to jointly learn interactions between spatial and temporal aspects of video
representations. We demonstrate the generality of this algorithm by applying it
to two meta-architectures, obtaining new architectures superior to manually
designed architectures. Further, we propose a new component, the iTGM layer,
which more efficiently utilizes its parameters to allow learning of space-time
interactions over longer time horizons. The iTGM layer is often preferred by
the evolutionary algorithm and allows building cost-efficient networks. The
proposed approach discovers new and diverse video architectures that were
previously unknown. More importantly, they are both more accurate and faster
than prior models, and they outperform state-of-the-art results on the multiple
datasets we test, including HMDB, Kinetics, and Moments in Time. We will
open-source the code and models to encourage future model development.
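For readers unfamiliar with this family of methods, here is a minimal, self-contained sketch of a tournament-style evolutionary search loop in Python. The layer vocabulary, fitness function, and aging scheme are toy assumptions; the paper's actual mutation space, meta-architectures, and iTGM layer are not reproduced.

import random

LAYER_CHOICES = ["conv3d", "conv2d+1d", "tgm", "pool"]   # assumed search space

def random_arch(depth=4):
    return [random.choice(LAYER_CHOICES) for _ in range(depth)]

def mutate(arch):
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(LAYER_CHOICES)
    return child

def evolve(fitness_fn, population_size=20, rounds=100, sample_size=5):
    population = [(arch, fitness_fn(arch)) for arch in
                  (random_arch() for _ in range(population_size))]
    for _ in range(rounds):
        sample = random.sample(population, sample_size)   # tournament selection
        parent = max(sample, key=lambda p: p[1])[0]
        child = mutate(parent)                            # mutate the best sampled parent
        population.append((child, fitness_fn(child)))
        population.pop(0)                                 # age out the oldest individual
    return max(population, key=lambda p: p[1])[0]

# Example with a toy fitness (in practice: validation accuracy after training):
# best = evolve(fitness_fn=lambda arch: -arch.count("pool"))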
Hierarchical Feature Aggregation Networks for Video Action Recognition
Most action recognition methods are based on either a) a late aggregation of
frame-level CNN features using average pooling, max pooling, or an RNN, among
others, or b) spatio-temporal aggregation via 3D convolutions. The former
assumes independence among frame features up to a certain level of abstraction
and then performs higher-level aggregation, while the latter extracts
spatio-temporal features from grouped frames as early fusion. In this paper we
explore the space in
between these two, by letting adjacent feature branches interact as they
develop into the higher-level representation. The interaction happens between
feature differencing and averaging at each level of the hierarchy, and it has a
convolutional structure that learns to select the appropriate mode locally, in
contrast to previous works that impose one of the modes globally (e.g. feature
differencing) as a design choice. We further constrain this interaction to be
conservative, i.e. a local feature subtraction in one branch is compensated by
an addition in another, such that the total feature flow is preserved. We
evaluate the performance of our proposal on a number of existing models, i.e.
TSN, TRN, and ECO, to show its flexibility and effectiveness in improving action
recognition performance.
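One way to read the conservative interaction, sketched below under our own assumptions (the gating layer and exact exchange rule are not taken from the paper): a locally predicted gate moves part of the difference between two adjacent branches from one to the other, so the result interpolates between averaging (gate near 1) and keeping the branches apart (gate near 0), while the sum of the two branches is preserved.

import torch
import torch.nn as nn

class ConservativeInteraction(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Predicts, per location, how much of the branch difference to exchange.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, x, y):                      # two adjacent branches, (B, C, H, W)
        g = self.gate(torch.cat([x, y], dim=1))   # local mode selection in [0, 1]
        t = g * (y - x) * 0.5                     # transferred signal
        x_new, y_new = x + t, y - t               # x_new + y_new == x + y (conservative)
        return x_new, y_new

# g -> 1 pushes both branches toward their average; g -> 0 keeps them apart, so
# later differencing-style processing can still recover motion-like cues.
# Example: x2, y2 = ConservativeInteraction()(torch.randn(1, 256, 14, 14),
#                                             torch.randn(1, 256, 14, 14))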
Cooperative Cross-Stream Network for Discriminative Action Representation
The two-stream (spatial and temporal) model has achieved great success in video
action recognition. Most existing works focus on designing effective
feature fusion methods, training the two-stream model in a separate way.
However, in existing works it is hard to ensure discriminability and to exploit
complementary information between different streams. In this work, we
propose a novel cooperative cross-stream network that investigates the conjoint
information in multiple different modalities. Feature extraction for the
spatial and temporal stream networks is accomplished jointly in an end-to-end
learning manner. The network extracts complementary information from different
modalities through a connection block, which aims at exploring correlations of
different stream features. Furthermore, unlike a conventional ConvNet that learns
deep separable features with only one cross-entropy loss, our proposed model
enhances the discriminative power of the deeply learned features and reduces
the undesired modality discrepancy by jointly optimizing a modality ranking
constraint and a cross-entropy loss for both homogeneous and heterogeneous
modalities. The modality ranking constraint comprises an intra-modality
discriminative embedding and an inter-modality triplet constraint, and it reduces
both the intra-modality and cross-modality feature variations. Experiments on
three benchmark datasets demonstrate that, by cooperatively extracting
appearance and motion features, our method can achieve state-of-the-art or
competitive performance compared with existing results.
Comment: 10 pages, 6 figures
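A hedged sketch of how such a joint objective could look (the exact ranking constraint in the paper differs; the names, margin, and weighting below are assumptions): a cross-entropy term on the fused prediction plus a cross-modality triplet term that pulls the RGB and optical-flow embeddings of the same clip together and pushes apart embeddings of different classes.

import torch
import torch.nn.functional as F

def cross_stream_loss(rgb_emb, flow_emb, logits, labels, margin=0.3, alpha=0.5):
    # rgb_emb, flow_emb: (B, D) per-clip embeddings from the two streams;
    # logits: (B, num_classes) fused predictions; labels: (B,) class indices.
    ce = F.cross_entropy(logits, labels)

    # Inter-modality triplet: anchor = a clip's RGB embedding, positive = the
    # same clip's flow embedding, negative = the closest flow embedding that
    # belongs to a different class.
    dist = torch.cdist(rgb_emb, flow_emb)                 # (B, B) pairwise distances
    pos = dist.diagonal()                                 # same-clip (anchor, positive) pairs
    diff_class = labels.unsqueeze(0) != labels.unsqueeze(1)
    neg = dist.masked_fill(~diff_class, float("inf")).min(dim=1).values
    triplet = F.relu(pos - neg + margin).mean()           # rows without a negative contribute 0

    return ce + alpha * triplet

# Example: loss = cross_stream_loss(torch.randn(8, 128), torch.randn(8, 128),
#                                   torch.randn(8, 51), torch.randint(0, 51, (8,)))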
SCAN: Self-and-Collaborative Attention Network for Video Person Re-identification
Video person re-identification has attracted much attention in recent years. It
aims to match image sequences of pedestrians from different camera views.
Previous approaches usually improve this task from three aspects, including a)
selecting more discriminative frames, b) generating more informative temporal
representations, and c) developing more effective distance metrics. To address
the above issues, we present a novel and practical deep architecture for video
person re-identification termed Self-and-Collaborative Attention Network
(SCAN). It has several appealing properties. First, SCAN adopts a non-parametric
attention mechanism to refine the intra-sequence and inter-sequence feature
representations of videos, and it outputs a self-and-collaborative feature
representation for each video, aligning the discriminative frames between
the probe and gallery sequences. Second, beyond existing models, a generalized
pairwise similarity measurement is proposed to calculate the similarity feature
representations of video pairs, enabling the matching scores to be computed by a
binary classifier. Third, a dense clip segmentation strategy is also introduced
to generate rich probe-gallery pairs for optimizing the model. Extensive
experiments demonstrate the effectiveness of SCAN, which outperforms the
best-performing baselines on the iLIDS-VID, PRID2011, and MARS datasets.
Comment: 10 pages, 5 figures
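To illustrate what a non-parametric (parameter-free) attention step can look like in this setting, here is a small sketch under our own assumptions: frame features of a query sequence are re-weighted by their dot-product affinities to a reference sequence, giving a "self" feature when the reference is the sequence itself and a "collaborative" feature when the reference is the other sequence in the pair. This is not the released SCAN code.

import torch

def nonparametric_attention(query_seq, ref_seq, temperature=1.0):
    # query_seq: (T_q, D) frame features; ref_seq: (T_r, D) frame features.
    scores = query_seq @ ref_seq.t() / temperature   # (T_q, T_r) affinities, no learned weights
    weights = torch.softmax(scores, dim=1)
    refined = weights @ ref_seq                      # re-weighted reference frames
    return refined.mean(dim=0)                       # one feature vector per video

# "Self" feature: the probe attended by itself; "collaborative" feature: the
# probe attended by the gallery sequence. The resulting video-level vectors can
# then feed a pairwise similarity measure and a binary match classifier.
probe, gallery = torch.randn(16, 256), torch.randn(20, 256)
self_feat = nonparametric_attention(probe, probe)
collab_feat = nonparametric_attention(probe, gallery)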
Reasoning about Body-Parts Relations for Sign Language Recognition
Over the years, hand gesture recognition has been mostly addressed
by considering hand trajectories in isolation. However, in most sign languages,
hand gestures are defined in a particular context (body region). We propose a
pipeline for sign language recognition which models hand movements in
the context of other body parts captured in 3D space using the MS
Kinect sensor. In addition, we perform sign recognition based on the different
hand postures that occur during a sign. Our experiments show that considering
different body parts brings improved performance when compared to other methods
which only consider global hand trajectories. Finally, we demonstrate that the
combination of hand posture features with hand gesture features helps to
improve the prediction of a given sign.
Comment: Under review (15 pages, 13 figures, 6 tables)
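As a small illustrative example of modeling hand movements relative to other body parts (the joint names and indices below are hypothetical; this is not the paper's feature set), one can describe the hand at each frame by its 3D offset and distance to a few context joints rather than by its absolute trajectory:

import numpy as np

# Hypothetical joint indexing for a Kinect-style skeleton.
JOINTS = {"head": 3, "shoulder_l": 4, "shoulder_r": 8, "torso": 1, "hand_r": 11}
CONTEXT = ["head", "shoulder_l", "shoulder_r", "torso"]

def relational_hand_features(skeleton_seq, hand="hand_r"):
    # skeleton_seq: (T, num_joints, 3) array of 3D joint positions over time.
    hand_pos = skeleton_seq[:, JOINTS[hand]]                         # (T, 3)
    feats = []
    for part in CONTEXT:
        offset = hand_pos - skeleton_seq[:, JOINTS[part]]            # hand relative to the part
        feats.append(offset)
        feats.append(np.linalg.norm(offset, axis=1, keepdims=True))  # hand-to-part distance
    return np.concatenate(feats, axis=1)                             # (T, 4 * len(CONTEXT))

# Example: f = relational_hand_features(np.random.randn(40, 20, 3))  # 40 frames, 20 joints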