3,269 research outputs found
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in the video. Several attempts have
been made to capture frame-level salient aspects through attention but they
lack the capacity to effectively map the temporal relationships in between the
frames as they only capture a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech
Egocentric datasets, and achieves state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings which are inclusive of
static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.Comment: Published in Pattern Recognition Letter
Describing Videos by Exploiting Temporal Structure
Recent progress in using recurrent neural networks (RNNs) for image
description has motivated the exploration of their application for video
description. However, while images are static, working with videos requires
modeling their dynamic temporal structure and then properly integrating that
information into a natural language description. In this context, we propose an
approach that successfully takes into account both the local and global
temporal structure of videos to produce descriptions. First, our approach
incorporates a spatial temporal 3-D convolutional neural network (3-D CNN)
representation of the short temporal dynamics. The 3-D CNN representation is
trained on video action recognition tasks, so as to produce a representation
that is tuned to human motion and behavior. Second we propose a temporal
attention mechanism that allows to go beyond local temporal modeling and learns
to automatically select the most relevant temporal segments given the
text-generating RNN. Our approach exceeds the current state-of-art for both
BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on
a new, larger and more challenging dataset of paired video and natural language
descriptions.Comment: Accepted to ICCV15. This version comes with code release and
supplementary materia
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR) i.e. to localize a target
object in a visual scene coming with a language description. Humans perceive
the world more as continued video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30, 000 objects over 5, 000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previousOR methods. For dataset and code, please refer
https://people.ee.ethz.ch/~arunv/ORGaze.html.Comment: Accepted to CVPR 2018, 10 pages, 6 figure
Interaction-aware Spatio-temporal Pyramid Attention Networks for Action Classification
Local features at neighboring spatial positions in feature maps have high
correlation since their receptive fields are often overlapped. Self-attention
usually uses the weighted sum (or other functions) with internal elements of
each local feature to obtain its weight score, which ignores interactions among
local features. To address this, we propose an effective interaction-aware
self-attention model inspired by PCA to learn attention maps. Furthermore,
since different layers in a deep network capture feature maps of different
scales, we use these feature maps to construct a spatial pyramid and then
utilize multi-scale information to obtain more accurate attention scores, which
are used to weight the local features in all spatial positions of feature maps
to calculate attention maps. Moreover, our spatial pyramid attention is
unrestricted to the number of its input feature maps so it is easily extended
to a spatio-temporal version. Finally, our model is embedded in general CNNs to
form end-to-end attention networks for action classification. Experimental
results show that our method achieves the state-of-the-art results on the
UCF101, HMDB51 and untrimmed Charades.Comment: Accepted by ECCV201
- …