Temporal Relational Reasoning in Videos
Temporal relational reasoning, the ability to link meaningful transformations
of objects or entities over time, is a fundamental property of intelligent
species. In this paper, we introduce an effective and interpretable network
module, the Temporal Relation Network (TRN), designed to learn and reason about
temporal dependencies between video frames at multiple time scales. We evaluate
TRN-equipped networks on activity recognition tasks using three recent video
datasets - Something-Something, Jester, and Charades - which fundamentally
depend on temporal relational reasoning. Our results demonstrate that the
proposed TRN gives convolutional neural networks a remarkable capacity to
discover temporal relations in videos. Through only sparsely sampled video
frames, TRN-equipped networks can accurately predict human-object interactions
in the Something-Something dataset and identify various human gestures on the
Jester dataset with very competitive performance. TRN-equipped networks also
outperform two-stream networks and 3D convolution networks in recognizing daily
activities in the Charades dataset. Further analyses show that the models learn
intuitive and interpretable visual common sense knowledge in videos.
Comment: camera-ready version for ECCV'1
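As a rough illustration of the two ideas the abstract names, sparse frame sampling and relations over frames at multiple time scales, the sketch below picks one frame index per video segment and then enumerates temporally ordered frame tuples of each size. This is not the authors' implementation: the function names, the choice of the segment midpoint, and the exhaustive enumeration of tuples are all assumptions made for illustration.

```python
from itertools import combinations


def sample_sparse_frames(num_frames, num_segments):
    """Return one frame index from the middle of each equal-length segment.

    Assumption: midpoint sampling; the abstract only says frames are
    sparsely sampled, not how.
    """
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def multi_scale_relations(frame_indices, max_scale):
    """Group temporally ordered frame tuples by scale (tuple length).

    combinations() keeps the input order, so every tuple lists its
    frames from earliest to latest.
    """
    return {
        scale: list(combinations(frame_indices, scale))
        for scale in range(2, max_scale + 1)
    }


frames = sample_sparse_frames(num_frames=16, num_segments=4)
relations = multi_scale_relations(frames, max_scale=3)
print(frames)             # [2, 6, 10, 14]
print(len(relations[2]))  # 6 two-frame relations
print(len(relations[3]))  # 4 three-frame relations
```

In a full model each tuple would be passed through a learned function and the per-scale outputs combined; that part is omitted here because the abstract does not specify it.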
Understanding and Improving Recurrent Networks for Human Activity Recognition by Continuous Attention
Deep neural networks, including recurrent networks, have been successfully
applied to human activity recognition. Unfortunately, the final representation
learned by recurrent networks might encode some noise (irrelevant signal
components, unimportant sensor modalities, etc.). Besides, it is difficult to
interpret the recurrent networks to gain insight into the models' behavior. To
address these issues, we propose two attention models for human activity
recognition: temporal attention and sensor attention. These two mechanisms
adaptively focus on important signals and sensor modalities. To further improve
the understandability and mean F1 score, we add continuity constraints,
considering that continuous sensor signals are more robust than discrete ones.
We evaluate the approaches on three datasets and obtain state-of-the-art
results. Furthermore, qualitative analysis shows that the attention learned by
the models agrees well with human intuition.
Comment: 8 pages. Published in The International Symposium on Wearable
Computers (ISWC) 201
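A minimal sketch of what temporal attention with a continuity constraint could look like: softmax weights over time steps, a weighted sum of recurrent hidden states, and a penalty on abrupt changes between neighboring weights. The abstract does not give the exact formulation, so the squared-difference penalty and all function names here are assumptions, not the paper's method. Sensor attention would follow the same pattern with weights over sensor channels instead of time steps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, scores):
    """Weight each time step's hidden state by its attention weight.

    hidden_states: list of equal-length vectors, one per time step.
    scores: one unnormalized relevance score per time step.
    """
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


def continuity_penalty(weights):
    """Assumed regularizer: sum of squared differences between
    adjacent attention weights, so smooth attention costs less."""
    return sum((b - a) ** 2 for a, b in zip(weights, weights[1:]))


weights, context = attend([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
print(weights)                      # equal scores give uniform weights
print(continuity_penalty(weights))  # uniform weights incur no penalty
```

During training, a penalty of this kind would be added to the classification loss with a small coefficient, which is one way to encourage the smooth, interpretable attention the abstract describes.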
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis has evolved from early schemes, often limited to controlled
environments, to today's advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are being reached ever more rapidly, quickly
making obsolete what used to be good. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable fallbacks, in the
hope of raising fresh questions and motivating new research directions for the
reader.