Deep Motion Features for Visual Tracking
Robust visual tracking is a challenging computer vision problem, with many
real-world applications. Most existing approaches employ hand-crafted
appearance features, such as HOG or Color Names. Recently, deep RGB features
extracted from convolutional neural networks have been successfully applied to
tracking. Despite their success, these features only capture appearance
information. On the other hand, motion cues provide discriminative and
complementary information that can improve tracking performance. Unlike in
visual tracking, deep motion features have already been successfully applied to action
recognition and video classification tasks. Typically, the motion features are
learned by training a CNN on optical flow images extracted from large amounts
of labeled videos.
This paper presents an investigation of the impact of deep motion features in
a tracking-by-detection framework. We further show that hand-crafted, deep RGB,
and deep motion features contain complementary information. To the best of our
knowledge, we are the first to propose fusing appearance information with deep
motion features for visual tracking. Comprehensive experiments clearly suggest
that our fusion approach with deep motion features outperforms standard methods
relying on appearance information alone.
Comment: ICPR 2016. Best paper award in the "Computer Vision and Robot Vision"
trac
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continued video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figure