TagBook: A Semantic Video Representation without Supervision for Event Detection
We consider the problem of event detection in video for scenarios where only
a few, or even zero, examples are available for training. For this challenging
setting, the prevailing solutions in the literature rely on a semantic video
representation obtained from thousands of pre-trained concept detectors.
Different from existing work, we propose a new semantic video representation
that is based on freely available social tagged videos only, without the need
for training any intermediate concept detectors. We introduce a simple
algorithm that propagates tags from a video's nearest neighbors, similar in
spirit to the ones used for image retrieval, but redesign it for video event
detection by including video source set refinement and varying the video tag
assignment. We call our approach TagBook and study its construction,
descriptiveness and detection performance on the TRECVID 2013 and 2014
multimedia event detection datasets and the Columbia Consumer Video dataset.
Despite its simple nature, the proposed TagBook video representation is
remarkably effective for few-example and zero-example event detection, even
outperforming very recent state-of-the-art alternatives building on supervised
representations.
Comment: accepted for publication as a regular paper in the IEEE Transactions
on Multimedi
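
The tag propagation idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tagbook_representation`, the cosine similarity measure, the uniform neighbor voting, and the parameter `k` are all assumptions made for illustration, and the source set refinement step mentioned in the abstract is omitted.

```python
import numpy as np

def tagbook_representation(query, source_feats, source_tags, vocab, k=3):
    """Build a tag-based vector for one video from its k nearest tagged neighbors.

    query        : (D,) feature vector of the video to describe
    source_feats : (N, D) features of the socially tagged source videos
    source_tags  : list of N tag sets, one per source video
    vocab        : ordered list of tags defining the output dimensions
    All names and the voting scheme are hypothetical.
    """
    # Cosine similarity between the query and every source video.
    q = query / np.linalg.norm(query)
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    sims = s @ q

    # Indices of the k most similar source videos.
    neighbors = np.argsort(-sims)[:k]

    # Each neighbor casts one vote for every in-vocabulary tag it carries.
    index = {tag: i for i, tag in enumerate(vocab)}
    rep = np.zeros(len(vocab))
    for j in neighbors:
        for tag in source_tags[j]:
            if tag in index:
                rep[index[tag]] += 1.0

    # Normalize by k so each entry is the fraction of neighbors with that tag.
    return rep / k

# Toy usage with made-up features and tags.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
tags = [{"dog", "park"}, {"dog"}, {"cake"}]
vocab = ["dog", "park", "cake"]
rep = tagbook_representation(np.array([1.0, 0.0]), feats, tags, vocab, k=2)
```

In this toy setting the two nearest neighbors both carry "dog" and one carries "park", so the resulting vector weights "dog" fully, "park" by half, and "cake" not at all.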
Semantic Video CNNs through Representation Warping
In this work, we propose a technique to convert CNN models for semantic
segmentation of static images into CNNs for video data. We describe a warping
method that can be used to augment existing architectures with very little
extra computational cost. This module is called NetWarp and we demonstrate its
use for a range of network architectures. The main design principle is to use
optical flow of adjacent frames for warping internal network representations
across time. A key insight of this work is that fast optical flow methods can
be combined with many different CNN architectures for improved performance and
end-to-end training. Experiments validate that the proposed approach incurs
little extra computational cost while improving performance when video
streams are available. We achieve new state-of-the-art results on the CamVid
and Cityscapes benchmark datasets and show consistent improvements over
different baseline networks. Our code and models will be available at
http://segmentation.is.tue.mpg.de
Comment: ICCV 201
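
The design principle stated in the abstract above, warping internal representations across time with optical flow, can be sketched as follows. This is not the NetWarp module itself: the function names `warp_features` and `combine`, the nearest-neighbor sampling, the flow convention (each current-frame pixel points to its location in the previous frame), and the fixed scalar weights are all assumptions chosen for brevity.

```python
import numpy as np

def warp_features(prev_feat, flow):
    """Warp a previous-frame feature map into the current frame.

    prev_feat : (C, H, W) feature map from the previous frame
    flow      : (2, H, W) flow field; flow[0] is the x offset and flow[1]
                the y offset from each current pixel to its position in
                the previous frame (assumed convention)
    Uses nearest-neighbor sampling with border clamping (a simplification).
    """
    C, H, W = prev_feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Source coordinates in the previous frame, rounded and clamped.
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, H - 1)

    # Gather features at the source coordinates for every channel.
    return prev_feat[:, src_y, src_x]

def combine(curr_feat, warped_prev, w_curr=0.5, w_prev=0.5):
    """Blend current features with warped previous features.

    The fixed scalar weights are hypothetical placeholders.
    """
    return w_curr * curr_feat + w_prev * warped_prev

# Toy usage: a single-channel 3x4 feature map shifted by one pixel in x.
prev = np.arange(12, dtype=float).reshape(1, 3, 4)
flow = np.zeros((2, 3, 4))
flow[0] = 1.0
warped = warp_features(prev, flow)
fused = combine(np.zeros_like(prev), warped)
```

A differentiable version would replace the rounding with bilinear interpolation so that gradients can flow through the warp.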