Discriminative Feature Learning for Unsupervised Video Summarization
In this paper, we address the problem of unsupervised video summarization
that automatically extracts key-shots from an input video. Specifically, we
tackle two critical issues based on our empirical observations: (i) ineffective
feature learning due to flat distributions of output importance scores for each
frame, and (ii) training difficulty when dealing with long video inputs.
To alleviate the first problem, we propose a simple yet effective
regularization loss term called variance loss. The proposed variance loss
allows a network to predict output scores for each frame with high discrepancy,
which enables effective feature learning and significantly improves model
performance. For the second problem, we design a novel two-stream network named
Chunk and Stride Network (CSNet) that utilizes local (chunk) and global
(stride) temporal views of the video features. Our CSNet gives better
summarization results for long videos than existing methods.
In addition, we introduce an attention mechanism to handle the dynamic
information in videos. We demonstrate the effectiveness of the proposed methods
by conducting extensive ablation studies and show that our final model achieves
new state-of-the-art results on two benchmark datasets. (Comment: Accepted to AAAI 2019.)
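The two ideas in the abstract can be sketched roughly in NumPy. Both pieces below are our own illustrative assumptions, not the paper's exact formulation: the reciprocal-of-variance form of the loss and the reshape-based chunk/stride decomposition merely show the flavor of "penalize flat score distributions" and "local contiguous vs. global strided temporal views".

```python
import numpy as np

def variance_loss(scores, eps=1e-8):
    # scores: (batch, n_frames) array of predicted importance scores in [0, 1].
    # Penalizing low variance pushes the network toward high-discrepancy
    # (non-flat) score distributions across frames.
    var = scores.var(axis=1)
    return float(np.mean(1.0 / (var + eps)))

def chunk_and_stride_views(features, n):
    # features: (n_frames, dim), with n_frames divisible by n.
    T, D = features.shape
    chunk = features.reshape(n, T // n, D)                      # local: n contiguous chunks
    stride = features.reshape(T // n, n, D).transpose(1, 0, 2)  # global: n strided subsequences
    return chunk, stride
```

A flat prediction (e.g. every frame scored 0.5) yields a far larger variance loss than a well-separated one, which is exactly the degenerate case the regularizer discourages.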
Scale-Adaptive Video Understanding
The recent rise of large-scale, diverse video data has ushered in a new era of high-level video understanding. It is increasingly critical for intelligent systems to extract semantics from videos. In this dissertation, we explore the use of supervoxel hierarchies as a type of video representation for high-level video understanding. The supervoxel hierarchies contain rich multiscale decompositions of video content, where various structures can be found at various levels. However, no single level of scale contains all the desired structures we need. It is essential to adaptively choose the scales for subsequent video analysis. Thus, we present a set of tools to manipulate scales in supervoxel hierarchies, including both scale generation and scale selection methods.
In our scale generation work, we evaluate a set of seven supervoxel methods in the context of what we consider to be a good supervoxel for video representation. We address a key limitation that has traditionally prevented supervoxel scale generation on long videos. We do so by proposing an approximation framework for streaming hierarchical scale generation that is able to generate multiscale decompositions for arbitrarily-long videos using constant memory.
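The constant-memory streaming idea can be illustrated with a toy 1-D analogue. This is our own simplification, not the dissertation's framework: runs of equal values stand in for supervoxels, a single per-chunk segmenter replaces the hierarchy, and a one-element overlap carries labels across chunk boundaries so segments stay consistent while only one chunk is ever held in memory.

```python
def segment_chunk(values, start_label, seed_label=None, seed_value=None):
    # Label runs of equal consecutive values; if the first value matches
    # the carried-over seed, reuse its label so segments span chunk borders.
    labels = []
    for i, v in enumerate(values):
        if i == 0:
            if seed_value is not None and v == seed_value:
                labels.append(seed_label)
            else:
                labels.append(start_label)
                start_label += 1
        elif v == values[i - 1]:
            labels.append(labels[-1])
        else:
            labels.append(start_label)
            start_label += 1
    return labels, start_label

def streaming_segment(stream, chunk_len):
    # Constant memory: only the current chunk plus one carried frame.
    out, next_label, seed_v, seed_l = [], 0, None, None
    for i in range(0, len(stream), chunk_len):
        chunk = stream[i:i + chunk_len]
        labels, next_label = segment_chunk(chunk, next_label, seed_l, seed_v)
        out.extend(labels)
        seed_v, seed_l = chunk[-1], labels[-1]
    return out
```

Streaming over small chunks reproduces the labeling a whole-sequence pass would give, which is the essence of approximating a global decomposition under a constant memory budget.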
Subsequently, we present two scale selection methods that are able to adaptively choose the scales according to application needs. The first method flattens the entire supervoxel hierarchy into a single segmentation that overcomes the limitation induced by trivial selection of a single scale. We show that the selection can be driven by various post hoc feature criteria. The second scale selection method combines the supervoxel hierarchy with a conditional random field for the task of labeling actors and actions in videos. We formulate the scale selection problem and the video labeling problem in a joint framework. Experiments on a novel large-scale video dataset demonstrate the effectiveness of the explicit consideration of scale selection in video understanding.
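Hierarchy flattening can be sketched as a top-down selection over a tree. The node class, the scalar criterion, and the greedy rule below are illustrative assumptions (the dissertation uses post hoc feature criteria and, in the second method, a conditional random field); the sketch only shows how different branches can settle at different scales.

```python
class Node:
    def __init__(self, label, score, children=()):
        self.label = label            # supervoxel id at some level of the hierarchy
        self.score = score            # post hoc criterion value (higher = better)
        self.children = list(children)

def flatten(node):
    """Return nodes forming a single flat segmentation: keep a node if its
    criterion beats the best achievable anywhere in its subtree; otherwise
    recurse, letting each branch pick its own scale."""
    if not node.children:
        return [node]
    best_below = [n for child in node.children for n in flatten(child)]
    if node.score >= max(n.score for n in best_below):
        return [node]
    return best_below
```

For example, a coarse region whose criterion is weaker than its best descendants is split, while a sibling region whose criterion dominates its subtree is kept whole, so the output mixes scales instead of committing to one level.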
Aside from the computational methods, we present a visual psychophysical study to quantify how well the actor and action semantics in high-level video understanding are retained in supervoxel hierarchies. The ultimate findings suggest that some semantics are well retained in the supervoxel hierarchies and can be used for further video analysis.
PhD thesis, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/133202/1/cliangxu_1.pd