How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization
The large volume of video content and high viewing frequency demand automatic
video summarization algorithms, a key property of which is the ability to
model diversity. When videos are lengthy, such as hours-long egocentric
videos, it is necessary to track their temporal structure and enforce local
diversity: the shots selected within a short time window should be diverse,
but visually similar shots may co-exist in the summary if they appear far
apart in the video. In this paper, we propose a
novel probabilistic model, built upon SeqDPP, to dynamically control the time
span of a video segment upon which the local diversity is imposed. In
particular, we enable SeqDPP to automatically infer, from the input video, how
local the local diversity should be. The resulting model is extremely
difficult to train by standard maximum likelihood estimation (MLE), which
further suffers from exposure bias and non-differentiable evaluation metrics.
To tackle these problems, we instead devise a reinforcement learning algorithm
for training the proposed model. Extensive experiments verify the advantages
of our model and the new learning algorithm over MLE-based methods.
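The diversity modeling that the abstract builds on can be illustrated with a minimal determinantal point process (DPP) sketch. This is not the authors' SeqDPP model; it is a hypothetical toy example showing the core mechanism: a DPP scores a subset Y of items by det(L_Y), the determinant of the kernel submatrix indexed by Y, so subsets of similar items (correlated kernel rows) get low probability and diverse subsets get high probability. The kernel, item features, and greedy selection here are all illustrative assumptions.

```python
import numpy as np

def dpp_score(L, subset):
    """Unnormalized DPP probability of `subset`: det of the kernel submatrix."""
    idx = np.asarray(subset)
    return np.linalg.det(L[np.ix_(idx, idx)])

def greedy_map(L, k):
    """Greedy MAP selection of k diverse items under kernel L (illustrative)."""
    selected = []
    remaining = list(range(L.shape[0]))
    for _ in range(k):
        # Pick the item whose addition maximizes det(L_Y) -- i.e. the item
        # least similar to what is already selected.
        best = max(remaining, key=lambda i: dpp_score(L, selected + [i]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy "shot" features: items 0 and 1 are near-duplicates; item 2 is distinct.
feats = np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]])
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
L = feats @ feats.T + 1e-6 * np.eye(3)  # similarity kernel, jittered for stability

picked = greedy_map(L, 2)  # selects the two most mutually diverse items
```

In a sequential DPP such as SeqDPP, this subset selection is applied segment by segment along the video, conditioned on the previous segment's selection; the paper's contribution is letting the model infer the segment boundaries (the "ground sets") dynamically rather than fixing their length.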