152 research outputs found
Video Summarization Using Deep Neural Networks: A Survey
Video summarization technologies aim to create a concise and complete
synopsis by selecting the most informative parts of the video content. Several
approaches have been developed over the last couple of decades and the current
state of the art is represented by methods that rely on modern deep neural
network architectures. This work focuses on the recent advances in the area and
provides a comprehensive survey of the existing deep-learning-based methods for
generic video summarization. After presenting the motivation behind the
development of technologies for video summarization, we formulate the video
summarization task and discuss the main characteristics of a typical
deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the
existing algorithms and provide a systematic review of the relevant literature
that shows the evolution of the deep-learning-based video summarization
technologies and leads to suggestions for future developments. We then report
on protocols for the objective evaluation of video summarization algorithms and
we compare the performance of several deep-learning-based approaches. Based on
the outcomes of these comparisons, as well as some documented considerations
about the suitability of evaluation protocols, we indicate potential future
research directions. Comment: Journal paper; Under review
Deep attentive video summarization with distribution consistency learning
This article studies supervised video summarization by formulating it as a sequence-to-sequence learning problem, in which the input is the sequence of original video frames and the output is the sequence of their predicted importance scores. Two critical issues are addressed: insufficient short-term contextual attention and distribution inconsistency. The former is the failure to capture short-term contextual attention information within the video sequence itself, since existing approaches focus mainly on long-term encoder-decoder attention. The latter is the inconsistency between the distribution of the predicted importance score sequence and that of the ground-truth sequence, which may lead to a suboptimal solution. To mitigate the first issue, we incorporate a self-attention mechanism in the encoder to highlight the important keyframes in a short-term context. This mechanism, together with the encoder-decoder attention, constitutes our deep attentive models for video summarization. For the second issue, we propose a distribution consistency learning method that employs a simple yet effective regularization loss term, which seeks a consistent distribution for the two sequences. Our final approach is dubbed Attentive and Distribution consistent video Summarization (ADSum). Extensive experiments on benchmark data sets demonstrate the superiority of the proposed ADSum approach over state-of-the-art approaches.
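To make the distribution-consistency idea concrete, here is a minimal sketch. It turns the predicted and ground-truth importance scores into distributions over frames and penalizes their divergence alongside a regression loss. The abstract does not give the form of the regularization term, so the softmax normalization, the KL divergence, the `weight` value, and all function names are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Turn a sequence of scores into a probability distribution over frames."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distribution_consistency_loss(predicted, ground_truth):
    """Assumed regularizer: divergence between the two score distributions."""
    return kl_divergence(softmax(ground_truth), softmax(predicted))

def total_loss(predicted, ground_truth, weight=0.1):
    """Mean squared error plus the weighted consistency term (weight is hypothetical)."""
    mse = sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / len(predicted)
    return mse + weight * distribution_consistency_loss(predicted, ground_truth)

# Hypothetical per-frame importance scores for a five-frame clip.
ground_truth = [0.1, 0.8, 0.9, 0.2, 0.1]
good_prediction = [0.2, 0.7, 0.8, 0.3, 0.1]
poor_prediction = [0.9, 0.1, 0.1, 0.8, 0.9]

good_loss = total_loss(good_prediction, ground_truth)
poor_loss = total_loss(poor_prediction, ground_truth)
```

In this toy setting the prediction whose score distribution follows the ground truth receives the lower loss, which is the behaviour the regularizer is meant to encourage.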
AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization
This paper presents a new method for unsupervised video summarization. The proposed architecture embeds an Actor-Critic model into a Generative Adversarial Network and formulates the selection of important video fragments (that will be used to form the summary) as a sequence generation task. The Actor and the Critic take part in a game that incrementally leads to the selection of the video key-fragments, and their choices at each step of the game result in a set of rewards from the Discriminator. The designed training workflow allows the Actor and Critic to discover a space of actions and automatically learn a policy for key-fragment selection. Moreover, the introduced criterion for choosing the best model after training ends enables the automatic selection of proper values for parameters of the training process that are not learned from the data (such as the regularization factor σ). Experimental evaluation on two benchmark datasets (SumMe and TVSum) demonstrates that the proposed AC-SUM-GAN model performs consistently well and gives state-of-the-art (SoA) results among unsupervised methods, results that are also competitive with those of supervised methods.
How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization
The large volume of video content and high viewing frequency demand automatic
video summarization algorithms, of which a key property is the capability of
modeling diversity. If videos are lengthy, such as hours-long egocentric videos, it
is necessary to track the temporal structures of the videos and enforce local
diversity. Local diversity means that the shots selected from a short
time span are diverse, while visually similar shots may co-exist in
the summary if they appear far apart in the video. In this paper, we propose a
novel probabilistic model, built upon SeqDPP, to dynamically control the time
span of a video segment upon which the local diversity is imposed. In
particular, we enable SeqDPP to learn to automatically infer how local the
local diversity is supposed to be from the input video. The resulting model is
extremely involved to train by the hallmark maximum likelihood estimation
(MLE), which further suffers from the exposure bias and non-differentiable
evaluation metrics. To tackle these problems, we instead devise a reinforcement
learning algorithm for training the proposed model. Extensive experiments
verify the advantages of our model and the new learning algorithm over
MLE-based methods.
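The diversity property that determinantal point processes model can be illustrated with a small sketch. A DPP scores a subset of items by the determinant of the corresponding kernel submatrix, so subsets of near-duplicate items score close to zero. The code below shows only this basic building block, not SeqDPP or the dynamic ground sets proposed in the paper. The cosine-similarity kernel, the toy shot features, and all names are assumptions.

```python
import math

def normalize(vector):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def gram_matrix(features, indices):
    """Cosine-similarity kernel restricted to the selected shots."""
    chosen = [normalize(features[i]) for i in indices]
    return [[sum(a * b for a, b in zip(u, v)) for v in chosen] for u in chosen]

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def diversity_score(features, indices):
    """Unnormalized DPP score of a subset: det of the kernel submatrix."""
    return determinant(gram_matrix(features, indices))

# Hypothetical 2-D shot features: shots 0 and 1 look alike, shot 2 differs.
shot_features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
near_duplicates = diversity_score(shot_features, [0, 1])
distinct_shots = diversity_score(shot_features, [0, 2])
```

Enforcing diversity only locally would correspond to evaluating such scores within a limited time span rather than over the whole video. How that span is inferred from the input is the subject of the paper and is not modeled here.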
Dilated Temporal Relational Adversarial Network for Generic Video Summarization
The large number of videos appearing every day makes it increasingly
critical that key information within videos can be extracted and understood in
a very short time. Video summarization, the task of finding the smallest subset
of frames that still conveys the whole story of a given video, is thus of
great significance for improving the efficiency of video understanding. We propose a
novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to
achieve frame-level video summarization. Given a video, it selects the set of
key frames that contain the most meaningful and compact information.
Specifically, DTR-GAN learns a dilated temporal relational generator and a
discriminator with three-player loss in an adversarial manner. A new dilated
temporal relation (DTR) unit is introduced to enhance temporal representation
capturing. The generator uses this unit to effectively exploit global
multi-scale temporal context to select key frames and to complement the
commonly used Bi-LSTM. To ensure that summaries capture enough of the key video
representation from a global perspective, rather than being a trivial, randomly
shortened sequence, we present a discriminator that learns to enforce both the
information completeness and compactness of summaries via a three-player loss.
The loss includes the generated summary loss, the random summary loss, and the
real summary (ground-truth) loss, which play important roles for better
regularizing the learned model to obtain useful summaries. Comprehensive
experiments on three public datasets show the effectiveness of the proposed
approach.
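The notion of multi-scale temporal context can be sketched with a plain dilated 1-D convolution over a per-frame signal. Increasing the dilation widens the temporal span that each output position sees without adding kernel weights. This is a generic illustration of dilated temporal convolution, not the DTR unit itself. The kernel, the dilation rates (1, 2, 4), and the function names are assumptions.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1-D convolution with zero padding and the given dilation."""
    center = len(kernel) // 2
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Taps are spaced `dilation` frames apart around position t.
            source = t + (k - center) * dilation
            if 0 <= source < len(signal):
                total += weight * signal[source]
        output.append(total)
    return output

def multi_scale_context(signal, kernel, dilations=(1, 2, 4)):
    """Apply the same kernel at several dilation rates (rates are hypothetical)."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]

# A single active frame in a nine-frame clip shows which positions each scale reaches.
impulse = [0.0] * 9
impulse[4] = 1.0
responses = multi_scale_context(impulse, [1.0, 1.0, 1.0])
reached = [[t for t, value in enumerate(r) if value != 0.0] for r in responses]
```

Here `reached` lists, for each dilation rate, the frame positions influenced by the active frame, which makes the growth of the temporal receptive field visible.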
Video summarization through reinforcement learning with a 3D spatio-temporal U-Net
Intelligent video summarization algorithms make it possible to quickly convey the most relevant information in videos by identifying the most essential and explanatory content while removing redundant video frames. In this paper, we introduce the 3DST-UNet-RL framework for video summarization. A 3D spatio-temporal U-Net is used to efficiently encode spatio-temporal information of the input videos for downstream reinforcement learning (RL). An RL agent learns from spatio-temporal latent scores and predicts actions for keeping or rejecting a video frame in a video summary. We investigate whether real/inflated 3D spatio-temporal CNN features are better suited to learn representations from videos than commonly used 2D image features. Our framework can operate in both a fully unsupervised mode and a supervised training mode. We analyse the impact of prescribed summary lengths and show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks. We also applied our method to a medical video summarization task. The proposed video summarization method has the potential to save storage costs of ultrasound screening videos as well as to increase efficiency when browsing patient video data during retrospective analysis or audit without losing essential information.
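The keep-or-reject formulation can be illustrated with a minimal policy-gradient sketch. Each frame has a keep probability, actions are sampled as independent Bernoulli draws, and a REINFORCE-style estimate indicates how to adjust the per-frame logits. The reward used here, which only penalizes deviation from a prescribed summary length, is a hypothetical stand-in. The abstract does not describe the actual reward, and all names and values below are assumptions.

```python
import math
import random

def sample_actions(keep_probs, rng):
    """Sample 1 (keep) or 0 (reject) independently for each frame."""
    return [1 if rng.random() < p else 0 for p in keep_probs]

def log_prob(actions, keep_probs):
    """Log-probability of an action sequence under the Bernoulli policy."""
    return sum(math.log(p if a == 1 else 1.0 - p)
               for a, p in zip(actions, keep_probs))

def length_reward(actions, target_fraction):
    """Hypothetical reward: highest when the kept fraction matches the target."""
    kept_fraction = sum(actions) / len(actions)
    return 1.0 - abs(kept_fraction - target_fraction)

def reinforce_gradient(keep_probs, target_fraction, rng, episodes=200):
    """Estimate the gradient of expected reward with respect to per-frame logits."""
    samples = []
    for _ in range(episodes):
        actions = sample_actions(keep_probs, rng)
        samples.append((actions, length_reward(actions, target_fraction)))
    # The mean reward serves as a simple variance-reducing baseline.
    baseline = sum(reward for _, reward in samples) / episodes
    gradient = [0.0] * len(keep_probs)
    for actions, reward in samples:
        for i, (a, p) in enumerate(zip(actions, keep_probs)):
            # d/d(logit) of log Bernoulli(a; sigmoid(logit)) equals a - p.
            gradient[i] += (reward - baseline) * (a - p) / episodes
    return gradient

rng = random.Random(0)
keep_probs = [0.5] * 20  # hypothetical initial policy: keep half of the frames
gradient = reinforce_gradient(keep_probs, target_fraction=0.15, rng=rng)
```

With a target of 15% and an initial policy that keeps about half of the frames, the summed gradient is negative, so a gradient-ascent step would lower the keep probabilities toward the prescribed length.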