Move Forward and Tell: A Progressive Generator of Video Descriptions
We present an efficient framework that can generate a coherent paragraph to
describe a given video. Previous works on video captioning usually focus on
video clips. They typically treat an entire video as a whole and generate the
caption conditioned on a single embedding. In contrast, we consider videos
with rich temporal structures and aim to generate paragraph descriptions that
can preserve the story flow while being coherent and concise. Towards this
goal, we propose a new approach, which produces a descriptive paragraph by
assembling temporally localized descriptions. Given a video, it selects a
sequence of distinctive clips and generates sentences thereon in a coherent
manner. Particularly, the selection of clips and the production of sentences
are done jointly and progressively driven by a recurrent network -- what to
describe next depends on what has been said before. Here, the recurrent
network is learned via self-critical sequence training with both sentence-level
and paragraph-level rewards. On the ActivityNet Captions dataset, our method
demonstrated the capability of generating high-quality paragraph descriptions
for videos. Compared to those by other methods, the descriptions produced by
our method are often more relevant, more coherent, and more concise. Comment: Accepted by ECCV 201
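The abstract above mentions self-critical sequence training with sentence-level and paragraph-level rewards. The following is a minimal sketch of that general idea only, not the authors' implementation: the reward of a sampled output is compared against the reward of a greedy-decoded baseline, and the difference scales the log-likelihood of the sample. The function names, the linear mixing of the two rewards, and the `weight` parameter are assumptions made for illustration.

```python
def combined_reward(sentence_reward, paragraph_reward, weight=0.5):
    """Mix a sentence-level and a paragraph-level reward.

    The linear mix and the default weight are assumptions for this sketch.
    """
    return weight * sentence_reward + (1.0 - weight) * paragraph_reward


def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled sequence.

    sample_logprobs: log-probabilities of the sampled tokens.
    sample_reward:   reward of the sampled sequence.
    greedy_reward:   reward of the greedy-decoded sequence (the baseline).
    """
    # Positive advantage: the sample beat the greedy baseline, so
    # minimizing the loss raises the sample's log-likelihood.
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


if __name__ == "__main__":
    reward = combined_reward(sentence_reward=0.9, paragraph_reward=0.7)
    print(scst_loss([-0.5, -1.0], sample_reward=reward, greedy_reward=0.6))
```

In a real training loop the log-probabilities would come from the model and the loss would be differentiated with an autograd framework; plain floats are used here only to keep the sketch self-contained.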
Video Summarization Using Deep Neural Networks: A Survey
Video summarization technologies aim to create a concise and complete
synopsis by selecting the most informative parts of the video content. Several
approaches have been developed over the last couple of decades and the current
state of the art is represented by methods that rely on modern deep neural
network architectures. This work focuses on the recent advances in the area and
provides a comprehensive survey of the existing deep-learning-based methods for
generic video summarization. After presenting the motivation behind the
development of technologies for video summarization, we formulate the video
summarization task and discuss the main characteristics of a typical
deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the
existing algorithms and provide a systematic review of the relevant literature
that shows the evolution of the deep-learning-based video summarization
technologies and leads to suggestions for future developments. We then report
on protocols for the objective evaluation of video summarization algorithms and
we compare the performance of several deep-learning-based approaches. Based on
the outcomes of these comparisons, as well as some documented considerations
about the suitability of evaluation protocols, we indicate potential future
research directions. Comment: Journal paper; Under revie
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
Video summarization intends to produce a concise video summary by effectively
capturing and combining the most informative parts of the whole content.
Existing approaches for video summarization regard the task as a frame-wise
keyframe selection problem and generally construct the frame-wise
representation by combining the long-range temporal dependency with the
unimodal or bimodal information. However, an optimal video summary needs to
reflect the most valuable keyframes, each carrying its own information as well
as the semantics of the whole content. Thus, it is critical to construct a more
powerful and robust frame-wise representation and predict the frame-level
importance score in a fair and comprehensive manner. To tackle the above
issues, we propose a multimodal hierarchical shot-aware convolutional network,
denoted as MHSCNet, to enhance the frame-wise representation via combining the
comprehensive available multimodal information. Specifically, we design a
hierarchical ShotConv network to incorporate the adaptive shot-aware
frame-level representation by considering the short-range and long-range
temporal dependency. Based on the learned shot-aware representations, MHSCNet
can predict the frame-level importance score in the local and global view of
the video. Extensive experiments on two standard video summarization datasets
demonstrate that our proposed method consistently outperforms state-of-the-art
baselines. Source code will be made publicly available.
Self-Supervised and Controlled Multi-Document Opinion Summarization
We address the problem of unsupervised abstractive summarization of
collections of user generated reviews with self-supervision and control. We
propose a self-supervised setup that considers an individual document as a
target summary for a set of similar documents. This setting makes training
simpler than previous approaches by relying only on standard log-likelihood
loss. We address the problem of hallucinations through the use of control
codes, to steer the generation towards more coherent and relevant
summaries. Finally, we extend the Transformer architecture to allow for multiple
reviews as input. Our benchmarks on two datasets against graph-based and recent
neural abstractive unsupervised models show that our proposed method generates
summaries with superior quality and relevance. This is confirmed in our human
evaluation, which focuses explicitly on the faithfulness of generated summaries.
We also provide an ablation study, which shows the importance of the control
setup in controlling hallucinations and achieving high sentiment and topic
alignment of the summaries with the input reviews. Comment: 18 pages including 5 pages appendi
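The self-supervised setup described above treats an individual document as the target summary for a set of similar documents. A rough sketch of how such training pairs could be assembled follows. The Jaccard token overlap used as the similarity measure, the function names, and the `n_inputs` parameter are all assumptions for illustration; the abstract does not say how similar documents are chosen.

```python
def jaccard(a, b):
    """Token-set overlap between two texts (an assumed similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_pairs(reviews, n_inputs=2):
    """Return (input_reviews, target_review) pairs.

    Each review serves once as the pseudo-summary target; its inputs are
    the n_inputs most similar *other* reviews.
    """
    pairs = []
    for i, target in enumerate(reviews):
        scored = [(jaccard(target, r), j)
                  for j, r in enumerate(reviews) if j != i]
        # Highest similarity first; ties broken by original position.
        scored.sort(key=lambda t: (-t[0], t[1]))
        inputs = [reviews[j] for _, j in scored[:n_inputs]]
        pairs.append((inputs, target))
    return pairs


if __name__ == "__main__":
    reviews = ["great pizza and service", "great pizza",
               "slow service", "terrible parking"]
    for inputs, target in build_pairs(reviews):
        print(target, "<-", inputs)
```

A model trained on such pairs with a standard log-likelihood loss learns to produce one review from several related ones, which is the kind of simplification the abstract points to.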
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization
Previous approaches for video summarization mainly concentrate on finding the
most diverse and representative visual contents as video summary without
considering the user's preference. This paper addresses the task of
query-focused video summarization, which takes a user's query and a long video as
inputs and aims to generate a query-focused video summary. In this paper, we
consider the task as a problem of computing similarity between video shots and
the query. To this end, we propose a method, named Convolutional Hierarchical
Attention Network (CHAN), which consists of two parts: a feature encoding network
and a query-relevance computing module. In the encoding network, we employ a
convolutional network with a local self-attention mechanism and a query-aware
global attention mechanism to learn visual information of each shot. The
encoded features are then sent to the query-relevance computing module to generate
the query-focused video summary. Extensive experiments on the benchmark dataset
demonstrate the competitive performance and show the effectiveness of our
approach. Comment: Accepted by AAAI 2020 Conferenc
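The framing above, query-focused summarization as similarity between video shots and a query, can be illustrated with a small sketch. This is not the CHAN architecture: it assumes shot features and a query embedding are already available as plain vectors, and it uses cosine similarity with top-k selection, both chosen here purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def query_focused_summary(shot_features, query_embedding, k):
    """Return indices of the k shots most similar to the query.

    Indices are returned in temporal order so the summary keeps the
    original ordering of the video.
    """
    scores = [cosine(f, query_embedding) for f in shot_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    shots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(query_focused_summary(shots, [1.0, 0.0], k=2))
```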
AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization
This paper presents a new method for unsupervised video summarization. The proposed architecture embeds an Actor-Critic model into a Generative Adversarial Network and formulates the selection of important video fragments (that will be used to form the summary) as a sequence generation task. The Actor and the Critic take part in a game that incrementally leads to the selection of the video key-fragments, and their choices at each step of the game result in a set of rewards from the Discriminator. The designed training workflow allows the Actor and Critic to discover a space of actions and automatically learn a policy for key-fragment selection. Moreover, the introduced criterion for choosing the best model after the training ends enables the automatic selection of proper values for parameters of the training process that are not learned from the data (such as the regularization factor σ). Experimental evaluation on two benchmark datasets (SumMe and TVSum) demonstrates that the proposed AC-SUM-GAN model performs consistently well and gives state-of-the-art results in comparison to unsupervised methods, which are also competitive with respect to supervised methods.