Discriminative Feature Learning for Unsupervised Video Summarization
In this paper, we address the problem of unsupervised video summarization
that automatically extracts key-shots from an input video. Specifically, we
tackle two critical issues based on our empirical observations: (i) Ineffective
feature learning due to flat distributions of output importance scores for each
frame, and (ii) training difficulty when dealing with long-length video inputs.
To alleviate the first problem, we propose a simple yet effective
regularization loss term called variance loss. The proposed variance loss
allows a network to predict output scores for each frame with high
discrepancy, which enables effective feature learning and significantly
improves model
performance. For the second problem, we design a novel two-stream network named
Chunk and Stride Network (CSNet) that utilizes local (chunk) and global
(stride) temporal views of the video features. Our CSNet gives better
summarization results for long-length videos than existing methods.
In addition, we introduce an attention mechanism to handle the dynamic
information in videos. We demonstrate the effectiveness of the proposed methods
by conducting extensive ablation studies and show that our final model achieves
new state-of-the-art results on two benchmark datasets.
Comment: Accepted to AAAI 2019.
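To make the two ideas above concrete, here is a minimal PyTorch-style
sketch (our illustration, not the authors' released code): a
reciprocal-of-variance regularizer that pushes per-frame scores apart, and
a helper that splits a frame-feature sequence into chunk (local) and stride
(global) views. The exact loss form, the number of views n, and the tensor
shapes are assumptions.

import torch

def variance_loss(scores: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # scores: (B, T) per-frame importance scores in [0, 1].
    # Penalizing the reciprocal of the variance rewards high-discrepancy
    # (spread-out) scores; the paper's exact formulation may differ.
    return (1.0 / (scores.var(dim=-1) + eps)).mean()

def chunk_and_stride_views(feats: torch.Tensor, n: int = 4):
    # feats: (T, dim) frame features. "Chunk" yields n contiguous segments
    # (local temporal view); "stride" yields n interleaved subsequences
    # taking every n-th frame (global temporal view).
    T = feats.size(0)
    chunks = list(feats.split((T + n - 1) // n, dim=0))
    strides = [feats[i::n] for i in range(n)]
    return chunks, strides

In a two-stream setup of this kind, each view would be encoded separately
and the per-view scores fused back into a single frame-level score sequence.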
Query Twice: Dual Mixture Attention Meta Learning for Video Summarization
Video summarization aims to select representative frames that retain the
high-level information of a video, a task usually solved by predicting
segment-wise importance scores via a softmax function. However, the softmax
function struggles to retain high-rank representations for complex visual
or sequential information, a limitation known as the Softmax Bottleneck
problem. In this paper, we propose a novel framework named Dual Mixture
Attention (DMASum) with Meta Learning for video summarization that tackles
the Softmax Bottleneck problem. The Mixture of Attention layer (MoA)
effectively increases model capacity by applying self-query attention a
second time, capturing second-order changes in addition to the initial
query-key attention, and a novel Single Frame Meta Learning rule is then
introduced to generalize better to small datasets with limited training
sources. Furthermore, DMASum exploits both visual and sequential attention,
connecting local key-frame and global attention in an accumulative way. We
adopt the new evaluation protocol on two public datasets, SumMe and TVSum.
Both qualitative and quantitative
experiments demonstrate significant improvements over the state-of-the-art
methods.
Comment: This manuscript has been accepted at ACM MM 2020.
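As a rough illustration of the "query twice" idea, the sketch below runs a
second attention pass whose query is the output of the first, so that
second-order changes can be captured on top of the initial query-key
attention. This is our own minimal reading, not the DMASum implementation;
in the full model several such attention distributions would be combined
with learned mixture weights (cf. mixture of softmaxes) to raise the
effective rank, which we omit for brevity.

import torch
import torch.nn as nn

class QueryTwiceAttention(nn.Module):
    # Illustrative module; layer names and sizes are assumptions.
    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim)
        self.wk = nn.Linear(dim, dim)
        self.wv = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def attend(self, q, k, v):
        # standard scaled dot-product attention
        w = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return w @ v

    def forward(self, x):
        # x: (B, T, dim) frame features
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        first = self.attend(q, k, v)                # initial query-key pass
        second = self.attend(self.wq(first), k, v)  # query again with the attended output
        return second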
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
Video summarization aims to produce a concise video summary by effectively
capturing and combining the most informative parts of the whole content.
Existing approaches for video summarization regard the task as a frame-wise
keyframe selection problem and generally construct the frame-wise
representation by combining the long-range temporal dependency with the
unimodal or bimodal information. However, an optimal video summary needs to
reflect both the most valuable keyframes in their own right and their
semantic role in the whole content. Thus, it is critical to construct a more
powerful and robust frame-wise representation and predict the frame-level
importance score in a fair and comprehensive manner. To tackle the above
issues, we propose a multimodal hierarchical shot-aware convolutional network,
denoted as MHSCNet, to enhance the frame-wise representation by combining
all of the available multimodal information. Specifically, we design a
hierarchical ShotConv network that produces an adaptive shot-aware
frame-level representation accounting for both short-range and long-range
temporal dependencies. Based on the learned shot-aware representations,
MHSCNet can predict the frame-level importance score from both the local
and the global view of
the video. Extensive experiments on two standard video summarization datasets
demonstrate that our proposed method consistently outperforms state-of-the-art
baselines. Source code will be made publicly available.
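A shot-aware convolution of this kind could be approximated roughly as
below (our sketch under assumed shapes and kernel sizes, not the authors'
code): parallel 1D convolutions with different dilation rates stand in for
the short-range and long-range branches over fused multimodal frame
features, and a small head predicts a per-frame importance score.

import torch
import torch.nn as nn

class ShotAwareConvSketch(nn.Module):
    # Illustrative stand-in for a hierarchical ShotConv block.
    def __init__(self, dim: int):
        super().__init__()
        self.short = nn.Conv1d(dim, dim, kernel_size=3, padding=1)             # short-range context
        self.long = nn.Conv1d(dim, dim, kernel_size=3, padding=4, dilation=4)  # long-range context
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, dim) fused multimodal frame features
        x = feats.transpose(1, 2)                                      # (B, dim, T)
        local = torch.relu(self.short(x))
        global_ctx = torch.relu(self.long(x))
        fused = torch.cat([local, global_ctx], dim=1).transpose(1, 2)  # (B, T, 2*dim)
        return torch.sigmoid(self.score(fused)).squeeze(-1)            # (B, T) scores in (0, 1)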