Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media
This paper presents a web-based tool that facilitates the production of
tailored summaries for online sharing on social media. Through an interactive
user interface, it supports a "one-click" video summarization process. Building on integrated AI models for video summarization and aspect-ratio transformation, it generates multiple summaries of a full-length video according to the needs of the target platforms with regard to the video's length and aspect ratio.
Comment: Accepted for publication, 30th Int. Conf. on MultiMedia Modeling (MMM 2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript" version.
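As a purely illustrative sketch of how such multi-target generation could be driven, the following Python snippet maps hypothetical platform specifications to one rendered summary each. The platform names, duration limits, and aspect ratios are assumptions rather than values from the paper, and summarize() and retarget() are placeholder stand-ins for the tool's integrated AI models.

```python
# Hypothetical platform specs; names and limits are illustrative assumptions.
PLATFORM_SPECS = {
    "instagram_reels": {"max_duration_s": 90, "aspect_ratio": "9:16"},
    "youtube_shorts": {"max_duration_s": 60, "aspect_ratio": "9:16"},
    "x_feed": {"max_duration_s": 140, "aspect_ratio": "16:9"},
}

def summarize(video_path: str, max_duration_s: int) -> str:
    # Placeholder for the integrated video-summarization model.
    return f"{video_path}|summary<{max_duration_s}s"

def retarget(summary: str, aspect_ratio: str) -> str:
    # Placeholder for the aspect-ratio transformation model.
    return f"{summary}@{aspect_ratio}"

def render_all(video_path: str) -> dict:
    """Produce one tailored summary per target platform."""
    return {
        name: retarget(summarize(video_path, spec["max_duration_s"]),
                       spec["aspect_ratio"])
        for name, spec in PLATFORM_SPECS.items()
    }
```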
Dilated Temporal Relational Adversarial Network for Generic Video Summarization
The large number of videos appearing every day makes it increasingly critical that the key information within them can be extracted and understood in a very short time. Video summarization, the task of finding the smallest subset of frames that still conveys the whole story of a given video, is thus of great significance for improving the efficiency of video understanding. We propose a
novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to
achieve frame-level video summarization. Given a video, it selects the set of
key frames, which contain the most meaningful and compact information.
Specifically, DTR-GAN learns a dilated temporal relational generator and a
discriminator with three-player loss in an adversarial manner. A new dilated
temporal relation (DTR) unit is introduced to enhance temporal representation
capturing. The generator uses this unit to effectively exploit global
multi-scale temporal context to select key frames and to complement the
commonly used Bi-LSTM. To ensure that summaries capture enough of the key video content from a global perspective, rather than being a trivial, randomly shortened sequence, we present a discriminator that learns to enforce both the
information completeness and compactness of summaries via a three-player loss.
The loss includes the generated summary loss, the random summary loss, and the
real summary (ground-truth) loss, which play important roles in regularizing the learned model to obtain useful summaries. Comprehensive experiments on three public datasets show the effectiveness of the proposed approach.
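To make the three-player objective concrete, here is a minimal PyTorch sketch of how a discriminator could be trained against a ground-truth summary, a generated summary, and a random summary, as the abstract describes. The discriminator interface, the binary-cross-entropy formulation, and the tensor shapes are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, video, gen_summary, rand_summary, gt_summary):
    # Score ground-truth summaries as real (1) and both the generated and
    # the trivially random summaries as fake (0); penalizing the random
    # summary discourages degenerate, random-like frame selections.
    real = D(video, gt_summary)
    fake_gen = D(video, gen_summary.detach())   # detach: don't update G here
    fake_rand = D(video, rand_summary)
    return (F.binary_cross_entropy(real, torch.ones_like(real))
            + F.binary_cross_entropy(fake_gen, torch.zeros_like(fake_gen))
            + F.binary_cross_entropy(fake_rand, torch.zeros_like(fake_rand)))

def generator_loss(D, video, gen_summary):
    # The generator tries to make its key-frame summary indistinguishable
    # from a ground-truth summary under the discriminator.
    score = D(video, gen_summary)
    return F.binary_cross_entropy(score, torch.ones_like(score))
```

In a full training loop the two losses would be optimized alternately, as in a standard GAN.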
Video Summarization Using Deep Neural Networks: A Survey
Video summarization technologies aim to create a concise and complete
synopsis by selecting the most informative parts of the video content. Several
approaches have been developed over the last couple of decades and the current
state of the art is represented by methods that rely on modern deep neural
network architectures. This work focuses on the recent advances in the area and
provides a comprehensive survey of the existing deep-learning-based methods for
generic video summarization. After presenting the motivation behind the
development of technologies for video summarization, we formulate the video
summarization task and discuss the main characteristics of a typical
deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the
existing algorithms and provide a systematic review of the relevant literature
that shows the evolution of the deep-learning-based video summarization
technologies and leads to suggestions for future developments. We then report
on protocols for the objective evaluation of video summarization algorithms and
we compare the performance of several deep-learning-based approaches. Based on
the outcomes of these comparisons, as well as some documented considerations
about the suitability of evaluation protocols, we indicate potential future
research directions.
Comment: Journal paper; under review.
Query Twice: Dual Mixture Attention Meta Learning for Video Summarization
Video summarization aims to select representative frames to retain high-level
information, which is usually solved by predicting the segment-wise importance
score via a softmax function. However, the softmax function struggles to retain high-rank representations for complex visual or sequential information, a limitation known as the Softmax Bottleneck problem. In this paper, we propose a novel
framework, the Dual Mixture Attention (DMASum) model with Meta Learning, for video summarization that tackles the Softmax Bottleneck problem. Its Mixture of Attention layer (MoA) effectively increases model capacity by applying self-query attention twice, capturing second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is introduced to improve generalization to small datasets with limited training data. Furthermore, DMASum
exploits both visual and sequential attention that connects local key-frame and
global attention in an accumulative way. We adopt the new evaluation protocol
on two public datasets, SumMe and TVSum. Both qualitative and quantitative experiments demonstrate significant improvements over state-of-the-art methods.
Comment: This manuscript has been accepted at ACM MM 2020.
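To illustrate the mixture idea behind the MoA layer, here is a minimal sketch in the spirit of the mixture-of-softmaxes remedy for the Softmax Bottleneck: several softmax attention maps are combined with learned per-frame mixture weights, raising the rank of the resulting attention matrix. The single-head formulation, layer sizes, and mixing scheme are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MixtureAttention(nn.Module):
    """Mixture of several softmax attention maps (cf. mixture-of-softmaxes)."""
    def __init__(self, dim: int, n_mix: int = 4):
        super().__init__()
        self.queries = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_mix)])
        self.key = nn.Linear(dim, dim)
        self.prior = nn.Linear(dim, n_mix)  # learned per-frame mixture weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, dim) per-frame features
        k = self.key(x)
        pi = torch.softmax(self.prior(x), dim=-1)            # (T, n_mix)
        attn = x.new_zeros(len(x), len(x))
        for m, q_proj in enumerate(self.queries):
            scores = q_proj(x) @ k.t() / k.shape[-1] ** 0.5  # (T, T)
            # Convex combination of softmaxes: higher rank than one softmax.
            attn = attn + pi[:, m:m + 1] * torch.softmax(scores, dim=-1)
        return attn @ x                                      # attended features
```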
Masked Autoencoder for Unsupervised Video Summarization
Summarizing a video requires a diverse understanding of its content, ranging from recognizing scenes to evaluating whether each frame is essential enough to be selected for the summary. Self-supervised learning (SSL) is acknowledged for its robustness and flexibility across multiple downstream tasks, but video SSL has not yet shown its value for dense understanding tasks like video summarization. We claim that an unsupervised autoencoder with sufficient self-supervised training needs neither extra downstream architecture design nor fine-tuned weights to be utilized as a video summarization model. The proposed method evaluates the importance score of each frame by taking advantage of the reconstruction score of the autoencoder's decoder. We evaluate the method on major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.
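A minimal sketch of the scoring idea, assuming per-frame features and a plain (unmasked) reconstruction pass; the paper's masking scheme and the exact mapping from reconstruction quality to importance are not specified here, so both are assumptions.

```python
import torch

@torch.no_grad()
def frame_importance(autoencoder, frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) per-frame features; returns (T,) scores in [0, 1]."""
    recon = autoencoder(frames)                    # (T, D) reconstruction
    err = ((recon - frames) ** 2).mean(dim=-1)     # per-frame MSE
    # Min-max normalize the reconstruction errors to comparable scores;
    # whether high or low error signals importance is a design choice.
    return (err - err.min()) / (err.max() - err.min() + 1e-8)
```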
Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems
Predicting the future location of vehicles is essential for safety-critical
applications such as advanced driver assistance systems (ADAS) and autonomous
driving. This paper introduces a novel approach to simultaneously predict both
the location and scale of target vehicles in the first-person (egocentric) view
of an ego-vehicle. We present a multi-stream recurrent neural network (RNN) encoder-decoder model that separately captures object location and scale, and pixel-level observations, for future vehicle localization. We show that
incorporating dense optical flow improves prediction results significantly
since it captures information about motion as well as appearance change. We
also find that explicitly modeling future motion of the ego-vehicle improves
the prediction accuracy, which could be especially beneficial in intelligent
and automated vehicles that have motion planning capability. To evaluate the
performance of our approach, we present a new dataset of first-person videos
collected from a variety of scenarios at road intersections, which are
particularly challenging moments for prediction because vehicle trajectories are diverse and dynamic.
Comment: To appear at ICRA 2019.
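As a rough illustration of a multi-stream encoder-decoder for this task, the sketch below encodes past bounding boxes on one GRU stream and pooled optical-flow features on another, then rolls out future box offsets autoregressively. All dimensions, the GRU choice, and the fusion-by-concatenation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiStreamPredictor(nn.Module):
    """Two-stream GRU encoder-decoder for future bounding-box prediction."""
    def __init__(self, flow_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.box_enc = nn.GRU(4, hidden, batch_first=True)   # (cx, cy, w, h)
        self.flow_enc = nn.GRU(flow_dim, hidden, batch_first=True)
        self.dec = nn.GRUCell(4, 2 * hidden)
        self.out = nn.Linear(2 * hidden, 4)

    def forward(self, boxes, flow, horizon: int = 10):
        _, hb = self.box_enc(boxes)            # boxes: (B, T, 4)
        _, hf = self.flow_enc(flow)            # flow:  (B, T, flow_dim)
        h = torch.cat([hb[-1], hf[-1]], dim=-1)
        box, preds = boxes[:, -1], []
        for _ in range(horizon):               # autoregressive rollout
            h = self.dec(box, h)
            box = box + self.out(h)            # predict box offsets
            preds.append(box)
        return torch.stack(preds, dim=1)       # (B, horizon, 4)
```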
Video Summarization Using Unsupervised Deep Learning
In this thesis, we address the task of video summarization using unsupervised deep-learning architectures. Video summarization aims to generate a short summary by selecting the most informative and important frames (key-frames) or fragments (key-fragments) of the full-length video and presenting them in a temporally-ordered fashion. Our objective is to overcome observed weaknesses of existing video summarization approaches that utilize RNNs for modeling the temporal dependence of frames, related to: i) the small influence of the estimated frame-level importance scores on the created video summary, ii) the insufficiency of RNNs for modeling long-range frame dependencies, and iii) the small number of parallelizable operations during the training of RNNs.

To address the first weakness, we propose a new unsupervised network architecture, called AC-SUM-GAN, which formulates the selection of important video fragments as a sequence generation task and learns this task by embedding an Actor-Critic model in a Generative Adversarial Network. The feedback of a trainable Discriminator is used as a reward by the Actor-Critic model in order to explore a space of actions and learn a value function (Critic) and a policy (Actor) for video fragment selection.

To tackle the remaining weaknesses, we investigate the use of attention mechanisms for video summarization and propose a new supervised network architecture, called PGL-SUM, that combines global and local multi-head attention mechanisms which take into account the temporal position of the video frames, in order to discover different modelings of the frames' dependencies at different levels of granularity. Based on the acquired experience, we then propose a new unsupervised network architecture, called CA-SUM, which estimates the frames' importance using a novel concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix and takes into account the attentive uniqueness and diversity of the associated frames of the video.

All the proposed architectures have been extensively evaluated on the most commonly-used benchmark datasets, demonstrating their competitiveness against other approaches and documenting the contribution of our proposals to advancing the current state of the art in video summarization.

Finally, we make a first attempt at producing explanations for the video summarization results. Inspired by relevant works in the Natural Language Processing domain, we propose an attention-based method for explainable video summarization and evaluate the performance of various explanation signals using our CA-SUM architecture and two benchmark datasets for video summarization. The experimental results indicate the advanced performance of explanation signals formed using the inherent attention weights, and demonstrate the ability of the proposed method to explain the video summarization results using clues about the focus of the attention mechanism.
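To illustrate the concentrated-attention idea mentioned for CA-SUM, the sketch below masks a self-attention matrix so that each frame attends only within a non-overlapping block on the main diagonal. The block size and the scaled dot-product form are assumptions, and the mechanism's uniqueness/diversity modeling is omitted; this is not the thesis's exact formulation.

```python
import torch

def block_diagonal_mask(n_frames: int, block: int) -> torch.Tensor:
    """Boolean (T, T) mask that is True inside non-overlapping diagonal blocks."""
    idx = torch.arange(n_frames) // block
    return idx.unsqueeze(0) == idx.unsqueeze(1)

def concentrated_attention(features: torch.Tensor, block: int = 8):
    """features: (T, D). Attention outside the diagonal blocks is masked out."""
    scores = features @ features.t() / features.shape[-1] ** 0.5
    scores = scores.masked_fill(~block_diagonal_mask(len(features), block),
                                float("-inf"))
    # Each row keeps at least its own block, so the softmax is well-defined.
    return torch.softmax(scores, dim=-1) @ features
```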