Dilated Temporal Relational Adversarial Network for Generic Video Summarization
The large number of videos appearing every day makes it increasingly
critical that key information within videos can be extracted and understood in
a very short time. Video summarization, the task of finding the smallest subset
of frames that still conveys the whole story of a given video, is thus of
great significance for improving the efficiency of video understanding. We propose a
novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to
achieve frame-level video summarization. Given a video, DTR-GAN selects the set of
key frames that contain the most meaningful and compact information.
Specifically, DTR-GAN learns a dilated temporal relational generator and a
discriminator with a three-player loss in an adversarial manner. A new dilated
temporal relation (DTR) unit is introduced to enhance the capture of temporal
representations. The generator uses this unit to effectively exploit global
multi-scale temporal context to select key frames and to complement the
commonly used Bi-LSTM. To ensure that summaries capture sufficient key video
representation from a global perspective, rather than a trivially shortened
random sequence, we present a discriminator that learns to enforce both the
information completeness and the compactness of summaries via a three-player loss.
The loss includes the generated summary loss, the random summary loss, and the
real summary (ground-truth) loss, which play important roles in better
regularizing the learned model to obtain useful summaries. Comprehensive
experiments on three public datasets show the effectiveness of the proposed
approach.
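The two core ideas of the abstract — multi-scale dilated temporal aggregation and a discriminator loss over real, generated, and random summaries — can be sketched roughly as follows. This is a minimal illustrative NumPy sketch with fixed (unlearned) aggregation and hypothetical function names, not the paper's actual implementation:

```python
import numpy as np

def dtr_unit(features, dilations=(1, 2, 4)):
    """Illustrative dilated temporal relation: for each frame t, average
    its features with the neighbours at offsets +/-d for several dilation
    rates d, then fuse the scales by averaging.  The paper's unit is
    learned; the fixed averaging here only conveys the multi-scale idea."""
    T, D = features.shape
    scales = []
    for d in dilations:
        ctx = np.zeros_like(features)
        for t in range(T):
            left = features[max(t - d, 0)]
            right = features[min(t + d, T - 1)]
            ctx[t] = (left + features[t] + right) / 3.0
        scales.append(ctx)
    return np.mean(scales, axis=0)  # fuse multi-scale contexts

def three_player_loss(d_real, d_generated, d_random):
    """Discriminator objective over the three summary types: score the
    ground-truth summary as real and both the generated and the random
    summaries as fake (standard GAN cross-entropy form, assumed here)."""
    eps = 1e-8
    return -(np.log(d_real + eps)
             + np.log(1.0 - d_generated + eps)
             + np.log(1.0 - d_random + eps))
```

A perfect discriminator (scoring the real summary 1 and both fakes 0) drives this loss to zero; the random-summary term is what penalizes trivially shortened sequences.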
Effective video summarization approach based on visual attention
Video summarization is applied to reduce redundancy and develop a concise representation of the key frames in a video. More recently, video summaries have been generated through visual attention modeling: the frames that stand out visually are extracted as key frames based on theories of human attention. Such visual attention schemes have proven effective for video summarization; nevertheless, their high computational cost restricts their usability in everyday situations. In this context, we propose a key frame extraction (KFE) method built on an efficient and accurate visual attention model. The computational effort is minimized by computing dynamic visual saliency from the temporal gradient instead of traditional optical flow techniques. In addition, an efficient technique based on the discrete cosine transform is utilized for static visual salience. The dynamic and static visual attention measures are merged by means of a non-linear weighted fusion technique. Results of the system are compared with several existing state-of-the-art techniques in terms of accuracy. The experimental results of our proposed model indicate its efficiency and high quality in terms of the extracted key frames.

Qatar University - No. IRCC-2021-010
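The pipeline the abstract describes — temporal-gradient dynamic saliency, DCT-based static saliency, non-linear weighted fusion, and key-frame selection — might look roughly like the NumPy sketch below. The concrete saliency models (an image-signature-style DCT sign map, absolute frame differencing), the quadratic fusion, and the top-k selection rule are all assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

def static_saliency(frame):
    """DCT-signature-style static saliency: reconstruct the frame from
    the sign of its DCT coefficients and square the result (one known
    DCT-based saliency model; the paper's exact model may differ)."""
    mr, mc = dct_matrix(frame.shape[0]), dct_matrix(frame.shape[1])
    coeffs = mr @ frame @ mc.T            # forward 2-D DCT
    recon = mr.T @ np.sign(coeffs) @ mc   # inverse DCT of the sign map
    return recon ** 2

def dynamic_saliency(prev_frame, frame):
    """Temporal-gradient motion cue: absolute frame difference, a cheap
    stand-in for optical flow."""
    return np.abs(frame - prev_frame)

def fused_attention(prev_frame, frame, w=0.5):
    """Non-linear weighted fusion of the normalized saliency maps
    (quadratic weighting assumed); returns a per-frame attention score."""
    s = static_saliency(frame)
    d = dynamic_saliency(prev_frame, frame)
    s = s / (s.max() + 1e-8)
    d = d / (d.max() + 1e-8)
    return float((w * s ** 2 + (1 - w) * d ** 2).mean())

def key_frames(frames, top_k=2):
    """Score every frame against its predecessor and return the indices
    of the top-k attention scores (simple selection rule, assumed)."""
    scores = [fused_attention(frames[t - 1], frames[t])
              for t in range(1, len(frames))]
    order = np.argsort(scores)[::-1][:top_k]
    return sorted(int(i) + 1 for i in order)
```

The frame-difference dynamic term is what keeps the cost low relative to optical flow: it needs one subtraction per pixel rather than an iterative flow estimate.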