
    Query-controllable Video Summarization

    When video collections become huge, exploring efficiently both within and across videos is challenging. Video summarization is one way to tackle this issue. Traditional summarization approaches limit the effectiveness of video exploration because they generate only one fixed summary for a given input video, independent of the user's information need. In this work, we introduce a method that takes a text-based query as input and generates a video summary corresponding to it. We model video summarization as a supervised learning problem and propose an end-to-end deep-learning-based method for query-controllable video summarization that generates a query-dependent video summary. Our proposed method consists of a video summary controller, a video summary generator, and a video summary output module. To foster research on query-controllable video summarization and to conduct our experiments, we introduce a dataset that contains frame-based relevance score labels. Our experimental results show that the text-based query helps control the video summary and improves our model's performance. Our code and dataset: https://github.com/Jhhuangkay/Query-controllable-Video-Summarization.

    Comment: This paper is accepted by the ACM International Conference on Multimedia Retrieval (ICMR), 2020.
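
    As an illustration of the pipeline the abstract describes, below is a minimal PyTorch sketch of query-conditioned frame scoring: a text-query embedding is fused with per-frame features to regress frame-level relevance scores, from which a query-dependent summary is selected. The module names, dimensions, and concatenation-based fusion are assumptions for illustration, not the authors' exact architecture.

```python
# A minimal sketch of query-conditioned frame scoring (illustrative, not the
# authors' exact controller/generator/output design).
import torch
import torch.nn as nn

class QueryControllableSummarizer(nn.Module):
    def __init__(self, vocab_size=10000, query_dim=256, frame_dim=2048):
        super().__init__()
        self.query_embed = nn.EmbeddingBag(vocab_size, query_dim)  # bag-of-words query encoder
        self.frame_proj = nn.Linear(frame_dim, query_dim)          # project CNN frame features
        self.scorer = nn.Sequential(                               # fused features -> relevance
            nn.Linear(2 * query_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, query_tokens, frame_feats):
        # query_tokens: (1, n_tokens) token ids; frame_feats: (n_frames, frame_dim)
        q = self.query_embed(query_tokens)            # (1, query_dim)
        f = self.frame_proj(frame_feats)              # (n_frames, query_dim)
        q = q.expand(f.size(0), -1)                   # broadcast the query to every frame
        fused = torch.cat([f, q], dim=-1)             # simple concatenation fusion
        return self.scorer(fused).squeeze(-1)         # (n_frames,) relevance scores

# Usage: train with a regression loss against the frame-based relevance labels,
# then keep the top-k scoring frames (in temporal order) as the summary.
model = QueryControllableSummarizer()
scores = model(torch.randint(0, 10000, (1, 5)), torch.randn(120, 2048))
summary_idx = scores.topk(k=10).indices.sort().values  # temporally ordered key-frames
```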

    DeepOpht: Medical Report Generation for Retinal Images via Deep Models and Visual Explanation

    In this work, we propose an AI-based method that aims to improve the conventional retinal disease treatment procedure and help ophthalmologists increase diagnosis efficiency and accuracy. The proposed method is composed of a deep-neural-network-based (DNN-based) module, comprising a retinal disease identifier and a clinical description generator, and a DNN visual explanation module. To train our DNN-based module and validate its effectiveness, we propose a large-scale retinal disease image dataset. As ground truth, we also provide a retinal image dataset manually labeled by ophthalmologists to qualitatively show that the proposed AI-based method is effective. Our experimental results show that the proposed method is quantitatively and qualitatively effective: it is capable of creating meaningful retinal image descriptions and visual explanations that are clinically relevant.

    Comment: Accepted to IEEE WACV 2021.
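
    For concreteness, here is a minimal PyTorch sketch of the three parts the abstract names: a CNN-based disease identifier, a description generator (an LSTM decoder seeded with image features), and a visual explanation, for which a Grad-CAM-style heatmap is used here as a stand-in. The toy backbone, class count, and vocabulary size are illustrative assumptions, not DeepOpht's actual design.

```python
# An illustrative sketch of the identifier / generator / explanation split
# (assumed components; not the published DeepOpht architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetinalIdentifier(nn.Module):
    def __init__(self, n_diseases=20):
        super().__init__()
        self.features = nn.Sequential(                 # toy CNN backbone
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7)
        )
        self.head = nn.Linear(32 * 7 * 7, n_diseases)

    def forward(self, x):
        fmap = self.features(x)                        # (B, 32, 7, 7), kept for the heatmap
        return self.head(fmap.flatten(1)), fmap

class DescriptionGenerator(nn.Module):
    def __init__(self, vocab_size=5000, hidden=256):
        super().__init__()
        self.init_h = nn.Linear(32 * 7 * 7, hidden)    # image features seed the decoder
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, fmap, tokens):
        h0 = self.init_h(fmap.flatten(1)).unsqueeze(0) # (1, B, hidden)
        y, _ = self.lstm(self.embed(tokens), (h0, torch.zeros_like(h0)))
        return self.out(y)                             # next-token logits per step

def gradcam(identifier, image, target_class):
    # Grad-CAM-style explanation: weight feature maps by their pooled gradients.
    logits, fmap = identifier(image.requires_grad_(True))
    fmap.retain_grad()
    logits[0, target_class].backward()
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)  # per-channel importance
    return F.relu((weights * fmap).sum(dim=1))          # (B, 7, 7) heatmap

heat = gradcam(RetinalIdentifier(), torch.randn(1, 3, 224, 224), target_class=3)
```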

    Egocentric video summarisation via purpose-oriented frame scoring and selection

    Existing video summarisation techniques are quite generic in nature, since they generally overlook the important question of what actual purpose the summary will serve. In sharp contrast with this mainstream work, it can be acknowledged that there are many possible purposes the same videos can be summarised for. Accordingly, we consider a novel perspective: summaries with a purpose. This work is an attempt both to call attention to this neglected aspect of video summarisation research, and to illustrate and explore it with two concrete purposes, focusing on first-person-view videos. The proposed purpose-oriented summarisation techniques are framed under the common (frame-level) scoring and selection paradigm, and have been tested on two egocentric datasets, BEOID and EGTEA-Gaze+. The necessary purpose-specific evaluation metrics are also introduced. The proposed approach is compared with two purpose-agnostic summarisation baselines. On the one hand, a partially agnostic method uses the scores obtained by the proposed approach, but follows a standard generic frame selection technique. On the other hand, the fully agnostic method does not use any purpose-based information, and relies on generic concepts such as diversity and representativeness. The results of the experimental work show that the proposed approaches compare favourably with both baselines. More specifically, the purpose-specific approach generally produces summaries with the best compromise between summary length and favourable purpose-specific metrics. Interestingly, it is also observed that the results of the partially agnostic baseline tend to be better than those of the fully agnostic one. These observations provide strong evidence of the advantage and relevance of purpose-specific summarisation techniques and evaluation metrics, and encourage further work on this important subject.

    Funding for open access charge: CRUE-Universitat Jaume I
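
    The (frame-level) scoring-and-selection paradigm the abstract refers to can be sketched in a few lines of Python. In this sketch the purpose-specific scorer is abstracted into a callable, and the threshold and length cap are assumed values for illustration, not the paper's settings.

```python
# A minimal sketch of generic frame scoring and selection; the purpose-specific
# part lives entirely in score_fn (here a dummy stand-in).
from typing import Callable, Sequence

def summarise(frames: Sequence, score_fn: Callable[[object], float],
              threshold: float = 0.5, max_len: int = 50) -> list[int]:
    """Return temporally ordered indices of the selected key-frames."""
    scored = [(i, score_fn(f)) for i, f in enumerate(frames)]
    keep = [(i, s) for i, s in scored if s >= threshold]   # purpose-relevant frames only
    keep.sort(key=lambda t: t[1], reverse=True)            # prefer the highest scores
    return sorted(i for i, _ in keep[:max_len])            # re-impose temporal order

# Example with a dummy scorer; a real purpose-specific scorer would rate, e.g.,
# how well a frame supports object recognition or action localisation.
summary = summarise(range(200), score_fn=lambda f: (f % 7) / 6)
```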

    Video Summarization Using Unsupervised Deep Learning

    In this thesis, we address the task of video summarization using unsupervised deep-learning architectures. Video summarization aims to generate a short summary by selecting the most informative and important frames (key-frames) or fragments (key-fragments) of the full-length video, and presenting them in a temporally ordered fashion. Our objective is to overcome observed weaknesses of existing video summarization approaches that utilize RNNs to model the temporal dependence of frames, namely: i) the small influence of the estimated frame-level importance scores on the created video summary, ii) the insufficiency of RNNs for modeling long-range dependencies between frames, and iii) the small number of parallelizable operations during the training of RNNs.

    To address the first weakness, we propose a new unsupervised network architecture, called AC-SUM-GAN, which formulates the selection of important video fragments as a sequence generation task and learns this task by embedding an Actor-Critic model in a Generative Adversarial Network. The feedback of a trainable Discriminator is used as a reward by the Actor-Critic model in order to explore a space of actions and learn a value function (Critic) and a policy (Actor) for video fragment selection. To tackle the remaining weaknesses, we investigate the use of attention mechanisms for video summarization and propose a new supervised network architecture, called PGL-SUM, which combines global and local multi-head attention mechanisms that take into account the temporal position of the video frames, in order to discover different modelings of the frames' dependencies at different levels of granularity. Based on the acquired experience, we then propose a new unsupervised network architecture, called CA-SUM, which estimates the frames' importance using a novel concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix and takes into account the attentive uniqueness and diversity of the associated frames of the video.

    All the proposed architectures have been extensively evaluated on the most commonly used benchmark datasets, demonstrating their competitiveness against other approaches and documenting the contribution of our proposals to advancing the current state of the art in video summarization. Finally, we make a first attempt at producing explanations for the video summarization results. Inspired by relevant work in the Natural Language Processing domain, we propose an attention-based method for explainable video summarization and evaluate the performance of various explanation signals using our CA-SUM architecture and two benchmark datasets for video summarization. The experimental results indicate the advanced performance of explanation signals formed using the inherent attention weights, and demonstrate the ability of the proposed method to explain the video summarization results using clues about the focus of the attention mechanism.
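
    As a rough illustration of the concentrated-attention idea described for CA-SUM, the following PyTorch sketch restricts self-attention to non-overlapping blocks on the main diagonal of the attention matrix and maps the attended features to frame-level importance scores. The block size, dimensions, and output head are assumptions for illustration, not the published CA-SUM configuration.

```python
# An illustrative block-diagonal self-attention scorer (assumed configuration,
# loosely in the spirit of concentrated attention; not the CA-SUM spec).
import torch
import torch.nn as nn

class BlockDiagonalAttentionScorer(nn.Module):
    def __init__(self, dim=1024, block=20):
        super().__init__()
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.block = block
        self.out = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, frames):                        # frames: (T, dim)
        T = frames.size(0)
        attn = self.q(frames) @ self.k(frames).t() / frames.size(1) ** 0.5
        # Mask everything outside non-overlapping blocks on the main diagonal,
        # so each frame attends only to its local temporal neighbourhood.
        idx = torch.arange(T)
        mask = (idx.unsqueeze(0) // self.block) == (idx.unsqueeze(1) // self.block)
        attn = attn.masked_fill(~mask, float('-inf')).softmax(dim=-1)
        ctx = attn @ frames                           # locally attended frame context
        return self.out(ctx).squeeze(-1)              # (T,) importance scores in [0, 1]

scores = BlockDiagonalAttentionScorer()(torch.randn(120, 1024))
```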
