Co-Regularized Deep Representations for Video Summarization
Compact keyframe-based video summaries are a popular way of generating
viewership on video sharing platforms. Yet, creating relevant and compelling
summaries for arbitrarily long videos with a small number of keyframes is a
challenging task. We propose a comprehensive keyframe-based summarization
framework combining deep convolutional neural networks and restricted Boltzmann
machines. An original co-regularization scheme is used to discover meaningful
subject-scene associations. The resulting multimodal representations are then
used to select highly-relevant keyframes. A comprehensive user study is
conducted comparing our proposed method to a variety of schemes, including the
summarization currently in use by one of the most popular video sharing
websites. The results show that our method consistently outperforms the
baseline schemes for any given amount of keyframes both in terms of
attractiveness and informativeness. The lead is even more significant for
smaller summaries.

Comment: Video summarization, deep convolutional neural networks,
co-regularized restricted Boltzmann machine
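The abstract does not spell out the co-regularization objective, but the core idea of tying per-frame "subject" and "scene" representations together, then selecting keyframes from the resulting joint representation, can be sketched with a toy linear version. All array sizes, the random stand-in descriptors, the agreement penalty, and the centroid-based scoring below are illustrative assumptions, not the paper's actual CNN/RBM architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: per-frame "subject" and "scene" descriptors,
# e.g. activations from two pretrained network branches.
n_frames, d_subj, d_scene, d_joint = 200, 64, 48, 16
subj = rng.normal(size=(n_frames, d_subj))
scene = rng.normal(size=(n_frames, d_scene))

# Linear projections into a shared space, learned with a co-regularization
# objective: the two views of the same frame should agree.
W_s = rng.normal(scale=0.1, size=(d_subj, d_joint))
W_c = rng.normal(scale=0.1, size=(d_scene, d_joint))

lam, lr = 1e-3, 1e-3  # ridge weight and step size (illustrative values)
d0 = np.linalg.norm(subj @ W_s - scene @ W_c)  # initial disagreement
for _ in range(200):
    zs, zc = subj @ W_s, scene @ W_c
    diff = zs - zc                       # disagreement between the two views
    grad_s = subj.T @ diff + lam * W_s   # d/dW_s of 0.5*||zs - zc||^2 + ridge
    grad_c = -scene.T @ diff + lam * W_c
    W_s -= lr * grad_s
    W_c -= lr * grad_c
d1 = np.linalg.norm(subj @ W_s - scene @ W_c)  # disagreement after training

# Multimodal representation = average of the two co-regularized views.
z = 0.5 * (subj @ W_s + scene @ W_c)

# Score frames by closeness to the video-level centroid and keep the top k.
centroid = z.mean(axis=0)
scores = -np.linalg.norm(z - centroid, axis=1)
k = 5
keyframes = np.argsort(scores)[-k:][::-1]
```

Gradient descent on the shared-space disagreement drives the two modality views toward consensus; the averaged representation then provides a single relevance score per frame.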
Simulating emotional reactions in medical dramas
Presenting information on emotionally charged topics is a delicate task: if bare facts alone are conveyed, there is a risk of boring the audience, or coming across as cold and unfeeling; on the other hand, emotional presentation can be appropriate when carefully handled, but when overdone or mishandled risks being perceived as patronising or in poor taste. When Natural Language Generation (NLG) systems present emotionally charged information linguistically, by generating scripts for embodied agents, emotional/affective aspects cannot be ignored. It is important to ensure that viewers consider the presentation appropriate and sympathetic.
We are investigating the role of affect in communicating medical information in the context of an NLG system that generates short medical dramas enacted by embodied agents. The dramas have both an informational and an educational purpose in that they help patients review their medical histories whilst receiving explanations of less familiar medical terms and demonstrations of their usage. The dramas are also personalised since they are generated from the patients' own medical records. We view generation of natural/appropriate emotional language as a way to engage and maintain the viewers' attention. For our medical setting, we hypothesize that viewers will consider dialogues more natural when they have an enthusiastic and sympathetic emotional tone. Our second hypothesis proposes that such dialogues are also better for engaging the viewers' attention.
As well as describing our NLG system for generating natural emotional language in medical dialogue, we present a pilot study with which we investigate our two hypotheses. Our results were not quite as unequivocal as we had hoped. Our participants did notice whether a character sympathised with the patient and was enthusiastic; this did not, however, lead them to judge such a character as behaving more naturally or the dialogue as being more engaging. However, when pooling data from our two conditions (dialogues with versus without emotionally appropriate language use), we discovered, somewhat surprisingly, that participants did consider a dialogue more engaging if they believed that the characters showed sympathy towards the patient, were not cold and unfeeling, and were natural (true for the female agent only).
Hierarchical3D Adapters for Long Video-to-text Summarization
In this paper, we focus on video-to-text summarization and investigate how to
best utilize multimodal information for summarizing long inputs (e.g., an
hour-long TV show) into long outputs (e.g., a multi-sentence summary). We
extend SummScreen (Chen et al., 2021), a dialogue summarization dataset
consisting of transcripts of TV episodes with reference summaries, and create a
multimodal variant by collecting corresponding full-length videos. We
incorporate multimodal information into a pre-trained textual summarizer
efficiently using adapter modules augmented with a hierarchical structure while
tuning only 3.8% of model parameters. Our experiments demonstrate that
multimodal information offers superior performance over more memory-heavy and
fully fine-tuned textual summarization methods.
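A bottleneck adapter of the general kind used for parameter-efficient tuning can be sketched as follows. The layer sizes, the zero-initialised up-projection, and the frozen-model parameter count are illustrative assumptions, not the paper's actual Hierarchical3D configuration:

```python
import numpy as np

def adapter(x, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up

# Hypothetical sizes (not the paper's configuration):
d_model, r = 1024, 64          # transformer width and adapter bottleneck
n_layers = 12
adapter_params = n_layers * 2 * d_model * r   # one down + one up matrix per layer
frozen_params = 140_000_000    # illustrative size of the frozen summarizer
frac = adapter_params / (frozen_params + adapter_params)

# With the up-projection zero-initialised, the adapter starts as the identity,
# so inserting it leaves the pretrained model's behaviour unchanged at step 0.
x = np.random.default_rng(1).normal(size=(4, d_model))
W_down = np.random.default_rng(2).normal(scale=0.02, size=(d_model, r))
W_up = np.zeros((r, d_model))
y = adapter(x, W_down, W_up)
```

Only `W_down` and `W_up` would be trained, which is why the tunable fraction stays in the low single-digit percent range while the backbone remains frozen.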