Hierarchical3D Adapters for Long Video-to-text Summarization
In this paper, we focus on video-to-text summarization and investigate how to
best utilize multimodal information for summarizing long inputs (e.g., an
hour-long TV show) into long outputs (e.g., a multi-sentence summary). We
extend SummScreen (Chen et al., 2021), a dialogue summarization dataset
consisting of transcripts of TV episodes with reference summaries, and create a
multimodal variant by collecting corresponding full-length videos. We
incorporate multimodal information into a pre-trained textual summarizer
efficiently using adapter modules augmented with a hierarchical structure while
tuning only 3.8% of model parameters. Our experiments demonstrate that
multimodal information offers superior performance over more memory-heavy and
fully fine-tuned textual summarization methods.
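As a rough illustration of the parameter-efficient idea (a sketch, not the authors' architecture), a bottleneck adapter adds a small residual module inside each frozen transformer layer, so that only the adapter weights are trained; the sizes and the GELU activation are assumptions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted after a frozen sub-layer."""
    def __init__(self, d_model=1024, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.GELU()

    def forward(self, hidden):
        # Residual connection preserves the frozen backbone's output.
        return hidden + self.up(self.act(self.down(hidden)))

def trainable_fraction(model):
    """Fraction of parameters with requires_grad=True (e.g. ~0.038)."""
    total = sum(p.numel() for p in model.parameters())
    tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return tuned / total
```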
A System for Summarizing Multimodal Information
This paper considers the process of summarizing multimodal information, where the source information may be presented as text, an image, audio, or video. A model of the summarization process is proposed, improved by the introduction of an additional stage that makes it possible to process multimodal information. The paper presents the algorithm by which the summarization system operates at this stage, together with a model of the process that converts the source document into the system's internal format.
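A hedged sketch of what such an additional conversion stage might look like; the converter functions below are hypothetical placeholders, not components of the described system:

```python
from dataclasses import dataclass

@dataclass
class InternalDocument:
    modality: str
    text: str  # textual content extracted from the source

def ocr(image_path):
    # Placeholder: a real system would call an OCR engine here.
    return "<text recognized in the image>"

def transcribe(media_path):
    # Placeholder: a real system would call a speech recognizer here.
    return "<transcript of the audio track>"

def to_internal_format(source, modality):
    # The additional stage: map any supported modality to one format
    # that the downstream summarizer can consume.
    converters = {
        "text": lambda s: s,
        "image": ocr,
        "audio": transcribe,
        "video": transcribe,
    }
    return InternalDocument(modality, converters[modality](source))
```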
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance in terms of both quantitative measures and a user study.
Comment: Published in IEEE Transactions on Multimedia.
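A minimal sketch of a residual bidirectional recurrent encoder in the spirit of the context-aware embedding framework; the GRU cell, layer sizes, and the output projection are assumptions rather than the published configuration:

```python
import torch.nn as nn

class ResidualBiRNN(nn.Module):
    def __init__(self, d_in=512, d_hidden=256):
        super().__init__()
        self.rnn = nn.GRU(d_in, d_hidden, bidirectional=True,
                          batch_first=True)
        self.proj = nn.Linear(2 * d_hidden, d_in)

    def forward(self, x):
        # x: (batch, time, d_in) sequence of clip features.
        ctx, _ = self.rnn(x)       # context from past and future frames
        return x + self.proj(ctx)  # residual keeps the local appearance
```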
Query-Based Summarization using Rhetorical Structure Theory
Research on Question Answering is focused mainly on classifying the question type and finding
the answer. Presenting the answer in a way that suits the user’s needs has received little
attention. This paper shows how existing question answering systems—which aim at finding
precise answers to questions—can be improved by exploiting summarization techniques to extract
more than just the answer from the document in which the answer resides. This is done
using a graph search algorithm which searches for relevant sentences in the discourse structure,
which is represented as a graph. The Rhetorical Structure Theory (RST) is used to create a
graph representation of a text document. The output is an extensive answer, which not only
answers the question, but also gives the user an opportunity to assess the accuracy of the answer
(is this what I am looking for?), and to find additional information that is related to the question,
and which may satisfy an information need. This has been implemented in a working multimodal
question answering system, where it operates with two independently developed
question answering modules.
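The search step can be pictured as a cheapest-path expansion from the sentence containing the answer outward over the discourse graph; the cost model below is an illustrative assumption, not the paper's exact algorithm:

```python
import heapq

def expand_answer(graph, cost, answer_node, budget=5):
    """graph: node -> iterable of neighbours linked by RST relations.
    cost: (u, v) -> penalty for following that relation (nucleus links
    might be cheap, satellite links expensive). Returns up to `budget`
    sentences around the answer."""
    selected = set()
    dist = {answer_node: 0.0}
    frontier = [(0.0, answer_node)]
    while frontier and len(selected) < budget:
        d, u = heapq.heappop(frontier)
        if u in selected:
            continue
        selected.add(u)
        for v in graph.get(u, ()):
            nd = d + cost.get((u, v), 1.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(frontier, (nd, v))
    return selected
```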
D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
The many-to-many multimodal summarization (M3S) task aims to generate summaries in any language from document inputs in any language and the corresponding image sequence; it essentially comprises the multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS, and both have attracted increasing attention in recent years, little research pays attention to the M3S task. Moreover, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS, or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicitly complex training objectives. In this paper, we first introduce the general and practical M3S task. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M3S task. Specifically, the dual knowledge distillation method ensures that the knowledge of MMS and MXLS is transferred in both directions, so that each task benefits the other. To obtain target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed to discard irrelevant visual information. Extensive experiments in the many-to-many setting show the effectiveness of the proposed approach. Additionally, we will contribute a many-to-many multimodal summarization (M3Sum) dataset.
Comment: Findings of EMNLP 2023.
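A hedged sketch of the two training signals, assuming pooled decoder logits and image/summary embeddings with in-batch negatives; temperatures, shapes, and the symmetric weighting are illustrative, not the published configuration:

```python
import torch
import torch.nn.functional as F

def dual_kd_loss(mms_logits, mxls_logits, tau=1.0):
    """Symmetric distillation: each decoder treats the other as a
    fixed teacher, so knowledge flows in both directions."""
    log_p_mms = F.log_softmax(mms_logits / tau, dim=-1)
    log_p_mxls = F.log_softmax(mxls_logits / tau, dim=-1)
    return 0.5 * (
        F.kl_div(log_p_mms, log_p_mxls.exp().detach(),
                 reduction="batchmean")
        + F.kl_div(log_p_mxls, log_p_mms.exp().detach(),
                   reduction="batchmean"))

def target_oriented_contrastive(vis, summ, tau=0.07):
    """vis, summ: (batch, d) pooled visual / target-summary embeddings.
    Pulls each image sequence toward its own summary and away from the
    other summaries in the batch."""
    vis = F.normalize(vis, dim=-1)
    summ = F.normalize(summ, dim=-1)
    logits = vis @ summ.t() / tau
    labels = torch.arange(vis.size(0), device=vis.device)
    return F.cross_entropy(logits, labels)
```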
Interactive Search and Exploration in Online Discussion Forums Using Multimodal Embeddings
In this paper we present a novel interactive multimodal learning system,
which facilitates search and exploration in large networks of social multimedia
users. It allows the analyst to identify and select users of interest, and to
find similar users in an interactive learning setting. Our approach is based on
novel multimodal representations of users, words and concepts, which we
simultaneously learn by deploying a general-purpose neural embedding model. We
show these representations to be useful not only for categorizing users, but
also for automatically generating user and community profiles. Inspired by
traditional summarization approaches, we create the profiles by selecting
diverse and representative content from all available modalities, i.e. the
text, image and user modality. The usefulness of the approach is evaluated
using artificial actors, which simulate user behavior in a relevance feedback
scenario. Multiple experiments were conducted in order to evaluate the quality
of our multimodal representations, to compare different embedding strategies,
and to determine the importance of different modalities. We demonstrate the
capabilities of the proposed approach on two different multimedia collections
originating from the violent online extremism forum Stormfront and the
microblogging platform Twitter, which are particularly interesting due to the
high semantic level of the discussions they feature.
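Selecting diverse yet representative content from learned embeddings is often done with a greedy maximal-marginal-relevance pass; the sketch below shows that flavour of profile building and is an assumption, not the paper's exact selection rule:

```python
import numpy as np

def build_profile(item_vecs, user_vec, k=10, lam=0.7):
    """item_vecs: (n, d) L2-normalised item embeddings (texts, images,
    users); user_vec: (d,) embedding of the profiled user."""
    rel = item_vecs @ user_vec                 # representativeness
    chosen = [int(np.argmax(rel))]
    while len(chosen) < min(k, len(item_vecs)):
        sim = item_vecs @ item_vecs[chosen].T  # similarity to picks
        score = lam * rel - (1 - lam) * sim.max(axis=1)
        score[chosen] = -np.inf                # never re-pick an item
        chosen.append(int(np.argmax(score)))
    return chosen
```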
A novel user-centered design for personalized video summarization
In the past, several automatic video summarization systems have been proposed to generate video summaries. However, a generic summary that is generated based only on audio, visual, and textual saliency will not satisfy every user. This paper proposes a novel system for generating semantically meaningful personalized video summaries, which are tailored to the individual user's preferences over video semantics. Each video shot is represented using a semantic multinomial, a vector of posterior semantic concept probabilities. The proposed system stitches together a video summary based on the summary time span and the top-ranked shots that are semantically relevant to the user's preferences. The proposed summarization system is evaluated using both quantitative and subjective evaluation metrics, and the experimental results on its performance are encouraging.
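A minimal sketch of the ranking idea, assuming per-shot concept posteriors (the semantic multinomials) and a user preference vector over the same concepts; the greedy fill to the requested time span is illustrative:

```python
import numpy as np

def personalized_summary(shot_concepts, shot_lengths, user_prefs, span):
    """shot_concepts: (n, c) rows of posterior concept probabilities;
    shot_lengths: (n,) shot durations in seconds; user_prefs: (c,)
    weights over the same concepts; span: summary budget in seconds."""
    scores = shot_concepts @ user_prefs        # semantic relevance
    summary, used = [], 0.0
    for i in np.argsort(-scores):              # best shots first
        if used + shot_lengths[i] <= span:
            summary.append(int(i))
            used += shot_lengths[i]
    return sorted(summary)                     # restore temporal order
```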