20 research outputs found
Hierarchical3D Adapters for Long Video-to-text Summarization
In this paper, we focus on video-to-text summarization and investigate how to
best utilize multimodal information for summarizing long inputs (e.g., an
hour-long TV show) into long outputs (e.g., a multi-sentence summary). We
extend SummScreen (Chen et al., 2021), a dialogue summarization dataset
consisting of transcripts of TV episodes with reference summaries, and create a
multimodal variant by collecting corresponding full-length videos. We
incorporate multimodal information into a pre-trained textual summarizer
efficiently using adapter modules augmented with a hierarchical structure while
tuning only 3.8% of model parameters. Our experiments demonstrate that
multimodal information offers superior performance over more memory-heavy and
fully fine-tuned textual summarization methods.
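As a loose illustration of the parameter-efficient idea behind adapter tuning (a minimal PyTorch sketch, not the paper's hierarchical 3D adapter; `hidden_dim` and `bottleneck_dim` are assumed sizes), one can insert small bottleneck modules into a frozen backbone so that only a small fraction of parameters is trained:

```python
# Minimal bottleneck-adapter sketch (assumption: standard residual adapter;
# the paper's hierarchical variant adds structure over long video segments).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's representation intact.
        return x + self.up(self.act(self.down(x)))

def freeze_backbone_except_adapters(model: nn.Module, adapters: list[nn.Module]) -> None:
    # Freeze the pre-trained summarizer; only adapter weights stay trainable,
    # which is how only a few percent of parameters end up being tuned.
    for p in model.parameters():
        p.requires_grad = False
    for adapter in adapters:
        for p in adapter.parameters():
            p.requires_grad = True
```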
DTV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
The many-to-many multimodal summarization (MS) task aims to generate
summaries in any language with document inputs in any language and the
corresponding image sequence, which essentially comprises multimodal
monolingual summarization (MMS) and multimodal cross-lingual summarization
(MXLS) tasks. Although much work has been devoted to either MMS or MXLS and has
obtained increasing attention in recent years, little research pays attention
to the MS task. Besides, existing studies mainly focus on 1) utilizing MMS
to enhance MXLS via knowledge distillation without considering the performance
of MMS or 2) improving MMS models by filtering summary-unrelated visual
features with implicit learning or explicitly complex training objectives. In
this paper, we first introduce a general and practical task, i.e., MS.
Further, we propose a dual knowledge distillation and target-oriented vision
modeling framework for the MS task. Specifically, the dual knowledge
distillation method guarantees that the knowledge of MMS and MXLS can be
transferred to each other and thus mutually promote both of them. To offer
target-oriented visual features, a simple yet effective target-oriented
contrastive objective is designed to discard needless
visual information. Extensive experiments on the many-to-many setting show the
effectiveness of the proposed approach. Additionally, we will contribute a
many-to-many multimodal summarization (MSum) dataset. Comment: EMNLP 2023 Findings
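The abstract only names the two ingredients; the following is a hedged PyTorch sketch of how a symmetric (dual) distillation term and a target-oriented contrastive term could look. The names `mms_logits` and `mxls_logits` and the InfoNCE formulation are assumptions, not the paper's exact objectives:

```python
# Hedged sketch of the two loss ingredients named in the abstract;
# the formulation in the paper may differ.
import torch
import torch.nn.functional as F

def dual_distillation(mms_logits: torch.Tensor, mxls_logits: torch.Tensor) -> torch.Tensor:
    # Symmetric KL: each task's decoder distribution teaches the other,
    # so MMS and MXLS can mutually promote one another.
    p = F.log_softmax(mms_logits, dim=-1)
    q = F.log_softmax(mxls_logits, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))

def target_oriented_contrastive(visual: torch.Tensor, summary: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # InfoNCE-style objective: the visual features of an example should match
    # its own target-summary representation rather than other summaries in the
    # batch, discouraging summary-unrelated visual information.
    v = F.normalize(visual, dim=-1)   # (batch, dim)
    s = F.normalize(summary, dim=-1)  # (batch, dim)
    logits = v @ s.t() / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)
```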
Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning
Developing artificial learning systems that can understand and generate natural language has been one of the long-standing goals of artificial intelligence. Recent decades have witnessed impressive progress on both of these problems, giving rise to a new family of approaches. In particular, advances in deep learning over the past couple of years have led to neural approaches to natural language generation (NLG). These methods combine generative language learning techniques with neural-network-based frameworks. With a wide range of applications in natural language processing, neural NLG (NNLG) is a new and fast-growing field of research. In this state-of-the-art report, we investigate the recent developments and applications of NNLG in its full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability and learning strategies. We summarize the fundamental building blocks of NNLG approaches from these aspects and provide detailed reviews of commonly used preprocessing steps and basic neural architectures. This report also focuses on the seminal applications of these NNLG models such as machine translation, description generation, automatic speech recognition, abstractive summarization, text simplification, question answering and generation, and dialogue generation. Finally, we conclude with a thorough discussion of the described frameworks by pointing out some open research directions.

This work has been partially supported by the European Commission ICT COST Action "Multi-task, Multilingual, Multi-modal Language Generation" (CA18231). AE was supported by the BAGEP 2021 Award of the Science Academy. EE was supported in part by the TUBA GEBIP 2018 Award. BP is in part funded by Independent Research Fund Denmark (DFF) grant 9063-00077B. IC has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 838188. EL is partly funded by Generalitat Valenciana and the Spanish Government through projects PROMETEU/2018/089 and RTI2018-094649-B-I00, respectively. SMI is partly funded by UNIRI project uniri-drustv-18-20. GB is partly supported by the Ministry of Innovation and the National Research, Development and Innovation Office within the framework of the Hungarian Artificial Intelligence National Laboratory Programme. COT is partially funded by the Romanian Ministry of European Investments and Projects through the Competitiveness Operational Program (POC) project "HOLOTRAIN" (grant no. 29/221 ap2/07.04.2020, SMIS code: 129077) and by the German Academic Exchange Service (DAAD) through the project "AWAKEN: content-Aware and netWork-Aware faKE News mitigation" (grant no. 91809005). ESA is partially funded by the German Academic Exchange Service (DAAD) through the project "Deep-Learning Anomaly Detection for Human and Automated Users Behavior" (grant no. 91809358).
Evaluating and Improving Factuality in Multimodal Abstractive Summarization
Current metrics for evaluating factuality for abstractive document
summarization have achieved high correlations with human judgment, but they do
not account for the vision modality and thus are not adequate for
vision-and-language summarization. We propose CLIPBERTScore, a simple weighted
combination of CLIPScore and BERTScore that leverages their robustness and
strong factuality detection performance on image-summary and document-summary pairs,
respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate
the quality of multimodal factuality metrics, we collect human judgments of
factuality with respect to documents and images. We show that this simple
combination of two metrics in the zero-shot setting achieves higher
correlations than existing factuality metrics for document summarization,
outperforms an existing multimodal summarization metric, and performs
competitively with strong multimodal factuality metrics specifically fine-tuned
for the task. Our thorough analysis demonstrates the robustness and high
correlation of CLIPBERTScore and its components on four factuality
metric-evaluation benchmarks. Finally, we demonstrate two practical downstream
applications of our CLIPBERTScore metric: for selecting important images to
focus on during training, and as a reward for reinforcement learning to improve
factuality of multimodal summary generation w.r.t. automatic and human
evaluation. Our data and code are publicly available at
https://github.com/meetdavidwan/faithful-multimodal-summ
Comment: EMNLP 2022 (17 pages)
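As a rough illustration of the "simple weighted combination" described above (a hedged sketch, not the authors' released implementation; the weight `alpha` and the function name are assumptions), the metric can be written as a convex combination of the two component scores:

```python
# Minimal sketch of a weighted combination of CLIPScore and BERTScore.
# alpha is an assumed interpolation weight; see the linked repository for
# the actual CLIPBERTScore implementation.
def clipbertscore(clip_score: float, bert_score: float, alpha: float = 0.5) -> float:
    # CLIPScore judges image-summary consistency, BERTScore judges
    # document-summary consistency; the metric blends the two linearly.
    return alpha * clip_score + (1.0 - alpha) * bert_score

# Example usage with hypothetical component scores:
# clipbertscore(clip_score=0.72, bert_score=0.88, alpha=0.5) -> 0.80
```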
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
Video summarization intends to produce a concise video summary by effectively
capturing and combining the most informative parts of the whole content.
Existing approaches for video summarization regard the task as a frame-wise
keyframe selection problem and generally construct the frame-wise
representation by combining the long-range temporal dependency with the
unimodal or bimodal information. However, an optimal video summary needs to
reflect keyframes that are both valuable for their own information and
representative of the semantics of the whole content. Thus, it is critical to construct a more
powerful and robust frame-wise representation and predict the frame-level
importance score in a fair and comprehensive manner. To tackle the above
issues, we propose a multimodal hierarchical shot-aware convolutional network,
denoted as MHSCNet, to enhance the frame-wise representation via combining the
comprehensive available multimodal information. Specifically, we design a
hierarchical ShotConv network to incorporate the adaptive shot-aware
frame-level representation by considering the short-range and long-range
temporal dependency. Based on the learned shot-aware representations, MHSCNet
can predict the frame-level importance score in the local and global view of
the video. Extensive experiments on two standard video summarization datasets
demonstrate that our proposed method consistently outperforms state-of-the-art
baselines. Source code will be made publicly available.
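As a rough sketch of frame-level importance scoring that mixes a local (shot-level) view with a global context vector, the module below is an assumption-laden simplification in PyTorch, not the authors' MHSCNet; the dimensions and the mean-pooled global representation are illustrative choices:

```python
# Hedged sketch: score each frame using short-range (convolutional) context
# plus a global video representation, then squash to [0, 1] importance scores.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, feat_dim: int = 1024, kernel_size: int = 5):
        super().__init__()
        # Local view: a 1D convolution over neighbouring frames approximates
        # short-range, shot-level temporal context.
        self.local = nn.Conv1d(feat_dim, feat_dim, kernel_size, padding=kernel_size // 2)
        # Scoring head over concatenated local and global features.
        self.score = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim) fused multimodal frame features.
        local = self.local(frames.transpose(1, 2)).transpose(1, 2)
        global_ctx = frames.mean(dim=1, keepdim=True).expand_as(frames)
        combined = torch.cat([local, global_ctx], dim=-1)
        return torch.sigmoid(self.score(combined)).squeeze(-1)  # (batch, num_frames)
```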