Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs,
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th
International Conference on Computational Linguistics (COLING 2018). Please
refer to that version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
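As a rough illustration of the kinds of fusion strategies such a survey compares (the module names and dimensions below are illustrative, not taken from the paper), here is a minimal PyTorch sketch contrasting concatenation-based and gated combination of multimodal representations:

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Early fusion: concatenate modality embeddings, then project."""
    def __init__(self, text_dim=300, image_dim=512, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim + image_dim, out_dim)

    def forward(self, text_emb, image_emb):
        return self.proj(torch.cat([text_emb, image_emb], dim=-1))

class GatedFusion(nn.Module):
    """Gated fusion: a learned gate decides how much each modality contributes."""
    def __init__(self, text_dim=300, image_dim=512, out_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.image_proj = nn.Linear(image_dim, out_dim)
        self.gate = nn.Linear(text_dim + image_dim, out_dim)

    def forward(self, text_emb, image_emb):
        g = torch.sigmoid(self.gate(torch.cat([text_emb, image_emb], dim=-1)))
        return g * self.text_proj(text_emb) + (1 - g) * self.image_proj(image_emb)

# Example: fuse a batch of 4 text/image embedding pairs.
text, image = torch.randn(4, 300), torch.randn(4, 512)
print(ConcatFusion()(text, image).shape)  # torch.Size([4, 256])
print(GatedFusion()(text, image).shape)   # torch.Size([4, 256])
```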
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last
decades naturally lend themselves to automatic processing and exploration.
Research efforts seeking to automatically process facsimiles and extract
information from them are multiplying, with document layout analysis as a
first, essential step. While the identification and categorization of segments
of interest in document images have seen significant progress in recent years
thanks to deep learning techniques, many challenges remain, among them the use
of finer-grained segmentation typologies and the handling of complex,
heterogeneous documents such as historical newspapers. Moreover, most
approaches consider visual features only, ignoring textual signal. In this
context, we introduce a multimodal approach for the semantic segmentation of
historical newspapers that combines visual and textual features. Based on a
series of experiments on diachronic Swiss and Luxembourgish newspapers, we
investigate, among others, the predictive power of visual and textual features
and their capacity to generalize across time and sources. Results show
consistent improvements of multimodal models over a strong visual baseline, as
well as better robustness to high material variance.
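A minimal sketch of how textual features can be fused with visual feature maps for this kind of segmentation (assuming, purely for illustration, a per-pixel text-embedding map built by rasterizing OCR word embeddings onto the page; none of these names come from the paper):

```python
import torch
import torch.nn as nn

class MultimodalSegHead(nn.Module):
    """Fuse a visual feature map with a per-pixel text-embedding map
    (e.g., OCR word embeddings rasterized onto the page) for segmentation."""
    def __init__(self, vis_channels=64, text_channels=300, num_classes=5):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(vis_channels + text_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, kernel_size=1),  # per-pixel class logits
        )

    def forward(self, vis_feats, text_feats):
        # Both inputs: (batch, channels, H, W) at the same spatial resolution.
        return self.fuse(torch.cat([vis_feats, text_feats], dim=1))

# Example: one 128x128 page crop.
vis = torch.randn(1, 64, 128, 128)    # visual backbone features
txt = torch.randn(1, 300, 128, 128)   # rasterized word-embedding map
logits = MultimodalSegHead()(vis, txt)
print(logits.shape)  # torch.Size([1, 5, 128, 128])
```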
Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for Multi-Modal Fact Verification
Multi-modal fact verification has become an important but challenging task on
social media, since misinformed news content often exhibits a mismatch between
its text and images; in recent years this has been addressed by reasoning
across modalities to assess the veracity of news. In this paper, we propose the
Pre-CoFactv2 framework with new parameter-efficient foundation models for
modeling fine-grained text and input embeddings with lightweight parameters,
multi-modal multi-type fusion for capturing relations not only within and
across modalities but also across input types (i.e., claim and document), and
feature representations for explicitly providing metadata for each sample. In
addition, we introduce a unified ensemble method that boosts model performance
by adjusting the importance of each trained model with not only a weight but
also a power. Extensive experiments show that Pre-CoFactv2 outperforms
Pre-CoFact by a large margin and achieves new state-of-the-art results at the
Factify challenge at AAAI 2023. We further examine model variations to verify
the relative contributions of different components. Our team won the first
prize (F1-score: 81.82%), and our code is publicly available at
https://github.com/wwweiwei/Pre-CoFactv2-AAAI-2023.
Comment: AAAI-23 DeFactify 2 Workshop (1st Prize)
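The abstract does not spell out the "weights and powers" formulation, but one plausible reading is combining each model's class probabilities p_i as a weighted sum of powered distributions, sum_i w_i * p_i^a_i. The sketch below implements that reading; all names and numbers are hypothetical:

```python
import numpy as np

def weighted_power_ensemble(probs, weights, powers):
    """Combine per-model class probabilities.

    probs:   (num_models, num_classes) predicted probabilities
    weights: importance weight per trained model
    powers:  exponent per trained model (a power < 1 softens a model's
             confidence, a power > 1 sharpens it)
    """
    combined = sum(w * p ** a for w, p, a in zip(weights, probs, powers))
    return combined / combined.sum()  # renormalize to a distribution

# Example: three models voting over {support, neutral, refute}.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.6, 0.1, 0.3]])
print(weighted_power_ensemble(probs, weights=[0.5, 0.3, 0.2],
                              powers=[1.0, 2.0, 0.5]))
```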
Multi-Modal Discussion Transformer: Integrating Text, Images and Graph Transformers to Detect Hate Speech on Social Media
We present the Multi-Modal Discussion Transformer (mDT), a novel multi-modal
graph-based transformer model for detecting hate speech in online social
networks. In contrast to traditional text-only methods, our approach to
labelling a comment as hate speech centers on the holistic analysis of text
and images. This is done by leveraging graph transformers to capture the
contextual relationships in the entire discussion that surrounds a comment,
with interwoven fusion layers to combine text and image embeddings instead of
processing different modalities separately. We compare the performance of our
model to baselines that only process text; we also conduct extensive ablation
studies. We conclude by outlining future work for multimodal solutions that
deliver social value in online contexts, arguing that capturing a holistic view
of a conversation greatly advances the effort to detect anti-social behavior.
Comment: Under Submission