Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples
Machine Learning has been a big success story during the AI resurgence. One
particular standout success relates to learning from massive amounts of data.
In spite of early assertions of the unreasonable effectiveness of data, there
is increasing recognition of the value of utilizing knowledge whenever it is
available or
can be created purposefully. In this paper, we discuss the indispensable role
of knowledge for deeper understanding of content where (i) large amounts of
training data are unavailable, (ii) the objects to be recognized are complex
(e.g., implicit entities and highly subjective content), and (iii) applications
need to use complementary or related data in multiple modalities/media. What
brings us to the cusp of rapid progress is our ability to (a) create relevant
and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP
techniques. Using diverse examples, we seek to foretell unprecedented progress
in our ability for deeper understanding and exploitation of multimodal data and
continued incorporation of knowledge in learning techniques.

Comment: Pre-print of the paper accepted at the 2017 IEEE/WIC/ACM
International Conference on Web Intelligence (WI). arXiv admin note:
substantial text overlap with arXiv:1610.0770
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
Multimodal information extraction, which requires aggregating representations
from different modalities, has been attracting growing research attention. In this
paper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM)
method for this task, which contains two modules. Firstly, the intra-sample
relationship modeling module operates on a single sample and aims to learn
effective representations. Embeddings from textual and visual modalities are
shifted to bridge the modality gap caused by distinct pre-trained language and
image models. Secondly, the inter-sample relationship modeling module considers
relationships among multiple samples and focuses on capturing the interactions.
An AttnMixup strategy is proposed, which not only enables collaboration among
samples but also augments data to improve generalization. We conduct extensive
experiments on the multimodal named entity recognition datasets Twitter-2015
and Twitter-2017, and the multimodal relation extraction dataset MNRE. Our
proposed method I2SRM achieves competitive results: 77.12% F1-score on
Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
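The abstract names the AttnMixup strategy without detailing it. As a rough
illustration, the sketch below lets fused sample representations in a batch
attend to one another and then interpolates each sample with its
attention-weighted blend, a mixup-style augmentation; the module name, the
single mix_ratio weight, and the exact attention form are our assumptions,
not the authors' implementation.

```python
# Hypothetical sketch of an AttnMixup-style inter-sample module (PyTorch).
# The real I2SRM formulation may differ; this only illustrates samples in a
# batch attending to each other, with the results mixed back in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnMixup(nn.Module):
    def __init__(self, dim: int, mix_ratio: float = 0.5):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.mix_ratio = mix_ratio  # assumed fixed interpolation weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) -- one fused text+image representation per sample.
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.t() / x.size(-1) ** 0.5, dim=-1)  # (batch, batch)
        mixed = attn @ v  # attention-weighted blend over the batch
        # Mixup-style interpolation between each sample and its blend.
        return (1 - self.mix_ratio) * x + self.mix_ratio * mixed

reps = torch.randn(8, 256)        # e.g. fused multimodal features
out = AttnMixup(dim=256)(reps)
print(out.shape)                  # torch.Size([8, 256])
```

In practice the mixing coefficient could instead be sampled from a Beta
distribution, as in standard mixup, to get a stochastic augmentation.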
Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification
National Research Foundation (NRF) Singapore
Enhancing Multimodal Entity and Relation Extraction with Variational Information Bottleneck
This paper studies the multimodal named entity recognition (MNER) and
multimodal relation extraction (MRE), which are important for multimedia social
platform analysis. The core of MNER and MRE lies in incorporating evident
visual information to enhance textual semantics, where two issues inherently
demand investigation. The first issue is modality-noise, where
task-irrelevant information in each modality may act as noise that misleads
the task prediction. The second issue is modality-gap, where representations
from different modalities are inconsistent, preventing the model from
building semantic alignment between the text and image. To address these
issues, we propose a
novel method for MNER and MRE by Multi-Modal representation learning with
Information Bottleneck (MMIB). For the first issue, a refinement-regularizer
draws on the information-bottleneck principle to balance the predictive evidence
and noisy information, yielding expressive representations for prediction. For
the second issue, an alignment-regularizer is proposed, where a mutual
information-based item works in a contrastive manner to regularize the
consistent text-image representations. To the best of our knowledge, we are
the first to explore variational IB estimation for MNER and MRE. Experiments
show that MMIB achieves state-of-the-art performance on three public
benchmarks.
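The abstract describes the alignment-regularizer only as a mutual
information-based item working in a contrastive manner. A common choice for
such a term is the symmetric InfoNCE bound, sketched below for paired text
and image representations; reading the regularizer as InfoNCE, and the
temperature value, are our assumptions.

```python
# Sketch of an InfoNCE-style contrastive alignment term between paired text
# and image representations; MMIB's actual regularizer may differ.
import torch
import torch.nn.functional as F

def alignment_loss(text: torch.Tensor, image: torch.Tensor, tau: float = 0.07):
    # text, image: (batch, dim); row i of each belongs to the same sample.
    t = F.normalize(text, dim=-1)
    v = F.normalize(image, dim=-1)
    logits = t @ v.t() / tau                      # temperature-scaled cosine sims
    targets = torch.arange(t.size(0), device=t.device)
    # Matched text-image pairs are positives, all other pairs negatives;
    # minimizing this maximizes a lower bound on their mutual information.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

text = torch.randn(16, 128)
image = torch.randn(16, 128)
print(alignment_loss(text, image).item())
```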
Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge
Multimodal Named Entity Recognition (MNER) on social media aims to enhance
textual entity prediction by incorporating image-based clues. Existing studies
mainly focus on maximizing the utilization of pertinent image information or
incorporating external knowledge from explicit knowledge bases. However, these
methods either neglect the necessity of providing the model with external
knowledge or suffer from high redundancy in the retrieved knowledge.
In this paper, we present PGIM -- a two-stage framework that aims to leverage
ChatGPT as an implicit knowledge base and enable it to heuristically generate
auxiliary knowledge for more efficient entity prediction. Specifically, PGIM
contains a Multimodal Similar Example Awareness module that selects suitable
examples from a small number of predefined artificial samples. These examples
are then integrated into a formatted prompt template tailored to the MNER
task, which guides ChatGPT to generate auxiliary refined knowledge. Finally,
the acquired
knowledge is integrated with the original text and fed into a downstream model
for further processing. Extensive experiments show that PGIM outperforms
state-of-the-art methods on two classic MNER datasets and exhibits stronger
robustness and generalization capability.

Comment: Accepted to Findings of EMNLP 202
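PGIM's example-selection module and prompt template are not spelled out in
the abstract. The sketch below selects the predefined samples most similar to
the input by embedding cosine similarity and assembles a knowledge-elicitation
prompt; the embedding source, the template wording, and the helper names
(select_examples, build_prompt) are all hypothetical, and no actual LLM API
call is shown.

```python
# Hypothetical sketch of PGIM-style prompt assembly: pick the predefined
# examples most similar to the input, then format a prompt for an LLM such
# as ChatGPT. Embeddings, names, and template are assumptions.
import numpy as np

def select_examples(query_emb: np.ndarray,
                    example_embs: np.ndarray,
                    examples: list[str],
                    k: int = 2) -> list[str]:
    # Cosine similarity between the query and each stored example.
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    top = np.argsort(e @ q)[::-1][:k]
    return [examples[i] for i in top]

def build_prompt(text: str, image_caption: str, demos: list[str]) -> str:
    demo_block = "\n\n".join(demos)
    return (f"{demo_block}\n\n"
            f"Text: {text}\nImage: {image_caption}\n"
            "Explain the likely named entities and any background knowledge "
            "that helps identify them:")

rng = np.random.default_rng(0)
examples = ["Text: ... Knowledge: ...", "Text: ... Knowledge: ...",
            "Text: ... Knowledge: ..."]
prompt = build_prompt("Messi joins Inter Miami.", "a man in a pink jersey",
                      select_examples(rng.normal(size=64),
                                      rng.normal(size=(3, 64)), examples))
print(prompt)
# The LLM's returned auxiliary knowledge would then be concatenated with the
# original text and fed to the downstream MNER model.
```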
Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction
How can we better extract entities and relations from text? Multimodal
extraction with images and text obtains more signals for entities and
relations and aligns them through graphs or hierarchical fusion, aiding
extraction. Despite attempts at various fusions, previous works have overlooked
many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes
innovative pre-training objectives for entity-object and relation-image
alignment, extracting objects from images and aligning them with entity and
relation prompts for soft pseudo-labels. These labels are used as
self-supervised signals for pre-training, enhancing the ability to extract
entities and relations. Experiments on three datasets show an average 3.41% F1
improvement over the prior SOTA. Additionally, our method is orthogonal to
previous multimodal fusions, and applying it on top of prior SOTA fusion
methods yields a further 5.47% F1 improvement.

Comment: Accepted to ACM Multimedia 202
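The abstract leaves the exact form of the soft pseudo-labels open. One
plausible reading, sketched below, scores detected-object features against
entity-prompt embeddings and softens the similarities with a temperature,
yielding targets for a self-supervised pre-training loss; the shapes, the
temperature, and the KL objective are our assumptions.

```python
# Sketch of soft pseudo-labels for entity-object alignment: score each
# detected object against entity-prompt embeddings and soften with a
# temperature. A self-supervised loss then matches model predictions to
# these soft targets. All shapes and hyperparameters are assumed.
import torch
import torch.nn.functional as F

def soft_pseudo_labels(obj_feats, prompt_embs, tau=0.1):
    # obj_feats: (num_objects, dim); prompt_embs: (num_entity_types, dim)
    o = F.normalize(obj_feats, dim=-1)
    p = F.normalize(prompt_embs, dim=-1)
    return F.softmax(o @ p.t() / tau, dim=-1)  # (num_objects, num_entity_types)

objs = torch.randn(5, 64)      # e.g. region features from an object detector
prompts = torch.randn(4, 64)   # e.g. PER/LOC/ORG/MISC prompt embeddings
targets = soft_pseudo_labels(objs, prompts)
# Pre-training objective: KL divergence between model logits and soft targets.
logits = torch.randn(5, 4)
loss = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
print(targets.shape, loss.item())
```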