Similarity Reasoning and Filtration for Image-Text Matching
Image-text matching plays a critical role in bridging vision and
language, and great progress has been made by exploiting the global alignment
between image and sentence, or local alignments between regions and words.
However, how to make the most of these alignments to infer more accurate
matching scores is still underexplored. In this paper, we propose a novel
Similarity Graph Reasoning and Attention Filtration (SGRAF) network for
image-text matching. Specifically, the vector-based similarity representations
are first learned to characterize the local and global alignments in a more
comprehensive manner, and then the Similarity Graph Reasoning (SGR) module
relying on one graph convolutional neural network is introduced to infer
relation-aware similarities with both the local and global alignments. The
Similarity Attention Filtration (SAF) module is further developed to integrate
these alignments effectively by selectively attending to the significant and
representative alignments while casting aside the interference of
non-meaningful alignments. We demonstrate the superiority of the proposed
method by achieving state-of-the-art performance on the Flickr30K and MSCOCO
datasets, and the good interpretability of the SGR and SAF modules with extensive
qualitative experiments and analyses.
Comment: 14 pages, 8 figures, Accepted by AAAI202
Context-Aware Embeddings for Automatic Art Analysis
Automatic art analysis aims to classify and retrieve artistic representations
from a collection of images by using computer vision and machine learning
techniques. In this work, we propose to enhance visual representations from
neural networks with contextual artistic information. Whereas visual
representations are able to capture information about the content and the style
of an artwork, our proposed context-aware embeddings additionally encode
relationships between different artistic attributes, such as author, school, or
historical period. We design two different approaches for using context in
automatic art analysis. In the first one, contextual data is obtained through a
multi-task learning model, in which several attributes are trained together to
find visual relationships between elements. In the second approach, context is
obtained through an art-specific knowledge graph, which encodes relationships
between artistic attributes. An exhaustive evaluation of both of our models on
several art analysis problems, such as author identification, type
classification, or cross-modal retrieval, shows that performance is improved by
up to 7.3% in art classification and 37.24% in retrieval when context-aware
embeddings are used.
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI
applications, with the semantic web community's exploration into multi-modal
dimensions unlocking new avenues for innovation. In this survey, we carefully
review over 300 articles, focusing on KG-aware research in two principal
aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal
tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into
the MMKG realm. We begin by defining KGs and MMKGs, then explore their
construction progress. Our review includes two primary task categories:
KG-aware multi-modal learning tasks, such as Image Classification and Visual
Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph
Completion and Entity Alignment, highlighting specific research trajectories.
For most of these tasks, we provide definitions, evaluation benchmarks, and
additionally outline essential insights for conducting relevant research.
Finally, we discuss current challenges and identify emerging trends, such as
progress in Large Language Modeling and Multi-modal Pre-training strategies.
This survey aims to serve as a comprehensive reference for researchers already
involved in or considering delving into KG and multi-modal learning research,
offering insights into the evolving landscape of MMKG research and supporting
future work.
Comment: Ongoing work; 41 pages (Main Text), 55 pages (Total), 11 Tables, 13
Figures, 619 citations; Paper list is available at
https://github.com/zjukg/KG-MM-Surve
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding
and reasoning across different modalities, has emerged as a pivotal area with
applications spanning from multimedia analysis to healthcare diagnostics. As
the deployment of AI systems becomes more ubiquitous, the demand for
transparency and comprehensibility in these systems' decision-making processes
has intensified. This survey delves into the realm of interpretable cross-modal
reasoning (I-CMR), where the objective is not only to achieve high predictive
performance but also to provide human-understandable explanations for the
results. This survey presents a comprehensive overview of the typical methods
with a three-level taxonomy for I-CMR. Furthermore, this survey reviews the
existing CMR datasets with annotations for explanations. Finally, this survey
summarizes the challenges for I-CMR and discusses potential future directions.
In conclusion, this survey aims to catalyze the progress of this emerging
research area by providing researchers with a panoramic and comprehensive
perspective, illuminating the state of the art and discerning the
opportunities
Cycle-Consistent Deep Generative Hashing for Cross-Modal Retrieval
In this paper, we propose a novel deep generative approach to cross-modal
retrieval to learn hash functions in the absence of paired training samples
through the cycle consistency loss. Our proposed approach employs an adversarial
training scheme to learn a pair of hash functions enabling translation between
modalities while assuming the underlying semantic relationship. To endow the
hash codes with the semantics of the input-output pair, a cycle consistency loss is
further imposed on the adversarial training to strengthen the correlations
between inputs and corresponding outputs. Our approach is generative: it learns
hash functions such that the learned hash codes maximally correlate each
input-output correspondence while also regenerating the inputs so as to
minimize the information loss. The learning-to-hash embedding is thus performed
to jointly optimize the parameters of the hash functions across modalities as
well as the associated generative models. Extensive experiments on a variety of
large-scale cross-modal data sets demonstrate that our proposed method achieves
better retrieval results than the state of the art.
Comment: To appear in IEEE Trans. Image Processing. arXiv admin note: text
overlap with arXiv:1703.10593 by other author