3,633 research outputs found
Scene Graph Generation via Conditional Random Fields
Despite the great success object detection and segmentation models have
achieved in recognizing individual objects in images, performance on cognitive
tasks such as image caption, semantic image retrieval, and visual QA is far
from satisfactory. To achieve better performance on these cognitive tasks,
merely recognizing individual object instances is insufficient. Instead, the
interactions between object instances need to be captured in order to
facilitate reasoning and understanding of the visual scenes in an image. Scene
graph, a graph representation of images that captures object instances and
their relationships, offers a comprehensive understanding of an image. However,
existing techniques on scene graph generation fail to distinguish subjects and
objects in the visual scenes of images and thus do not perform well with
real-world datasets where exist ambiguous object instances. In this work, we
propose a novel scene graph generation model for predicting object instances
and its corresponding relationships in an image. Our model, SG-CRF, learns the
sequential order of subject and object in a relationship triplet, and the
semantic compatibility of object instance nodes and relationship nodes in a
scene graph efficiently. Experiments empirically show that SG-CRF outperforms
the state-of-the-art methods, on three different datasets, i.e., CLEVR, VRD,
and Visual Genome, raising the Recall@100 from 24.99% to 49.95%, from 41.92% to
50.47%, and from 54.69% to 54.77%, respectively
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datatsets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.Comment: 25 page
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding
and reasoning across different modalities, has emerged as a pivotal area with
applications spanning from multimedia analysis to healthcare diagnostics. As
the deployment of AI systems becomes more ubiquitous, the demand for
transparency and comprehensibility in these systems' decision-making processes
has intensified. This survey delves into the realm of interpretable cross-modal
reasoning (I-CMR), where the objective is not only to achieve high predictive
performance but also to provide human-understandable explanations for the
results. This survey presents a comprehensive overview of the typical methods
with a three-level taxonomy for I-CMR. Furthermore, this survey reviews the
existing CMR datasets with annotations for explanations. Finally, this survey
summarizes the challenges for I-CMR and discusses potential future directions.
In conclusion, this survey aims to catalyze the progress of this emerging
research area by providing researchers with a panoramic and comprehensive
perspective, illuminating the state of the art and discerning the
opportunities
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI
applications, with the semantic web community's exploration into multi-modal
dimensions unlocking new avenues for innovation. In this survey, we carefully
review over 300 articles, focusing on KG-aware research in two principal
aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal
tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into
the MMKG realm. We begin by defining KGs and MMKGs, then explore their
construction progress. Our review includes two primary task categories:
KG-aware multi-modal learning tasks, such as Image Classification and Visual
Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph
Completion and Entity Alignment, highlighting specific research trajectories.
For most of these tasks, we provide definitions, evaluation benchmarks, and
additionally outline essential insights for conducting relevant research.
Finally, we discuss current challenges and identify emerging trends, such as
progress in Large Language Modeling and Multi-modal Pre-training strategies.
This survey aims to serve as a comprehensive reference for researchers already
involved in or considering delving into KG and multi-modal learning research,
offering insights into the evolving landscape of MMKG research and supporting
future work.Comment: Ongoing work; 41 pages (Main Text), 55 pages (Total), 11 Tables, 13
Figures, 619 citations; Paper list is available at
https://github.com/zjukg/KG-MM-Surve
- …