VTKG: A Vision Transformer Model with Integration of Knowledge Graph for Enhanced Image Captioning
The Transformer model has exhibited impressive results in machine translation tasks. In this research, we utilize the Transformer model to improve the performance of image captioning. We tackle the image captioning task from a novel sequence-to-sequence perspective and present VTKG, a Vision Transformer model with an integrated Knowledge Graph: a comprehensive Transformer network that substitutes the CNN in the encoder section with a convolution-free Transformer encoder. Subsequently, to enhance the generation of meaningful captions and address the issue of mispredictions, we introduce a novel approach to integrating common-sense knowledge extracted from a knowledge graph, which significantly improves the overall adaptability of our captioning model. By combining these strategies, we attain exceptional performance on multiple established evaluation metrics, outperforming existing benchmarks. Experimental results demonstrate improvements of 1.32%, 1.7%, 1.25%, 1.14%, 2.8% and 2.5% in BLEU-1, BLEU-2, BLEU-4, METEOR, ROUGE-L and CIDEr scores respectively when compared to state-of-the-art methods.
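The abstract stops short of implementation detail; as a minimal sketch of the described encoder swap and knowledge fusion, assuming a ViT-style patch encoder and pre-computed KG entity embeddings (both our assumptions, not the authors' released code), the architecture might look like this:

```python
# Illustrative sketch only: layer sizes, names, and the fusion step are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class VTKGSketch(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Convolution-free encoder: 16x16 RGB image patches are linearly
        # projected and fed to a standard Transformer encoder (ViT-style).
        self.patch_proj = nn.Linear(16 * 16 * 3, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Hypothetical knowledge side: pre-computed KG entity embeddings
        # retrieved for concepts detected in the image.
        self.kg_proj = nn.Linear(d_model, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, kg_embeddings, caption_tokens):
        visual = self.encoder(self.patch_proj(patches))   # (B, P, d)
        knowledge = self.kg_proj(kg_embeddings)           # (B, K, d)
        # Fuse by concatenating visual and knowledge tokens as decoder
        # memory (causal masking of the decoder omitted for brevity).
        memory = torch.cat([visual, knowledge], dim=1)
        tgt = self.embed(caption_tokens)
        return self.lm_head(self.decoder(tgt, memory))
```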
Towards automated knowledge-based mapping between individual conceptualisations to empower personalisation of Geospatial Semantic Web
The geospatial domain is characterised by vagueness, especially in the semantic disambiguation of the concepts in the domain, which makes defining a universally accepted geo-ontology an onerous task. This is compounded by the lack of appropriate methods and techniques by which individual semantic conceptualisations can be captured and compared to each other. With multiple user conceptualisations, efforts towards a reliable Geospatial Semantic Web therefore require personalisation, where user diversity can be incorporated. The work presented in this paper is part of our ongoing research on applying commonsense reasoning to elicit and maintain models that represent users' conceptualisations. Such user models will enable taking into account the users' perspective of the real world and will empower personalisation algorithms for the Semantic Web. Intelligent information processing over the Semantic Web can be achieved if different conceptualisations can be integrated in a semantic environment and mismatches between different conceptualisations can be outlined. In this paper, a formal approach for detecting mismatches between a user's and an expert's conceptual model is outlined. The formalisation is used as the basis to develop algorithms to compare models defined in OWL. The algorithms are illustrated in a geographical domain using concepts from the SPACE ontology developed as part of the SWEET suite of ontologies for the Semantic Web by NASA, and are evaluated by comparing test cases of possible user misconceptions.
Detecting Mismatches between a User's and an Expert's Conceptualisations
The work presented in this paper is part of our ongoing research on applying commonsense reasoning to elicit and maintain models that represent users' conceptualisations. Such user models will enable taking into account the users' perspective of the world and will empower personalisation algorithms for the Semantic Web. A formal approach for detecting mismatches between a user's and an expert's conceptual model is outlined. The formalisation is used as the basis to develop algorithms to compare two conceptualisations defined in OWL. The algorithms are illustrated in a geographical domain using a space ontology developed at NASA, and have been tested by simulating possible user misconceptions.
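Neither this abstract nor the one above spells out the comparison algorithms themselves; a minimal sketch of one kind of mismatch check — a user attaching a concept to a different parent than the expert — might look as follows, with the hierarchy encoding and mismatch labels invented for illustration rather than taken from the paper:

```python
# Illustrative sketch: a conceptualisation is reduced to a map from each
# concept to its asserted parents; real OWL models carry far more structure.
def find_mismatches(user_model, expert_model):
    """Compare two {concept: set(parents)} hierarchies and report
    concepts the user relates differently than the expert."""
    mismatches = []
    for concept, expert_parents in expert_model.items():
        if concept not in user_model:
            mismatches.append(("missing_concept", concept))
        elif user_model[concept] != expert_parents:
            mismatches.append(("different_parents", concept,
                               user_model[concept], expert_parents))
    for concept in user_model.keys() - expert_model.keys():
        mismatches.append(("extra_concept", concept))
    return mismatches

# Toy example in the geographical domain used by the paper.
expert = {"Lake": {"WaterBody"}, "WaterBody": {"SurfaceFeature"}}
user = {"Lake": {"SurfaceFeature"}, "WaterBody": {"SurfaceFeature"}}
print(find_mismatches(user, expert))
# [('different_parents', 'Lake', {'SurfaceFeature'}, {'WaterBody'})]
```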
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. This survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
Commonsense for Zero-Shot Natural Language Video Localization
Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
Comment: Accepted to AAAI 202
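As a rough illustration of the enhancement module described above (not the authors' code), the commonsense graph could be encoded with a single GCN step and fused into the video features via cross-attention; all dimensions, normalisation choices, and names below are assumptions:

```python
# Illustrative sketch of GCN encoding plus cross-attention fusion;
# dimensions, normalisation, and module names are assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbour features
    through a row-normalised adjacency matrix, then project."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # Row-normalise the adjacency so each node averages its neighbours.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.proj((adj / deg) @ node_feats))

class CommonsenseEnhancer(nn.Module):
    """Enhance video features by cross-attending to encoded
    commonsense-graph nodes (queries = video, keys/values = graph)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gcn = GCNLayer(dim)
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_feats, node_feats, adj):
        knowledge = self.gcn(node_feats, adj)           # (B, N, d)
        enhanced, _ = self.xattn(video_feats, knowledge, knowledge)
        return video_feats + enhanced                   # residual fusion

# Toy shapes: batch of 2 videos, 10 segments, graph of 5 concept nodes.
enh = CommonsenseEnhancer()
out = enh(torch.randn(2, 10, 256), torch.randn(2, 5, 256), torch.ones(2, 5, 5))
print(out.shape)  # torch.Size([2, 10, 256])
```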
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community's exploration into multi-modal dimensions unlocking new avenues for innovation. In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into the MMKG realm. We begin by defining KGs and MMKGs, then explore their construction progress. Our review includes two primary task categories: KG-aware multi-modal learning tasks, such as Image Classification and Visual Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph Completion and Entity Alignment, highlighting specific research trajectories. For most of these tasks, we provide definitions, evaluation benchmarks, and additionally outline essential insights for conducting relevant research. Finally, we discuss current challenges and identify emerging trends, such as progress in Large Language Modeling and Multi-modal Pre-training strategies. This survey aims to serve as a comprehensive reference for researchers already involved in or considering delving into KG and multi-modal learning research, offering insights into the evolving landscape of MMKG research and supporting future work.
Comment: Ongoing work; 41 pages (Main Text), 55 pages (Total), 11 Tables, 13 Figures, 619 citations; Paper list is available at https://github.com/zjukg/KG-MM-Surve
Sensory semantic user interfaces (SenSUI)
The rapid evolution of the World Wide Web, with its underlying sources of data, knowledge, services and applications, continually attempts to support a variety of users with different backgrounds, requirements and capabilities. In such an environment, it is highly unlikely that a single user interface will prevail and be able to fulfil the requirements of each user adequately. Adaptive user interfaces are able to adapt information and application functionalities to the user context. In contrast, pervasive computing and sensor networks open new opportunities for context-aware platforms that are able to improve user interface adaptation by reacting to environmental and user sensors. Semantic web technologies and ontologies are able to capture sensor data and provide contextual information about the user, their actions, required applications and environment. This paper investigates the viability of an approach where semantic web technologies are used to maximize the efficacy of interface adaptation through the use of available ontologies.
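The abstract stays at the architectural level; as a hedged sketch of the idea, contextual sensor readings could be captured as RDF and queried to drive an adaptation decision. The namespace, properties, and query below are invented for illustration and are not taken from SenSUI; rdflib is used only as one standard RDF toolkit:

```python
# Illustrative only: the context namespace and properties are hypothetical,
# not part of the paper.
from rdflib import Graph, Literal, Namespace, RDF

CTX = Namespace("http://example.org/sensui/context#")  # hypothetical namespace

g = Graph()
g.bind("ctx", CTX)

# Capture a sensor reading as contextual information about the user.
g.add((CTX.reading1, RDF.type, CTX.AmbientLightReading))
g.add((CTX.reading1, CTX.luxLevel, Literal(12)))
g.add((CTX.reading1, CTX.aboutUser, CTX.user42))

# Query the context graph to decide on an interface adaptation.
results = g.query("""
    PREFIX ctx: <http://example.org/sensui/context#>
    SELECT ?lux WHERE {
        ?r a ctx:AmbientLightReading ;
           ctx:luxLevel ?lux ;
           ctx:aboutUser ctx:user42 .
    }
""")
for (lux,) in results:
    theme = "dark" if int(lux) < 50 else "light"
    print(f"Adapt interface to {theme} theme (ambient light: {lux} lux)")
```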