Visual Commonsense based Heterogeneous Graph Contrastive Learning
How to select relevant key objects and reason about the complex relationships
across the vision and language domains are two key issues in many multi-modal
applications such as visual question answering (VQA). In this work, we
incorporate the visual commonsense information and propose a heterogeneous
graph contrastive learning method to better finish the visual reasoning task.
Our method is designed in a plug-and-play manner, so that it can be quickly and
easily combined with a wide range of representative methods. Specifically, our
model contains two key components: the Commonsense-based Contrastive Learning
and the Graph Relation Network. With contrastive learning, we guide the model
to concentrate on discriminative objects and relevant visual commonsense
attributes. Moreover, with the introduction of the Graph Relation Network,
the model reasons about the correlations between homogeneous edges and the
similarities between heterogeneous edges, which makes information transmission
more effective. Extensive experiments on four benchmarks show that our method
greatly improves seven representative VQA models, demonstrating its
effectiveness and generalizability.
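The commonsense-based contrastive objective described above can be illustrated with a minimal InfoNCE-style sketch: object features are pulled toward their matching commonsense-attribute features and pushed away from the other pairs in the batch. This is a generic contrastive loss under assumed toy features, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: row i of `positives` is the
    positive for row i of `anchors`; all other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # match the i-th pair

# toy usage: object features paired with correlated commonsense-attribute features
obj = rng.standard_normal((8, 64))
attr = obj + 0.05 * rng.standard_normal((8, 64))        # hypothetical positives
loss = info_nce_loss(obj, attr)
```

Minimising this loss concentrates the representation on features that discriminate each object from the rest of the batch, which is the intuition behind the contrastive component.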
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to
combine various modalities in a single joint representation. Especially in the
area of visiolinguistic (VL) learning, multiple models and techniques have been
developed, targeting a variety of tasks that involve images and text. VL models
have reached unprecedented performance by extending the idea of Transformers,
so that both modalities can learn from each other. Massive pre-training
procedures enable VL models to acquire a certain level of real-world
understanding, although many gaps can be identified: the limited comprehension
of commonsense, factual, temporal, and other everyday knowledge calls into
question how far VL tasks can be extended. Knowledge graphs and other knowledge
sources can fill those gaps by explicitly providing missing information,
unlocking novel capabilities of VL models. At the same time, knowledge graphs
enhance the explainability, fairness, and validity of decision making, issues of
utmost importance for such complex implementations. The current survey aims
to unify the fields of VL representation learning and knowledge graphs, and
provides a taxonomy and analysis of knowledge-enhanced VL models.
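The gap-filling role the survey attributes to knowledge graphs can be sketched with a toy triple store: detected entities are enriched with explicit commonsense facts that a purely pre-trained VL model may lack. The triples, relation names, and `enrich` helper below are all hypothetical illustrations, not an API from the surveyed work.

```python
# toy knowledge graph stored as (subject, relation) -> object triples
kg = {
    ("umbrella", "used_for"): "blocking rain",
    ("umbrella", "found_in"): "hand",
    ("dog", "capable_of"): "barking",
}

def enrich(entity, kg):
    """Return the commonsense facts attached to a detected entity,
    keyed by relation name."""
    return {rel: obj for (subj, rel), obj in kg.items() if subj == entity}

facts = enrich("umbrella", kg)
# facts == {'used_for': 'blocking rain', 'found_in': 'hand'}
```

In a knowledge-enhanced VL model, such retrieved facts would typically be encoded and fused with the visual and textual features rather than used as raw strings.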
Leveraging commonsense for object localisation in partial scenes
We propose an end-to-end solution to address the problem of object
localisation in partial scenes, where we aim to estimate the position of an
object in an unknown area given only a partial 3D scan of the scene. We propose
a novel scene representation to facilitate geometric reasoning, the Directed
Spatial Commonsense Graph (D-SCG), a spatial scene graph that is enriched with
additional concept nodes from a commonsense knowledge base. Specifically, the
nodes of D-SCG represent the scene objects and the edges are their relative
positions. Each object node is then connected via different commonsense
relationships to a set of concept nodes. With the proposed graph-based scene
representation, we estimate the unknown position of the target object using a
Graph Neural Network that implements a novel attentional message passing
mechanism. The network first predicts the relative positions between the target
object and each visible object by learning a rich representation of the objects
via aggregating both the object nodes and the concept nodes in D-SCG. These
relative positions are then merged to obtain the final position. We evaluate
our method on Partial ScanNet, improving the state of the art by 5.9% in
localisation accuracy at an 8x faster training speed.
Comment: arXiv admin note: text overlap with arXiv:2203.0538
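The attentional message passing used over D-SCG can be illustrated with a minimal single-round sketch: each node aggregates its neighbours' features, weighted by learned attention restricted to graph edges. The weight shapes, adjacency layout, and residual update below are simplifying assumptions, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_rows(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def attn_message_pass(x, adj, Wq, Wk, Wv):
    """One round of attention-weighted message passing: nodes attend
    only to graph neighbours, then receive a residual update."""
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(x.shape[1])
    scores = np.where(adj > 0, scores, -1e9)   # mask non-edges
    attn = softmax_rows(scores)                # neighbour attention weights
    return x + attn @ (x @ Wv)                 # residual node update

# toy scene graph: 4 object/concept nodes connected in a ring
n, d = 4, 16
x = rng.standard_normal((n, d))
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = attn_message_pass(x, adj, Wq, Wk, Wv)
```

Stacking several such rounds lets information from commonsense concept nodes propagate into the object nodes whose positions are being predicted.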
The Knowledge Level in Cognitive Architectures: Current Limitations and Possible Developments
In this paper we identify and characterize two problematic aspects affecting the representational level of cognitive architectures (CAs): the limited size and the homogeneous typology of the encoded and processed knowledge.
We argue that these aspects constitute not only a technological problem that, in our opinion, should be addressed in order to build artificial agents able to exhibit intelligent behaviour in general scenarios, but also an epistemological one, since they limit the plausibility of comparing the CAs' knowledge representation and processing mechanisms with those employed by humans in their everyday activities. In the final part of the paper, further directions of research are explored to address current limitations and future challenges.