VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
Relation-focused cross-modal information retrieval concerns retrieving
information based on the relations expressed in user queries, and it is
particularly important in information retrieval applications and
next-generation search engines. While pre-trained networks like Contrastive
Language-Image Pre-training (CLIP) have achieved state-of-the-art performance
in cross-modal learning tasks, the Vision Transformer (ViT) used in these
networks is limited in its ability to focus on image region relations.
Specifically, ViT is trained to match images with relevant descriptions at the
global level, without considering the alignment between image regions and
descriptions. This paper introduces VITR, a novel network that enhances ViT by
extracting and reasoning about image region relations based on a local encoder.
VITR comprises two main components: (1) extending the capabilities of ViT-based
cross-modal networks to extract and reason with region relations in images; and
(2) aggregating the reasoned results with the global knowledge to predict the
similarity scores between images and descriptions. Experiments were carried out
by applying the proposed network to relation-focused cross-modal information
retrieval tasks on the Flickr30K, RefCOCOg, and CLEVR datasets. The results
revealed that the proposed VITR network outperformed various other
state-of-the-art networks including CLIP, VSE, and VSRN++ on both
image-to-text and text-to-image cross-modal information retrieval tasks.
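The aggregation step described above — combining region-level relation reasoning with global image-text matching — can be sketched as a weighted fusion of two similarity scores. This is a minimal illustration, not VITR's actual architecture; the function names, the max-pooling over regions, and the `alpha` weighting are all illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_similarity(img_global, txt_global, region_feats, txt_feat, alpha=0.5):
    """Combine a global image-text similarity (CLIP-style) with an
    aggregated region-level similarity. All names here are illustrative,
    not VITR's actual API; `alpha` weights the two terms."""
    global_score = cosine(img_global, txt_global)
    # Region-level scores: similarity of each region embedding to the text,
    # aggregated by a max over regions (one simple pooling choice).
    region_scores = [cosine(r, txt_feat) for r in region_feats]
    local_score = max(region_scores)
    return alpha * global_score + (1.0 - alpha) * local_score
```

With `alpha=1.0` this degenerates to purely global matching (the ViT/CLIP behaviour the paper critiques); lowering `alpha` lets region-text alignment influence the ranking.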
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Visual question answering requires high-order reasoning about an image, which
is a fundamental capability needed by machine systems to follow complex
directives. Recently, modular networks have been shown to be an effective
framework for performing visual reasoning tasks. While modular networks were
initially designed with a degree of model transparency, their performance on
complex visual reasoning benchmarks was lacking. Current state-of-the-art
approaches do not provide an effective mechanism for understanding the
reasoning process. In this paper, we close the performance gap between
interpretable models and state-of-the-art visual reasoning methods. We propose
a set of visual-reasoning primitives which, when composed, manifest as a model
capable of performing complex reasoning tasks in an explicitly-interpretable
manner. The fidelity and interpretability of the primitives' outputs enable an
unparalleled ability to diagnose the strengths and weaknesses of the resulting
model. Critically, we show that these primitives are highly performant,
achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show
that our model is able to effectively learn generalized representations when
provided a small amount of data containing novel object attributes. Using the
CoGenT generalization task, we show more than a 20 percentage point improvement
over the current state of the art.
Comment: CVPR 2018 pre-print
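The idea of composable, inspectable visual-reasoning primitives can be illustrated with attention-map operations: each primitive consumes and produces a spatial attention mask, so every intermediate step can be visualised. This is a hand-written sketch in the spirit of such modular networks; the `attend`/`intersect`/`exist` functions are stand-ins for learned modules, not the paper's implementation.

```python
import numpy as np

def attend(feature_map, template):
    """Attend to locations whose feature vector matches a template
    (dot product per location, clipped to be non-negative).
    feature_map has shape (H, W, C); template has shape (C,)."""
    scores = np.einsum('hwc,c->hw', feature_map, template)
    return np.clip(scores, 0.0, None)

def intersect(att_a, att_b):
    """Logical-AND of two attention masks via elementwise product,
    so the composed mask stays directly interpretable."""
    return att_a * att_b

def exist(att, threshold=0.5):
    """Answer an existence question from the final attention mask."""
    return bool(att.max() > threshold)
```

Composing these on a tiny feature map, `exist(intersect(attend(fm, red), attend(fm, cube)))` answers "is there a red cube?", and each intermediate mask can be inspected to diagnose where the reasoning went wrong.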
Distinct neural substrates of visuospatial and verbal-analytic reasoning as assessed by Raven’s Advanced Progressive Matrices
Recent studies revealed spontaneous neural activity to be associated with fluid intelligence (gF), which is commonly assessed by Raven's Advanced Progressive Matrices and embeds two types of reasoning: visuospatial and verbal-analytic reasoning. With resting-state fMRI data, using global brain connectivity (GBC) analysis, which averages the functional connectivity of a voxel in relation to all other voxels in the brain, distinct neural correlates of these two reasoning types were found. For visuospatial reasoning, negative correlations were observed in both the primary visual cortex (PVC) and the precuneus, and positive correlations were observed in the temporal lobe. For verbal-analytic reasoning, negative correlations were observed in the right inferior frontal gyrus (rIFG), dorsal anterior cingulate cortex, and temporoparietal junction, and positive correlations were observed in the angular gyrus. Furthermore, an interaction between GBC value and type of reasoning was found in the PVC, rIFG, and the temporal lobe. These findings suggest that visuospatial reasoning benefits more from elaborate perception of stimulus features, whereas verbal-analytic reasoning benefits more from feature integration and hypothesis testing. In sum, the present study offers the first empirical evidence of separate neural substrates in the resting brain for the different types of reasoning in gF.
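The GBC measure defined above — each voxel's functional connectivity averaged over all other voxels — reduces to mean pairwise Pearson correlation of time series. The following is a simplified numpy sketch of that computation, not the authors' preprocessing pipeline (which would also involve registration, masking, and nuisance regression).

```python
import numpy as np

def global_brain_connectivity(ts):
    """Global brain connectivity (GBC): for each voxel, the mean Pearson
    correlation of its time series with every other voxel's time series.
    `ts` has shape (n_voxels, n_timepoints)."""
    # Z-score each voxel's time series (population std), so that
    # z @ z.T / n_timepoints is exactly the Pearson correlation matrix.
    z = (ts - ts.mean(axis=1, keepdims=True)) / ts.std(axis=1, keepdims=True)
    n_t = ts.shape[1]
    corr = z @ z.T / n_t
    np.fill_diagonal(corr, 0.0)   # exclude self-correlation
    return corr.sum(axis=1) / (ts.shape[0] - 1)
```

A voxel perfectly correlated with half the brain and anti-correlated with the other half would thus get a GBC near zero, which is why GBC highlights globally coherent hubs rather than any single strong connection.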
Relational Reasoning Network (RRN) for Anatomical Landmarking
Accurately identifying anatomical landmarks is a crucial step in deformation
analysis and surgical planning for craniomaxillofacial (CMF) bones. Available
methods require segmentation of the object of interest for precise landmarking.
Unlike those, our purpose in this study is to perform anatomical landmarking
using the inherent relation of CMF bones without explicitly segmenting them. We
propose a new deep network architecture, called relational reasoning network
(RRN), to accurately learn the local and the global relations of the landmarks.
Specifically, we are interested in learning landmarks in CMF region: mandible,
maxilla, and nasal bones. The proposed RRN works in an end-to-end manner,
utilizing learned relations of the landmarks based on dense-block units and
without the need for segmentation. Given a few landmarks as input, the
proposed system accurately and efficiently localizes the remaining landmarks on
the aforementioned bones. For a comprehensive evaluation of RRN, we used
cone-beam computed tomography (CBCT) scans of 250 patients. The proposed system
identifies the landmark locations very accurately even when there are severe
pathologies or deformations in the bones. The proposed RRN has also revealed
unique relationships among the landmarks that help us draw inferences about
the informativeness of the landmark points. RRN is invariant to the order of
landmarks and it allowed us to discover the optimal configurations (number and
location) for landmarks to be localized within the object of interest
(mandible) or nearby objects (maxilla and nasal). To the best of our knowledge,
this is the first of its kind algorithm finding anatomical relations of the
objects using deep learning.
Comment: 10 pages, 6 figures, 3 tables
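The order-invariance claimed for RRN follows naturally from relational architectures that sum a learned relation function over all ordered landmark pairs before a readout. The sketch below follows that generic relational-reasoning recipe (a pairwise function g summed, then a readout f), not RRN's actual dense-block design; the weight matrices are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random stand-ins for trained parameters: g maps a concatenated
# landmark pair (6-d) to a 16-d relation feature; f regresses the
# aggregated relations to a 3-d (x, y, z) landmark prediction.
W_g = rng.normal(scale=0.1, size=(6, 16))
W_f = rng.normal(scale=0.1, size=(16, 3))

def relu(x):
    return np.maximum(x, 0.0)

def predict_landmark(known):
    """Predict a missing landmark from the given ones.
    known: (n, 3) array of known landmark coordinates."""
    n = known.shape[0]
    rel_sum = np.zeros(16)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair = np.concatenate([known[i], known[j]])  # (6,)
            rel_sum += relu(pair @ W_g)                  # g(l_i, l_j)
    return rel_sum @ W_f                                 # f(sum of relations)
```

Because the relation features are summed over all pairs, permuting the input landmarks leaves the prediction unchanged, mirroring the order-invariance property reported for RRN.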