
    VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

    Relation-focused cross-modal information retrieval focuses on retrieving information based on relations expressed in user queries, and it is particularly important in information retrieval applications and next-generation search engines. While pre-trained networks like Contrastive Language-Image Pre-training (CLIP) have achieved state-of-the-art performance in cross-modal learning tasks, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a Local encoder. VITR comprises two main components: (1) extending the capabilities of ViT-based cross-modal networks to extract and reason with region relations in images; and (2) aggregating the reasoned results with the global knowledge to predict the similarity scores between images and descriptions. Experiments were carried out by applying the proposed network to relation-focused cross-modal information retrieval tasks on the Flickr30K, RefCOCOg, and CLEVR datasets. The results revealed that the proposed VITR network outperformed various other state-of-the-art networks including CLIP, VSE∞, and VSRN++ on both image-to-text and text-to-image cross-modal information retrieval tasks.
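The global-level matching that the abstract attributes to CLIP-style networks can be illustrated with a minimal sketch. It assumes each image and each description has already been encoded into a single embedding vector, and it scores pairs by cosine similarity. The function names and the toy embeddings are hypothetical; this is not VITR's local encoder or its aggregation step, which the abstract does not specify in detail.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Cosine similarity between every image and every description.

    image_emb: (n_images, d) global image embeddings (assumed precomputed).
    text_emb:  (n_texts, d) global text embeddings (assumed precomputed).
    Returns an (n_images, n_texts) matrix of similarity scores.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # L2-normalise each row so the dot product equals the cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T

def rank_by_similarity(sim):
    """For each row of sim, return column indices sorted from best to worst."""
    return np.argsort(-sim, axis=1)

# Toy example with hypothetical 2-D embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.0, 2.0], [3.0, 0.0]]
sim = cosine_similarity_matrix(images, texts)
image_to_text = rank_by_similarity(sim)    # descriptions ranked per image
text_to_image = rank_by_similarity(sim.T)  # images ranked per description
```

Because each item is reduced to one vector, the score carries no information about which image region corresponds to which part of the description. That is the limitation the abstract says VITR targets.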

    Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

    Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art. Comment: CVPR 2018 pre-print

    Distinct neural substrates of visuospatial and verbal-analytic reasoning as assessed by Raven’s Advanced Progressive Matrices

    Recent studies revealed spontaneous neural activity to be associated with fluid intelligence (gF), which is commonly assessed by Raven's Advanced Progressive Matrices and embeds two types of reasoning: visuospatial and verbal-analytic reasoning. With resting-state fMRI data, using global brain connectivity (GBC) analysis, which averages the functional connectivity of a voxel in relation to all other voxels in the brain, distinct neural correlates of these two reasoning types were found. For visuospatial reasoning, negative correlations were observed in both the primary visual cortex (PVC) and the precuneus, and positive correlations were observed in the temporal lobe. For verbal-analytic reasoning, negative correlations were observed in the right inferior frontal gyrus (rIFG), dorsal anterior cingulate cortex and temporoparietal junction, and positive correlations were observed in the angular gyrus. Furthermore, an interaction between GBC value and type of reasoning was found in the PVC, rIFG and the temporal lobe. These findings suggest that visuospatial reasoning benefits more from elaborate perception of stimulus features, whereas verbal-analytic reasoning benefits more from feature integration and hypothesis testing. In sum, the present study offers, for different types of reasoning in gF, the first empirical evidence of separate neural substrates in the resting brain.
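The GBC measure described in the abstract (the average functional connectivity of a voxel with all other voxels) can be sketched as follows. This is a minimal illustration that assumes functional connectivity is the Pearson correlation between voxel time series. The function name and input layout are hypothetical, and preprocessing choices that a study might apply, such as a Fisher z-transform before averaging or a grey-matter mask, are deliberately left out.

```python
import numpy as np

def global_brain_connectivity(ts):
    """Mean Pearson correlation of each voxel with every other voxel.

    ts: array of shape (n_timepoints, n_voxels), one column per voxel
        (assumed layout). Returns an array of shape (n_voxels,).
    """
    ts = np.asarray(ts, dtype=float)
    n_voxels = ts.shape[1]
    # Voxel-by-voxel correlation matrix; columns are treated as variables.
    corr = np.corrcoef(ts, rowvar=False)
    # Exclude each voxel's correlation with itself from the average.
    np.fill_diagonal(corr, 0.0)
    return corr.sum(axis=1) / (n_voxels - 1)

# Toy example: three "voxels" built from one ramp signal.
x = np.arange(10.0)
ts = np.column_stack([x, 2 * x + 1, -x])
gbc = global_brain_connectivity(ts)
```

In the toy data the first two voxels correlate perfectly with each other and inversely with the third, so their averages cancel, while the third voxel is anticorrelated with both of the others.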

    Relational Reasoning Network (RRN) for Anatomical Landmarking

    Accurately identifying anatomical landmarks is a crucial step in deformation analysis and surgical planning for craniomaxillofacial (CMF) bones. Available methods require segmentation of the object of interest for precise landmarking. In contrast, our purpose in this study is to perform anatomical landmarking using the inherent relations of CMF bones without explicitly segmenting them. We propose a new deep network architecture, called relational reasoning network (RRN), to accurately learn the local and the global relations of the landmarks. Specifically, we are interested in learning landmarks in the CMF region: mandible, maxilla, and nasal bones. The proposed RRN works in an end-to-end manner, utilizing learned relations of the landmarks based on dense-block units and without the need for segmentation. Given a few landmarks as input, the proposed system accurately and efficiently localizes the remaining landmarks on the aforementioned bones. For a comprehensive evaluation of RRN, we used cone-beam computed tomography (CBCT) scans of 250 patients. The proposed system identifies the landmark locations very accurately even when there are severe pathologies or deformations in the bones. The proposed RRN has also revealed unique relationships among the landmarks that help us reason about the informativeness of the landmark points. RRN is invariant to the order of landmarks, and it allowed us to discover the optimal configurations (number and location) for landmarks to be localized within the object of interest (mandible) or nearby objects (maxilla and nasal). To the best of our knowledge, this is the first-of-its-kind algorithm finding anatomical relations of the objects using deep learning. Comment: 10 pages, 6 Figures, 3 Tables