6 research outputs found
Decoding Brain Representations by Multimodal Learning of Neural Activity and Visual Features
This work presents a novel method of exploring human brain-visual
representations, with a view towards replicating these processes in machines.
The core idea is to learn plausible computational and biological
representations by correlating human neural activity and natural images. Thus,
we first propose a model, EEG-ChannelNet, to learn a brain manifold for EEG
classification. After verifying that visual information can be extracted from
EEG data, we introduce a multimodal approach that uses deep image and EEG
encoders, trained in a siamese configuration, for learning a joint manifold
that maximizes a compatibility measure between visual features and brain
representations. We then carry out image classification and saliency detection
on the learned manifold. Performance analyses show that our approach
satisfactorily decodes visual information from neural signals. This, in turn,
can be used to effectively supervise the training of deep learning models, as
demonstrated by the high performance of image classification and saliency
detection on out-of-training classes. The obtained results show that the
learned brain-visual features lead to improved performance and simultaneously
bring deep models more in line with cognitive neuroscience work related to
visual perception and attention.
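To make the siamese joint-embedding idea concrete, the following PyTorch sketch pairs an EEG encoder with an image encoder and trains them to maximize a compatibility score for matching EEG/image pairs. The encoder architectures, dimensions, and hinge-style loss are illustrative assumptions, not the paper's EEG-ChannelNet or its exact training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    """Minimal sketch of a siamese joint embedding: an EEG encoder and an image
    encoder project into a shared space where a compatibility score (cosine
    similarity here) is maximized for matching pairs. Both encoders are
    placeholders, not the paper's EEG-ChannelNet."""

    def __init__(self, eeg_channels=128, eeg_samples=440, embed_dim=256):
        super().__init__()
        # Placeholder EEG encoder: temporal convolutions over raw EEG trials.
        self.eeg_encoder = nn.Sequential(
            nn.Conv1d(eeg_channels, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Placeholder image encoder: a small CNN instead of a pretrained backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, eeg, image):
        z_eeg = F.normalize(self.eeg_encoder(eeg), dim=-1)
        z_img = F.normalize(self.image_encoder(image), dim=-1)
        return z_eeg, z_img

def compatibility_loss(z_eeg, z_img, margin=0.2):
    """Hinge loss: matched EEG/image pairs should score higher than mismatched ones."""
    pos = (z_eeg * z_img).sum(dim=-1)                      # compatibility of matched pairs
    neg = (z_eeg * z_img.roll(shifts=1, dims=0)).sum(-1)   # shifted batch as negatives
    return F.relu(margin - pos + neg).mean()

# Usage with dummy data
model = JointEmbeddingModel()
eeg = torch.randn(8, 128, 440)      # batch of EEG trials: (B, channels, time)
img = torch.randn(8, 3, 224, 224)   # batch of images
loss = compatibility_loss(*model(eeg, img))
```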
Reciprocal Attention Fusion for Visual Question Answering
Existing attention mechanisms attend to either local image-grid or object-level
features for Visual Question Answering (VQA). Motivated by the
observation that questions can relate to both object instances and their parts,
we propose a novel attention mechanism that jointly considers reciprocal
relationships between the two levels of visual details. The bottom-up attention
thus generated is further coalesced with the top-down information to only focus
on the scene elements that are most relevant to a given question. Our design
hierarchically fuses multi-modal information, i.e., language, object-, and
grid-level features, through an efficient tensor decomposition scheme. The
proposed model improves the state-of-the-art single model performances from
67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, demonstrating a
significant boost.
Comment: To appear in the British Machine Vision Conference (BMVC), September 201
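As a rough illustration of the idea described above (question-guided attention over both grid- and object-level features, fused through a low-rank bilinear scheme), the PyTorch sketch below uses an MFB-style factorized fusion. All module names, dimensions, and the specific fusion operator are assumptions for clarity, not the reported architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankFusion(nn.Module):
    """MFB-style low-rank bilinear fusion: project both inputs to rank*out dims,
    multiply elementwise, then sum-pool over the rank factor."""
    def __init__(self, dim_a, dim_b, out_dim, rank=5):
        super().__init__()
        self.rank, self.out_dim = rank, out_dim
        self.proj_a = nn.Linear(dim_a, rank * out_dim)
        self.proj_b = nn.Linear(dim_b, rank * out_dim)

    def forward(self, a, b):
        joint = self.proj_a(a) * self.proj_b(b)
        joint = joint.view(*joint.shape[:-1], self.rank, self.out_dim).sum(-2)
        return F.normalize(joint, dim=-1)

class DualLevelAttention(nn.Module):
    """Question-guided attention over grid features and object features,
    with the two attended summaries fused for answer prediction."""
    def __init__(self, q_dim=512, v_dim=2048, fused_dim=512, num_answers=3000):
        super().__init__()
        self.grid_fuse = LowRankFusion(v_dim, q_dim, fused_dim)
        self.obj_fuse = LowRankFusion(v_dim, q_dim, fused_dim)
        self.grid_att = nn.Linear(fused_dim, 1)
        self.obj_att = nn.Linear(fused_dim, 1)
        self.final_fuse = LowRankFusion(2 * v_dim, q_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_answers)

    def attend(self, feats, q, fuse, score):
        # feats: (B, N, v_dim); q: (B, q_dim)
        joint = fuse(feats, q.unsqueeze(1).expand(-1, feats.size(1), -1))
        alpha = F.softmax(score(joint), dim=1)    # attention weights over N regions
        return (alpha * feats).sum(dim=1)         # attended visual summary (B, v_dim)

    def forward(self, grid_feats, obj_feats, q):
        v_grid = self.attend(grid_feats, q, self.grid_fuse, self.grid_att)
        v_obj = self.attend(obj_feats, q, self.obj_fuse, self.obj_att)
        fused = self.final_fuse(torch.cat([v_grid, v_obj], dim=-1), q)
        return self.classifier(fused)

# Usage with dummy tensors
model = DualLevelAttention()
logits = model(torch.randn(4, 196, 2048),   # 14x14 grid features
               torch.randn(4, 36, 2048),    # 36 object proposals
               torch.randn(4, 512))         # question embedding
```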
Learning Conditioned Graph Structures for Interpretable Visual Question Answering
Visual Question Answering is a challenging problem requiring a combination of
concepts from Computer Vision and Natural Language Processing. Most existing
approaches use a two-stream strategy, computing image and question features
that are subsequently merged using a variety of techniques. Nonetheless, very
few rely on higher level image representations, which can capture semantic and
spatial relationships. In this paper, we propose a novel graph-based approach
for Visual Question Answering. Our method combines a graph learner module,
which learns a question-specific graph representation of the input image, with
the recent concept of graph convolutions, aiming to learn image representations
that capture question-specific interactions. We test our approach on the VQA v2
dataset using a simple baseline architecture enhanced by the proposed graph
learner module. We obtain promising results with 66.18% accuracy and
demonstrate the interpretability of the proposed method. Code can be found at
github.com/aimbrain/vqa-project.
Comment: NIPS 2018 (13 pages, 7 figures)
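The sketch below illustrates the core mechanism this abstract describes: learn a question-specific adjacency matrix over object features, then apply a graph convolution A·X·W. The dot-product graph learner, the single-layer GCN, and all dimensions are illustrative choices, not the exact architecture in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionConditionedGraph(nn.Module):
    """Minimal sketch: question-conditioned graph learner followed by one
    graph-convolution step over object features."""
    def __init__(self, v_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        self.node_proj = nn.Linear(v_dim + q_dim, hidden)  # condition nodes on the question
        self.gcn_weight = nn.Linear(v_dim, hidden)

    def forward(self, obj_feats, q):
        # obj_feats: (B, N, v_dim), q: (B, q_dim)
        q_tiled = q.unsqueeze(1).expand(-1, obj_feats.size(1), -1)
        nodes = self.node_proj(torch.cat([obj_feats, q_tiled], dim=-1))   # (B, N, hidden)
        # Question-specific adjacency from pairwise similarities of conditioned nodes.
        adj = F.softmax(torch.bmm(nodes, nodes.transpose(1, 2)), dim=-1)  # (B, N, N)
        # One graph-convolution step: aggregate neighbours, then transform.
        out = F.relu(torch.bmm(adj, self.gcn_weight(obj_feats)))          # (B, N, hidden)
        return out, adj  # the adjacency can be inspected for interpretability

# Usage with dummy inputs
layer = QuestionConditionedGraph()
node_repr, adjacency = layer(torch.randn(2, 36, 2048), torch.randn(2, 1024))
```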
Component Analysis for Visual Question Answering Architectures
Recent research advances in Computer Vision and Natural Language Processing
have introduced novel tasks that are paving the way for solving AI-complete
problems. One of those tasks is called Visual Question Answering (VQA). A VQA
system must take an image and a free-form, open-ended natural language question
about the image, and produce a natural language answer as the output. Such a
task has drawn great attention from the scientific community, which generated a
plethora of approaches that aim to improve the VQA predictive accuracy. Most of
them comprise three major components: (i) independent representation learning
of images and questions; (ii) feature fusion so the model can use information
from both sources to answer visual questions; and (iii) the generation of the
correct answer in natural language. With so many approaches introduced recently,
the real contribution of each component to the model's final performance has
become unclear. The main goal of this paper is to provide a
comprehensive analysis regarding the impact of each component in VQA models.
Our extensive set of experiments covers both visual and textual elements, as
well as the combination of these representations in the form of fusion and
attention mechanisms. Our major contribution is to identify core components for
training VQA models so as to maximize their predictive performance.
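For readers unfamiliar with the three components being analyzed, the sketch below shows a deliberately minimal two-stream VQA baseline in which each component is a swappable module. Every choice here (GRU question encoder, elementwise-product fusion, answer classification over frequent answers) is a placeholder, not one of the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn

class TwoStreamVQA(nn.Module):
    """Sketch of the three components: (i) independent image/question encoders,
    (ii) a fusion step, (iii) an answer classifier. All modules are simple
    placeholders meant to be swapped out."""
    def __init__(self, vocab_size=10000, num_answers=3000, hidden=1024):
        super().__init__()
        # (i) independent representation learning
        self.embedding = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, hidden, batch_first=True)
        self.image_proj = nn.Linear(2048, hidden)   # assumes precomputed CNN features
        # (ii) feature fusion (elementwise product as the simplest choice)
        self.fusion = lambda v, q: v * q
        # (iii) answer generation, framed as classification over frequent answers
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, image_feats, question_tokens):
        _, q_state = self.gru(self.embedding(question_tokens))  # final GRU state (1, B, H)
        q = q_state.squeeze(0)
        v = torch.relu(self.image_proj(image_feats))
        return self.classifier(self.fusion(v, q))

# Usage with dummy inputs
model = TwoStreamVQA()
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 14)))
```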
Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool
In recent years, visual question answering (VQA) has become topical. The
premise of VQA's significance as a benchmark in AI is that both the image and
textual question need to be well understood and mutually grounded in order to
infer the correct answer. However, current VQA models perhaps `understand' less
than initially hoped, and instead master the easier task of exploiting cues
given away in the question and biases in the answer distribution. In this paper,
we propose the inverse problem of VQA (iVQA). The iVQA task is to generate a
question that corresponds to a given image and answer pair. We propose a
variational iVQA model that can generate diverse, grammatically correct, and
content-correlated questions that match the given answer. Based on this model,
we show that iVQA is an interesting benchmark for visuo-linguistic
understanding, and a more challenging alternative to VQA because an iVQA model
needs to understand the image better to be successful. As a second
contribution, we show how to use iVQA in a novel reinforcement learning
framework to diagnose any existing VQA model by way of exposing its belief set:
the set of question-answer pairs that the VQA model would predict true for a
given image. This provides a completely new window into what VQA models
`believe' about images. We show that existing VQA models have more erroneous
beliefs than previously thought, revealing their intrinsic weaknesses.
Suggestions are then made on how to address these weaknesses going forward.
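A bare-bones sketch of the iVQA setting is shown below: a recurrent decoder conditioned on an image feature and an answer embedding generates a question token by token. It omits the paper's variational formulation and the RL-based diagnosis procedure, and all sizes and module choices are assumptions.

```python
import torch
import torch.nn as nn

class IVQAGenerator(nn.Module):
    """Condition a GRU decoder on image and answer representations and decode a
    question; a minimal sketch of the inverse-VQA setting, not the paper's model."""
    def __init__(self, vocab_size=10000, answer_vocab=3000, hidden=512):
        super().__init__()
        self.answer_embed = nn.Embedding(answer_vocab, hidden)
        self.image_proj = nn.Linear(2048, hidden)   # assumes precomputed image features
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image_feats, answer_ids, question_tokens):
        # Initial decoder state fuses the image and the answer condition.
        h0 = torch.tanh(self.image_proj(image_feats) + self.answer_embed(answer_ids))
        outputs, _ = self.decoder(self.word_embed(question_tokens), h0.unsqueeze(0))
        return self.out(outputs)   # per-step vocabulary logits for teacher forcing

# Usage with dummy inputs
model = IVQAGenerator()
logits = model(torch.randn(4, 2048),               # image features
               torch.randint(0, 3000, (4,)),       # answer ids
               torch.randint(0, 10000, (4, 12)))   # shifted question tokens
```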
Multimodal Categorization of Crisis Events in Social Media
Recent developments in image classification and natural language processing,
coupled with the rapid growth in social media usage, have enabled fundamental
advances in detecting breaking events around the world in real-time. Emergency
response is one such area that stands to gain from these advances. By
processing billions of texts and images a minute, events can be automatically
detected to enable emergency response workers to better assess rapidly evolving
situations and deploy resources accordingly. To date, most event detection
techniques in this area have focused on image-only or text-only approaches,
limiting detection performance and impacting the quality of information
delivered to crisis response teams. In this paper, we present a new multimodal
fusion method that leverages both images and texts as input. In particular, we
introduce a cross-attention module that can filter uninformative and misleading
components from weak modalities on a sample-by-sample basis. In addition, we
employ a multimodal graph-based approach to stochastically transition between
embeddings of different multimodal pairs during training to better regularize
the learning process, as well as to deal with limited training data by
constructing new matched pairs from different samples. We show that our method
outperforms the unimodal approaches and strong multimodal baselines by a large
margin on three crisis-related tasks.
Comment: Conference on Computer Vision and Pattern Recognition (CVPR 2020)
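To illustrate the kind of per-sample filtering such a cross-modal module can perform, the sketch below rescales each modality's features with a gate computed from both modalities, so a weak or misleading modality can be suppressed before fusion. The gating formulation and dimensions are assumptions rather than the paper's exact cross-attention design.

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Minimal sketch: each modality is gated by a signal conditioned on both
    modalities, down-weighting uninformative features on a per-sample basis."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.gate_img = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())
        self.gate_txt = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        v = torch.relu(self.img_proj(img_feat))
        t = torch.relu(self.txt_proj(txt_feat))
        joint = torch.cat([v, t], dim=-1)
        # Gates computed from both modalities rescale each stream before fusion.
        v_gated = self.gate_img(joint) * v
        t_gated = self.gate_txt(joint) * t
        return torch.cat([v_gated, t_gated], dim=-1)   # fused representation

# Usage with dummy image and text features
fuser = CrossModalGate()
fused = fuser(torch.randn(8, 2048), torch.randn(8, 768))   # (8, 1024)
```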