Focal Visual-Text Attention for Visual Question Answering
Recent insights on language and vision with neural networks have been
successfully applied to simple single-image visual question answering. However,
to tackle real-life question answering problems on multimedia collections such
as personal photos, we have to look at whole collections with sequences of
photos or videos. When answering questions from a large collection, a natural
problem is to identify snippets to support the answer. In this paper, we
describe a novel neural network called Focal Visual-Text Attention network
(FVTA) for collective reasoning in visual question answering, where both visual
and text sequence information, such as images and text metadata, is presented.
FVTA introduces an end-to-end approach that makes use of a hierarchical process
to dynamically determine what media and what time to focus on in the sequential
data to answer the question. FVTA not only answers questions well but also
provides the justifications on which its answers are based. FVTA achieves
state-of-the-art performance on the MemexQA dataset
and competitive results on the MovieQA dataset.
Comment: In CVPR 2018. Code, models and dataset are available here:
https://memexqa.cs.cmu.edu
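
A minimal, illustrative sketch of the general idea described in the abstract: attend within each modality's time steps and then across modalities, conditioned on the question. This is an assumption-laden toy version, not the authors' FVTA implementation; all function names and dimensions are hypothetical.

```python
# Hedged sketch of hierarchical (within-time, then across-modality) attention.
import torch
import torch.nn.functional as F

def hierarchical_focus(question, visual_seq, text_seq):
    """question: (d,); visual_seq, text_seq: (T, d) per-timestep features."""
    modal_summaries, modal_scores = [], []
    for seq in (visual_seq, text_seq):
        # Within-modality (temporal) attention: which time steps matter?
        t_weights = F.softmax(seq @ question, dim=0)   # (T,)
        summary = t_weights @ seq                      # (d,) attended summary
        modal_summaries.append(summary)
        modal_scores.append(summary @ question)
    # Across-modality attention: which medium (photos vs. text) to focus on?
    m_weights = F.softmax(torch.stack(modal_scores), dim=0)             # (2,)
    fused = (m_weights.unsqueeze(1) * torch.stack(modal_summaries)).sum(0)
    return fused, m_weights

if __name__ == "__main__":
    d, T = 16, 5
    q = torch.randn(d)
    fused, weights = hierarchical_focus(q, torch.randn(T, d), torch.randn(T, d))
    print(fused.shape, weights)   # torch.Size([16]), modality focus weights
```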
Progressive Attention Memory Network for Movie Story Question Answering
This paper proposes the progressive attention memory network (PAMN) for movie
story question answering (QA). Movie story QA is challenging compared to VQA in
two aspects: (1) pinpointing the temporal parts relevant to answering the question
is difficult, as the movies are typically longer than an hour, and (2) it involves both
video and subtitles, where different questions require different modalities to
infer the answer. To overcome these challenges, PAMN involves three main
features: (1) progressive attention mechanism that utilizes cues from both
question and answer to progressively prune out irrelevant temporal parts in
memory, (2) dynamic modality fusion that adaptively determines the contribution
of each modality for answering the current question, and (3) belief correction
answering scheme that successively corrects the prediction score on each
candidate answer. Experiments on publicly available benchmark datasets, MovieQA
and TVQA, demonstrate that each feature contributes to our movie story QA
architecture, PAMN, and improves performance to achieve state-of-the-art
results. A qualitative analysis that visualizes the inference mechanism of PAMN is
also provided.
Comment: CVPR 2019, Accepted
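
To make the "dynamic modality fusion" idea concrete, here is a hedged toy sketch: a question-dependent gate weights the answer scores coming from video and subtitle memories. It is an assumed formulation for illustration only, not PAMN's actual module; class and variable names are hypothetical.

```python
# Illustrative question-conditioned gating over per-modality answer scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicModalityFusion(nn.Module):
    def __init__(self, q_dim, num_modalities=2):
        super().__init__()
        self.gate = nn.Linear(q_dim, num_modalities)  # question -> modality weights

    def forward(self, question_vec, per_modality_scores):
        # question_vec: (B, q_dim); per_modality_scores: (B, M, num_answers)
        w = F.softmax(self.gate(question_vec), dim=-1)          # (B, M)
        fused = (w.unsqueeze(-1) * per_modality_scores).sum(1)  # (B, num_answers)
        return fused, w

if __name__ == "__main__":
    B, qd, A = 4, 32, 5
    fuser = DynamicModalityFusion(qd)
    scores = torch.randn(B, 2, A)          # video and subtitle candidate scores
    fused, w = fuser(torch.randn(B, qd), scores)
    print(fused.shape, w.shape)            # (4, 5) fused scores, (4, 2) weights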
Holistic Multi-modal Memory Network for Movie Question Answering
Answering questions according to multi-modal context is a challenging problem
as it requires a deep integration of different data sources. Existing
approaches only employ partial interactions among data sources in one attention
hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN)
framework which fully considers the interactions between different input
sources (multi-modal context, question) in each hop. In addition, it takes
answer choices into consideration during the context retrieval stage.
Therefore, the proposed framework effectively integrates multi-modal context,
question, and answer information, which leads to more informative context
retrieved for question answering. Our HMMN framework achieves state-of-the-art
accuracy on the MovieQA dataset. Extensive ablation studies show the importance of
holistic reasoning and the contributions of different attention strategies.
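
The following is a small, hedged sketch of the general idea of answer-aware context retrieval in a single hop: score each context element against the question and a candidate answer together. It is not the HMMN architecture; the fusion (simple addition) and all names are illustrative assumptions.

```python
# Toy single-hop retrieval conditioned on question + answer choice.
import torch
import torch.nn.functional as F

def answer_aware_retrieval(context, question, answer_choice):
    """context: (N, d) multi-modal context items; question, answer_choice: (d,)."""
    query = question + answer_choice              # fuse question with a candidate
    weights = F.softmax(context @ query, dim=0)   # (N,) relevance of each item
    retrieved = weights @ context                 # (d,) answer-specific context
    return retrieved @ answer_choice              # support score for this candidate

if __name__ == "__main__":
    d, N = 16, 8
    ctx, q = torch.randn(N, d), torch.randn(d)
    scores = torch.stack([answer_aware_retrieval(ctx, q, torch.randn(d))
                          for _ in range(5)])     # five candidate answers
    print(scores.argmax().item())                 # predicted answer index
```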
Interpretable Counting for Visual Question Answering
Questions that require counting a variety of objects in images remain a major
challenge in visual question answering (VQA). The most common approaches to VQA
involve either classifying answers based on fixed length representations of
both the image and question or summing fractional counts estimated from each
section of the image. In contrast, we treat counting as a sequential decision
process and force our model to make discrete choices of what to count.
Specifically, the model sequentially selects from detected objects and learns
interactions between objects that influence subsequent selections. A
distinction of our approach is its intuitive and interpretable output, as
discrete counts are automatically grounded in the image. Furthermore, our
method outperforms the state-of-the-art architecture for VQA on multiple
metrics that evaluate counting.
Comment: ICLR 2018
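
A minimal sketch of counting as a sequential decision process, under my own assumptions rather than the authors' model: at each step the model makes a discrete choice of one detected object (or STOP), and the count is simply the number of objects selected, which keeps the output grounded and interpretable.

```python
# Toy sequential, discrete counting over detected objects.
import torch
import torch.nn.functional as F

def sequential_count(object_feats, question, max_steps=10):
    """object_feats: (N, d) detected-object features; question: (d,)."""
    N, _ = object_feats.shape
    selected = torch.zeros(N, dtype=torch.bool)
    count = 0
    for _ in range(max_steps):
        scores = object_feats @ question                       # (N,) relevance
        scores = scores.masked_fill(selected, float("-inf"))   # no double counting
        stop_score = torch.tensor(0.0)                         # learned in a real model
        probs = F.softmax(torch.cat([scores, stop_score.view(1)]), dim=0)
        choice = int(probs.argmax())
        if choice == N:                                        # STOP action chosen
            break
        selected[choice] = True                                # grounded, discrete choice
        count += 1
    return count, selected

if __name__ == "__main__":
    feats, q = torch.randn(6, 16), torch.randn(16)
    c, sel = sequential_count(feats, q)
    print(c, sel.nonzero().flatten().tolist())   # count and selected object indices
```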
ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases
The chest X-ray is one of the most commonly accessible radiological
examinations for screening and diagnosis of many lung diseases. A tremendous
number of X-ray imaging studies accompanied by radiological reports are
accumulated and stored in many modern hospitals' Picture Archiving and
Communication Systems (PACS). On the other hand, it remains an open question
how this type of hospital-scale knowledge database containing invaluable imaging
informatics (i.e., loosely labeled data) can be used to facilitate the data-hungry
deep learning paradigms in building truly large-scale high precision
computer-aided diagnosis (CAD) systems.
In this paper, we present a new chest X-ray database, namely "ChestX-ray8",
which comprises 108,948 frontal-view X-ray images of 32,717 unique patients
with eight text-mined disease image labels (where each image can have
multiple labels), extracted from the associated radiological reports using
natural language processing. Importantly, we demonstrate that these commonly occurring thoracic
diseases can be detected and even spatially-located via a unified
weakly-supervised multi-label image classification and disease localization
framework, which is validated using our proposed dataset. Although the initial
quantitative results are promising as reported, deep convolutional neural
network based "reading chest X-rays" (i.e., recognizing and locating the common
disease patterns trained with only image-level labels) remains a strenuous task
for fully-automated high precision CAD systems. Data download link:
https://nihcc.app.box.com/v/ChestXray-NIHCC
Comment: CVPR 2017 spotlight; V1: CVPR submission + supplementary; V2: statistics
and benchmark results on the published ChestX-ray14 dataset are updated in
Appendix B; V3: minor corrections; V4: new data download link updated:
https://nihcc.app.box.com/v/ChestXray-NIHCC; V5: updated benchmark results on
the published data split in the appendix
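
As a hedged illustration of the weakly-supervised multi-label setup described above (not the ChestX-ray8 reference code), the sketch below trains a small CNN with only image-level labels for the eight diseases; per-class activation maps give a rough localization signal. Layer sizes and names are assumptions.

```python
# Toy weakly-supervised multi-label classification with class activation maps.
import torch
import torch.nn as nn

NUM_DISEASES = 8  # the eight text-mined disease labels

class WeaklySupervisedCXR(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a pretrained CNN
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(64, NUM_DISEASES, 1)  # 1x1 conv -> per-class maps

    def forward(self, x):
        maps = self.classifier(self.backbone(x))    # (B, 8, H, W) activation maps
        logits = maps.flatten(2).max(dim=2).values  # image-level score per disease
        return logits, maps                         # labels supervise; maps localize

if __name__ == "__main__":
    model = WeaklySupervisedCXR()
    images = torch.randn(2, 1, 64, 64)              # toy frontal-view X-rays
    labels = torch.randint(0, 2, (2, NUM_DISEASES)).float()  # multi-label targets
    logits, maps = model(images)
    loss = nn.BCEWithLogitsLoss()(logits, labels)   # image-level multi-label loss
    print(loss.item(), maps.shape)
```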
Revisiting EmbodiedQA: A Simple Baseline and Beyond
In Embodied Question Answering (EmbodiedQA), an agent interacts with an
environment to gather necessary information for answering user questions.
Existing works have laid a solid foundation towards solving this interesting
problem. But the current performance, especially in navigation, suggests that
EmbodiedQA might be too challenging for the contemporary approaches. In this
paper, we empirically study this problem and introduce 1) a simple yet
effective baseline that achieves promising performance; 2) an easier and
practical setting for EmbodiedQA in which an agent has a chance to adapt the
trained model to a new environment before it actually answers user questions.
In this new setting, we randomly place a few objects in new environments, and
upgrade the agent policy by a distillation network to retain the generalization
ability from the trained model. On the EmbodiedQA v1 benchmark, under the
standard setting, our simple baseline achieves results very competitive with the
state of the art; in the new setting, we find that the introduced small change
in setting yields a notable gain in navigation.
Comment: Accepted to IEEE Transactions on Image Processing (TIP)
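
A minimal sketch, under my own assumptions about the setup, of adapting a policy to a new environment while retaining generalization via distillation: the adapted policy fits supervision collected in the new environment but is also kept close to the frozen trained policy. Loss weights and action space are hypothetical.

```python
# Toy adaptation objective: new-environment cross-entropy + distillation to teacher.
import torch
import torch.nn.functional as F

def adaptation_loss(student_logits, teacher_logits, new_env_actions, alpha=0.5, T=2.0):
    """Cross-entropy on new-environment actions plus KL distillation to the teacher."""
    ce = F.cross_entropy(student_logits, new_env_actions)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kl

if __name__ == "__main__":
    B, A = 8, 4                                    # batch of states, 4 nav actions
    student = torch.randn(B, A, requires_grad=True)
    teacher = torch.randn(B, A)                    # frozen pre-trained policy
    actions = torch.randint(0, A, (B,))            # supervision in the new environment
    loss = adaptation_loss(student, teacher, actions)
    loss.backward()
    print(loss.item())
```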
MCQA: Multimodal Co-attention Based Network for Question Answering
We present MCQA, a learning-based algorithm for multimodal question
answering. MCQA explicitly fuses and aligns the multimodal input (i.e. text,
audio, and video), which forms the context for the query (question and answer).
Our approach fuses and aligns the question and the answer within this context.
Moreover, we use the notion of co-attention to perform cross-modal alignment
and multimodal context-query alignment. Our context-query alignment module
matches the relevant parts of the multimodal context and the query with each
other and aligns them to improve the overall performance. We evaluate the
performance of MCQA on Social-IQ, a benchmark dataset for multimodal question
answering. We compare the performance of our algorithm with prior methods and
observe an accuracy improvement of 4-7%.
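
The sketch below illustrates the general co-attention pattern for context-query alignment (a similarity matrix attended in both directions). It is an assumed toy formulation, not MCQA's module; names and dimensions are illustrative.

```python
# Toy bidirectional co-attention between a multimodal context and a query.
import torch
import torch.nn.functional as F

def co_attention(context, query):
    """context: (N, d) fused multimodal features; query: (M, d) question+answer."""
    sim = context @ query.T                          # (N, M) similarity matrix
    ctx_aligned = F.softmax(sim, dim=1) @ query      # context attends to query
    qry_aligned = F.softmax(sim, dim=0).T @ context  # query attends to context
    return ctx_aligned, qry_aligned

if __name__ == "__main__":
    ctx, qry = torch.randn(10, 32), torch.randn(6, 32)
    c2q, q2c = co_attention(ctx, qry)
    print(c2q.shape, q2c.shape)                      # (10, 32) (6, 32)
```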
Visual Question Answering using Deep Learning: A Survey and Performance Analysis
The Visual Question Answering (VQA) task combines the challenges of visual and
linguistic processing to answer basic 'common sense' questions about given
images. Given an image and a question in natural language, a VQA system tries
to find the correct answer using visual elements of the image and inference
drawn from the textual question.
survey, we cover and discuss the recent datasets released in the VQA domain,
dealing with various question formats and the robustness of the machine-learning
models. Next, we discuss new deep learning models that have shown promising
results on the VQA datasets. Finally, we present and discuss some of the results
we computed with the vanilla VQA model, the Stacked Attention Network, and the
VQA Challenge 2017 winner model, along with a detailed analysis of the
challenges and future research directions.
Comment: Accepted to the Fifth IAPR International Conference on Computer Vision
and Image Processing (CVIP), 2020
Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View
Recent studies have pointed out that many well-developed Visual Question
Answering (VQA) models are heavily affected by the language prior problem,
which refers to making predictions based on the co-occurrence pattern between
textual questions and answers instead of reasoning visual contents. To tackle
it, most existing methods focus on enhancing visual feature learning to reduce
this superficial textual shortcut influence on VQA model decisions. However,
limited effort has been devoted to providing an explicit interpretation for its
inherent cause. This leaves the research community without good guidance for
moving forward in a purposeful way, making it unclear how to construct models
that overcome this non-trivial problem. In this paper, we propose to interpret the
language prior problem in VQA from a class-imbalance view. Concretely, we
design a novel interpretation scheme whereby the loss of mis-predicted frequent
and sparse answers of the same question type is distinctly exhibited during the
late training phase. It explicitly reveals why the VQA model tends to produce a
frequent yet obviously wrong answer to a given question whose right answer is
sparse in the training set. Based upon this observation, we further develop a
novel loss re-scaling approach to assign different weights to each answer based
on the training data statistics for computing the final loss. We apply our
approach into three baselines and the experimental results on two VQA-CP
benchmark datasets evidently demonstrate its effectiveness. In addition, we
also justify the validity of the class imbalance interpretation scheme on other
computer vision tasks, such as face recognition and image classification
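
A minimal sketch of loss re-scaling against answer imbalance, with assumed details rather than the paper's exact scheme: answers that are frequent in the training set are down-weighted and sparse answers are up-weighted when computing the final loss.

```python
# Toy frequency-based per-answer loss re-weighting for VQA classification.
import torch
import torch.nn.functional as F

def rescaled_vqa_loss(logits, targets, answer_counts):
    """logits: (B, A); targets: (B,); answer_counts: (A,) training-set frequencies."""
    freq = answer_counts.float() / answer_counts.sum()
    weights = 1.0 / (freq + 1e-6)                  # rare answers weigh more
    weights = weights / weights.mean()             # keep the loss scale comparable
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (weights[targets] * per_sample).mean()

if __name__ == "__main__":
    B, A = 16, 10
    logits = torch.randn(B, A, requires_grad=True)
    targets = torch.randint(0, A, (B,))
    counts = torch.randint(1, 1000, (A,))           # answer frequency statistics
    loss = rescaled_vqa_loss(logits, targets, counts)
    loss.backward()
    print(loss.item())
```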
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
Joint image-text embedding extracted from medical images and associated
contextual reports is the bedrock for most biomedical vision-and-language (V+L)
tasks, including medical visual question answering, clinical image-text
retrieval, and clinical report auto-generation. In this study, we adopt four
pre-trained V+L models, LXMERT, VisualBERT, UNITER, and PixelBERT, to learn
multimodal representation from MIMIC-CXR radiographs and associated reports.
The extrinsic evaluation on the OpenI dataset shows that, in comparison to the
pioneering CNN-RNN model, the joint embeddings learned by pre-trained V+L models
yield a performance improvement on the thoracic findings classification
task. We conduct an ablation study to analyze the contribution of certain model
components and validate the advantage of joint embedding over text-only
embedding. We also visualize attention maps to illustrate the attention
mechanism of V+L models.
Comment: 10 pages, 3 figures, submitted to BIBM 2020
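
As a hedged illustration of the extrinsic evaluation idea (not the study's pipeline), the sketch below takes a joint image-text embedding from a pre-trained V+L encoder and trains a light multi-label head for thoracic findings; the number of findings and the embedding dimension are assumptions.

```python
# Toy downstream multi-label head on top of a joint image-text embedding.
import torch
import torch.nn as nn

NUM_FINDINGS = 14                                  # illustrative finding count

class FindingsHead(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.fc = nn.Linear(embed_dim, NUM_FINDINGS)

    def forward(self, joint_embedding):            # (B, embed_dim) from a V+L model
        return self.fc(joint_embedding)            # multi-label logits

if __name__ == "__main__":
    B, D = 4, 768                                  # typical transformer hidden size
    head = FindingsHead(D)
    joint = torch.randn(B, D)                      # stand-in for a V+L model's output
    labels = torch.randint(0, 2, (B, NUM_FINDINGS)).float()
    loss = nn.BCEWithLogitsLoss()(head(joint), labels)
    print(loss.item())
```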