e-SNLI: Natural Language Inference with Natural Language Explanations
In order for machine learning to garner widespread public adoption, models
must be able to provide interpretable and robust explanations for their
decisions, as well as learn from human-provided explanations at train time. In
this work, we extend the Stanford Natural Language Inference dataset with an
additional layer of human-annotated natural language explanations of the
entailment relations. We further implement models that incorporate these
explanations into their training process and output them at test time. We show
how our corpus of explanations, which we call e-SNLI, can be used for various
goals, such as obtaining full sentence justifications of a model's decisions,
improving universal sentence representations and transferring to out-of-domain
NLI datasets. Our dataset thus opens up a range of research directions for
using natural language explanations, both for improving models and for
establishing trust in them.
Comment: NeurIPS 201
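The training setup described above, in which a model both predicts a label and generates an explanation, can be sketched as a weighted multi-task objective. This is a minimal illustration only; the function names and the weight alpha are assumptions, not the paper's actual implementation.

```python
import math

def explanation_nll(token_probs):
    # Average negative log-likelihood of the gold explanation tokens
    # under the model's predicted probabilities for those tokens.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def joint_loss(label_nll, token_probs, alpha=0.5):
    # Weighted combination of the label-classification loss and the
    # explanation-generation loss; alpha is an assumed hyperparameter.
    return (1 - alpha) * label_nll + alpha * explanation_nll(token_probs)
```

Tuning alpha trades off label accuracy against explanation quality.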
Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder
Medical Visual Question Answering (VQA) systems play a supporting role in
understanding the clinically relevant information carried by medical images.
Questions about a medical image fall into two categories: closed-ended (such as
yes/no questions) and open-ended. To obtain answers, the majority of existing
medical VQA methods rely on classification approaches, while a few works
attempt generation approaches or a mixture of the two. Classification
approaches are relatively simple but perform poorly on long open-ended
questions. To bridge this gap, we propose a new Transformer-based framework for
medical VQA (named Q2ATransformer), which integrates the advantages of both
the classification and the generation approaches and provides a unified
treatment of closed-ended and open-ended questions. Specifically, we introduce
an additional Transformer decoder with a set of learnable candidate answer
embeddings to query the existence of each answer class to a given
image-question pair. Through the Transformer attention, the candidate answer
embeddings interact with the fused features of the image-question pair to make
the decision. In this way, despite being a classification-based approach, our
method provides a mechanism to interact with the answer information for
prediction like the generation-based approaches. On the other hand, by
classification, we mitigate the task difficulty by reducing the search space of
answers. Our method achieves new state-of-the-art performance on two medical
VQA benchmarks. In particular, for open-ended questions, we achieve 79.19% on
VQA-RAD and 54.85% on PathVQA, with 16.09% and 41.45% absolute improvements,
respectively.
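The answer-querying mechanism described above, where learnable candidate-answer embeddings attend over the fused image-question features and are then scored for existence, can be sketched in plain Python. All names, dimensions, and values below are illustrative assumptions, not the paper's actual architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def answer_query_scores(answer_embeds, fused_feats):
    # For each learnable candidate-answer embedding, attend over the fused
    # image-question features, then score the attended vector against the
    # embedding itself to decide whether that answer applies.
    d = len(fused_feats[0])
    scores = []
    for a in answer_embeds:
        logits = [sum(ai * fi for ai, fi in zip(a, f)) / math.sqrt(d)
                  for f in fused_feats]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        weights = [e / z for e in exps]
        pooled = [sum(w * f[j] for w, f in zip(weights, fused_feats))
                  for j in range(d)]
        scores.append(sigmoid(sum(ai * pi for ai, pi in zip(a, pooled))))
    return scores
```

Each score is an independent probability that the corresponding answer class applies, which is what reduces the search space relative to free-form generation.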
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
In this paper, we focus on the problem of Medical Visual Question Answering
(MedVQA), which is crucial in efficiently interpreting medical images with
vital clinic-relevant information. Firstly, we reframe the problem of MedVQA as
a generation task that naturally follows human-machine interaction, and we
propose a generative model for medical visual understanding that aligns
visual information from a pre-trained vision encoder with a large language
model. Secondly, we establish a scalable pipeline to construct a large-scale
medical visual question-answering dataset, named PMC-VQA, which contains 227k
VQA pairs of 149k images that cover various modalities and diseases. Thirdly, we
pre-train our proposed model on PMC-VQA and then fine-tune it on multiple
public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a
large margin. Additionally, we propose a test set that has undergone manual
verification, which is significantly more challenging, even the best models
struggle to solve
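Aligning a vision encoder with a large language model, as described above, is often done by projecting pooled visual features into the language model's embedding space and prepending them as a "soft prompt" before the question tokens. The sketch below is a toy illustration under that assumption; the shapes and function names are not PMC-VQA's real configuration.

```python
def visual_prefix(img_feat, proj_mats):
    # Project a pooled visual feature into language-model embedding space,
    # producing one prefix "token" embedding per projection matrix.
    # proj_mats is a list of matrices, each stored as rows of weights.
    return [[sum(w * f for w, f in zip(row, img_feat)) for row in M]
            for M in proj_mats]

def build_lm_inputs(prefix_embeds, question_embeds):
    # The language model consumes the visual prefix embeddings followed by
    # the embedded question tokens, then generates the answer tokens.
    return prefix_embeds + question_embeds
```

In practice the projection is a learned linear (or MLP) layer trained end to end while the vision encoder and language model stay frozen or are lightly tuned.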
Self-Supervised Vision-Language Pretraining for Medical Visual Question Answering
Medical image visual question answering (VQA) is the task of answering clinical
questions about a given radiographic image, a challenging problem that
requires a model to integrate both vision and language information. To solve
medical VQA problems with a limited amount of training data, the
pretrain-finetune paradigm is widely used to improve model generalization. In
this paper, we propose a self-supervised method that applies masked image
modeling, masked language modeling, image-text matching, and image-text
alignment via contrastive learning (M2I2) for pretraining on a medical
image-caption dataset, and fine-tunes on downstream medical VQA tasks. The
proposed method achieves state-of-the-art performance on all three public
medical VQA datasets. Our code and models are available at
https://github.com/pengfeiliHEU/M2I2.
Comment: 5 pages, 3 figures
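The image-text alignment objective mentioned above is commonly implemented as a symmetric contrastive (InfoNCE-style) loss over matched image/caption pairs. The sketch below illustrates that objective under the assumption that embeddings are already L2-normalized; the temperature value and function names are illustrative, not M2I2's exact settings.

```python
import math

def info_nce(img_embeds, txt_embeds, temperature=0.07):
    # Symmetric contrastive loss: each image should score highest against
    # its own caption among all captions in the batch, and vice versa.
    n = len(img_embeds)

    def xent(anchors, others):
        total = 0.0
        for i, a in enumerate(anchors):
            logits = [sum(x * y for x, y in zip(a, o)) / temperature
                      for o in others]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax at the matching index
        return total / n

    return 0.5 * (xent(img_embeds, txt_embeds) + xent(txt_embeds, img_embeds))
```

Minimizing this loss pulls matched image and caption embeddings together while pushing mismatched pairs apart in the shared space.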
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
The Transformer is a deep neural network architecture that employs a
self-attention mechanism to comprehend the contextual relationships within
sequential data. Unlike conventional neural networks or variants of Recurrent
Neural Networks (RNNs) such as Long Short-Term Memory (LSTM), transformer
models excel at handling long-range dependencies between input sequence
elements and enable parallel
processing. As a result, transformer-based models have attracted substantial
interest among researchers in the field of artificial intelligence. This can be
attributed to their immense potential and remarkable achievements, not only in
Natural Language Processing (NLP) tasks but also in a wide range of domains,
including computer vision, audio and speech processing, healthcare, and the
Internet of Things (IoT). Although several survey papers have been published
highlighting the transformer's contributions in specific fields, architectural
differences, or performance evaluations, there is still a significant absence
of a comprehensive survey paper encompassing its major applications across
various domains. Therefore, we undertook the task of filling this gap by
conducting an extensive survey of proposed transformer models from 2017 to
2022. Our survey encompasses the identification of the top five application
domains for transformer-based models, namely: NLP, Computer Vision,
Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze
the impact of highly influential transformer-based models in these domains and
subsequently classify them based on their respective tasks using a proposed
taxonomy. Our aim is to shed light on the existing potential and future
possibilities of transformers for enthusiastic researchers, thus contributing
to the broader understanding of this groundbreaking technology.
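The self-attention mechanism at the core of the Transformer, as discussed above, can be shown in a minimal dependency-free sketch. This single-head version uses the raw inputs as queries, keys, and values (identity projections), an assumed simplification; real transformers apply learned projections and multiple heads.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    # Minimal single-head self-attention: every position attends to all
    # positions at once, which is what enables parallel processing and
    # direct modeling of long-range dependencies.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out
```

Each output row is a convex combination of all input rows, weighted by scaled dot-product similarity, in contrast to an RNN, which must propagate information step by step.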
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
Visual Question Answering (VQA) is a multi-discipline research task. To
produce the right answer, it requires an understanding of the visual content of
images, the natural language questions, as well as commonsense reasoning over
the information contained in the image and world knowledge. Recently,
large-scale Vision-and-Language Pre-trained Models (VLPMs) have been the
mainstream approach to VQA tasks due to their superior performance. The
standard practice is to fine-tune large-scale VLPMs pre-trained on huge
general-domain datasets using the domain-specific VQA datasets. However, in
reality, the application domain can change over time, necessitating VLPMs to
continually learn and adapt to new domains without forgetting previously
acquired knowledge. Most existing continual learning (CL) research concentrates
on unimodal tasks, whereas a more practical application scenario, i.e., CL on
cross-domain VQA, has not been studied. Motivated by this, we introduce
CL-CrossVQA, a rigorous Continual Learning benchmark for Cross-domain Visual
Question Answering, through which we conduct extensive experiments on 4 VLPMs,
4 CL approaches, and 5 VQA datasets from different domains. In addition, by
probing the forgetting phenomenon of the intermediate layers, we provide
insights into how model architecture affects CL performance, why CL approaches
can help mitigate forgetting in VLPMs to some extent, and how to design CL
approaches suitable for VLPMs in this challenging continual learning
environment. To facilitate future work on CL for cross-domain VQA, we will
release our datasets and code.
Comment: 10 pages, 6 figures
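The forgetting phenomenon probed above is typically quantified from a task-by-task accuracy matrix. The sketch below computes average forgetting, a common continual-learning metric; it is a generic illustration, not the benchmark's exact evaluation code, and the accuracy values in the usage example are made up.

```python
def average_forgetting(acc):
    # acc[t][k]: accuracy on domain k measured after training on domain t
    # (only defined for k <= t). Forgetting for domain k is its best
    # earlier accuracy minus its accuracy after the final domain.
    T = len(acc)
    drops = []
    for k in range(T - 1):
        best = max(acc[t][k] for t in range(k, T - 1))
        drops.append(best - acc[T - 1][k])
    return sum(drops) / len(drops)
```

A CL method that mitigates forgetting in VLPMs should drive this value toward zero while keeping final-task accuracy high.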
Recent, rapid advancement in visual question answering architecture: a review
Understanding visual question answering is going to be crucial for numerous
human activities. However, it presents major challenges at the heart of the
artificial intelligence endeavor. This paper presents an update on the rapid
advancements in visual question answering using images that have occurred in
the last couple of years. Tremendous growth in research on improving visual
question answering system architecture has been published recently, showing the
importance of multimodal architectures. Several points on the benefits of
visual question answering are mentioned in the review paper by Manmadhan et al.
(2020), on which the present article builds, including subsequent updates in
the field.Comment: 11 page