Analyzing the Behavior of Visual Question Answering Models
Recently, a number of deep-learning based models have been proposed for the
task of Visual Question Answering (VQA). The performance of most models is
clustered around 60-70%. In this paper we propose systematic methods to analyze
the behavior of these models as a first step towards recognizing their
strengths and weaknesses, and identifying the most fruitful directions for
progress. We analyze two models, one each from two major classes of VQA models
-- with-attention and without-attention -- and show the similarities and
differences in the behavior of these models. We also analyze the winning entry
of the VQA Challenge 2016.
Our behavior analysis reveals that despite recent progress, today's VQA
models are "myopic" (tend to fail on sufficiently novel instances), often "jump
to conclusions" (converge on a predicted answer after 'listening' to just half
the question), and are "stubborn" (do not change their answers across images).
Comment: 13 pages, 20 figures; To appear in EMNLP 201
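The "jump to conclusions" behavior above can be probed with a simple harness: compare a model's answer on the full question against its answer on just the first half. This is a minimal sketch of that probe, not the paper's actual analysis code; `vqa_model` is a hypothetical callable and `toy_model` is a deliberately shallow stand-in.

```python
# Sketch of a half-question probe: does the model's answer converge after
# 'listening' to only half the question? The model interface is assumed.

def truncate_to_half(question: str) -> str:
    """Keep only the first half of the question's tokens."""
    tokens = question.split()
    return " ".join(tokens[: max(1, len(tokens) // 2)])

def converges_early(vqa_model, image, question: str) -> bool:
    """True if the model gives the same answer after seeing half the question."""
    full_answer = vqa_model(image, question)
    half_answer = vqa_model(image, truncate_to_half(question))
    return full_answer == half_answer

# Toy stand-in model that ignores the image and keys off the question prefix,
# illustrating the failure mode the paper describes.
def toy_model(image, question):
    return "yes" if question.lower().startswith("is") else "2"

print(converges_early(toy_model, None, "is the man wearing a hat"))  # True
```

Running this probe over a question set and reporting the fraction of early-converging answers gives a single number for the "jump to conclusions" tendency.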
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
A number of studies have found that today's Visual Question Answering (VQA)
models are heavily driven by superficial correlations in the training data and
lack sufficient image grounding. To encourage development of models geared
towards the latter, we propose a new setting for VQA where for every question
type, train and test sets have different prior distributions of answers.
Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we
call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2
respectively). First, we evaluate several existing VQA models under this new
setting and show that their performance degrades significantly compared to the
original VQA setting. Second, we propose a novel Grounded Visual Question
Answering model (GVQA) that contains inductive biases and restrictions in the
architecture specifically designed to prevent the model from 'cheating' by
primarily relying on priors in the training data. Specifically, GVQA explicitly
disentangles the recognition of visual concepts present in the image from the
identification of the plausible answer space for a given question, enabling the
model to more robustly generalize across different distributions of answers.
GVQA is built off an existing VQA model -- Stacked Attention Networks (SAN).
Our experiments demonstrate that GVQA significantly outperforms SAN on both
VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more
powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in
several cases. GVQA offers strengths complementary to SAN when trained and
evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more
transparent and interpretable than existing VQA models.
Comment: 15 pages, 10 figures. To appear in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201
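The core idea of a changing-priors split is that, within each question type, the answer distribution seen at training time should differ from the one at test time. The following is a simplified illustration of that idea on toy data; it is not the actual VQA-CP construction, and the alternating assignment is an assumption made for the sketch.

```python
# Simplified illustration of a changing-priors split in the spirit of VQA-CP:
# group examples by (question type, answer), then send each group entirely to
# either train or test, so no answer's prior for a type is shared across splits.
from collections import defaultdict

def changing_priors_split(examples):
    """examples: list of (question_type, answer) pairs."""
    by_type_answer = defaultdict(list)
    for qtype, answer in examples:
        by_type_answer[(qtype, answer)].append((qtype, answer))
    train, test = [], []
    # Alternate (type, answer) groups between the splits; a model that memorizes
    # training priors for a question type then fails on the test distribution.
    for i, key in enumerate(sorted(by_type_answer)):
        (train if i % 2 == 0 else test).extend(by_type_answer[key])
    return train, test

data = [("what color", "red"), ("what color", "blue"),
        ("how many", "2"), ("how many", "3")]
train, test = changing_priors_split(data)
```

Under this split, a prior-driven model trained on `train` cannot succeed on `test` by answer frequency alone, which is the evaluation pressure the VQA-CP setting applies.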
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021) and
UMIC (Lee et al., 2021) have been proposed for automatic evaluation of image
captions, demonstrating a high correlation with human judgment. In this work,
our focus lies in evaluating the robustness of these metrics in scenarios that
require distinguishing between two captions with high lexical overlap but very
different meanings. Our findings reveal that despite their high correlation
with human judgment, both CLIPScore and UMIC struggle to identify fine-grained
errors in captions. However, when comparing different types of fine-grained
errors, both metrics exhibit limited sensitivity to implausibility of captions
and strong sensitivity to lack of sufficient visual grounding. Probing further
into the visual grounding aspect, we found that both CLIPScore and UMIC are
impacted by the size of image-relevant objects mentioned in the caption, and
that CLIPScore is also sensitive to the number of mentions of image-relevant
objects in the caption. In terms of linguistic aspects of a caption, we found
that both metrics lack the ability to comprehend negation, UMIC is sensitive to
caption lengths, and CLIPScore is insensitive to the structure of the sentence.
We hope our findings will serve as a valuable guide towards improving
reference-free evaluation in image captioning.
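A reference-free score of the CLIPScore family is essentially a rescaled, clipped cosine similarity between an image embedding and a caption embedding (Hessel et al., 2021 use a rescaling weight of 2.5). The sketch below shows the scoring formula only; real use requires a CLIP encoder, so the embeddings here are toy vectors.

```python
# Sketch of a CLIPScore-style reference-free caption score:
# w * max(cos(image_emb, caption_emb), 0). Embeddings are placeholders.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def clipscore_like(image_emb, caption_emb, w=2.5):
    """Negative similarities are floored at zero before rescaling."""
    return w * max(cosine(image_emb, caption_emb), 0.0)

print(clipscore_like([1.0, 0.0], [0.8, 0.6]))   # 2.0
print(clipscore_like([1.0, 0.0], [-1.0, 0.0]))  # 0.0
```

Because the score collapses a caption to a single similarity number, two captions with high lexical overlap but different meanings can land very close together, which is the robustness gap the paper examines.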
Teledentistry: A Boon in Indian Scenario
An amalgamation of telecommunication and dentistry is known as 'Teledentistry', which involves the exchange of clinical information with remote areas for diagnosis, consultation, health education and treatment planning. Teledentistry has increased the accessibility of dental care at low cost for all people. It also has immense potential to overcome disparities in oral healthcare between rural and urban populations. Thus, the aim of this review article is to establish the essential role of Teledentistry in the Indian scenario. The literature for this review was obtained from published articles, online manuals and books.
Multi Party Distributed Private Matching, Set Disjointness and Cardinality Set Intersection with Information Theoretic Security
In this paper, we focus on the specific problems of Private Matching, Set Disjointness and Cardinality Set Intersection in information-theoretic settings. Specifically, we give perfectly secure protocols for the above problems in n-party settings, tolerating a computationally unbounded semi-honest adversary who can passively corrupt at most t < n/2 parties. To the best of our knowledge, these are the first such information-theoretically secure protocols in a multi-party setting for all three problems. Previous solutions for Distributed Private Matching and Cardinality Set Intersection were cryptographically secure, and the previous Set Disjointness solution, though information-theoretically secure, is in a two-party setting. We also propose a new model for Distributed Private Matching which is relevant in a multi-party setting.
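Information-theoretic multi-party protocols with a t < n/2 corruption threshold are typically built on Shamir secret sharing over a prime field: a secret is hidden in the constant term of a random degree-t polynomial, any t+1 shares reconstruct it, and t or fewer shares reveal nothing. This sketches only that standard building block, not the paper's actual protocols.

```python
# Shamir secret sharing over GF(P), the standard primitive underlying
# information-theoretically secure multi-party computation.
import random

P = 2**31 - 1  # a prime modulus defining the field

def share(secret, n, t):
    """Split `secret` into n shares; any t+1 of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

n, t = 5, 2                          # five parties, tolerating t < n/2 corruptions
shares = share(42, n, t)
print(reconstruct(shares[: t + 1]))  # 42
```

On top of sharing like this, parties can add shares locally and multiply them with one round of resharing, which is how set-membership tests such as private matching are evaluated without revealing inputs.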