    Analyzing the Behavior of Visual Question Answering Models

    Recently, a number of deep-learning-based models have been proposed for the task of Visual Question Answering (VQA). The performance of most models is clustered around 60-70%. In this paper we propose systematic methods to analyze the behavior of these models as a first step towards recognizing their strengths and weaknesses, and identifying the most fruitful directions for progress. We analyze two models, one each from two major classes of VQA models -- with-attention and without-attention -- and show the similarities and differences in the behavior of these models. We also analyze the winning entry of the VQA Challenge 2016. Our behavior analysis reveals that despite recent progress, today's VQA models are "myopic" (tend to fail on sufficiently novel instances), often "jump to conclusions" (converge on a predicted answer after 'listening' to just half the question), and are "stubborn" (do not change their answers across images). Comment: 13 pages, 20 figures; To appear in EMNLP 201

    Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

    A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of the plausible answer space for a given question, enabling the model to generalize more robustly across different distributions of answers. GVQA is built off an existing VQA model -- Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models. Comment: 15 pages, 10 figures. To appear in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201
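
    The "changing priors" setting in this abstract can be illustrated with a minimal sketch that measures how far the per-question-type answer distribution differs between a train and a test split. This is not the authors' splitting procedure: the function names, the toy data, and the use of total variation distance as the shift measure are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def answer_priors(examples):
    """Return {question_type: {answer: probability}} from (question_type, answer) pairs."""
    counts = defaultdict(Counter)
    for question_type, answer in examples:
        counts[question_type][answer] += 1
    return {
        qtype: {ans: n / sum(c.values()) for ans, n in c.items()}
        for qtype, c in counts.items()
    }


def prior_shift(train, test):
    """Total variation distance between train and test answer priors, per question type.

    0.0 means identical priors; 1.0 means the two splits share no answers.
    (Total variation is an illustrative choice, not taken from the paper.)
    """
    p_train, p_test = answer_priors(train), answer_priors(test)
    shift = {}
    for qtype in p_train.keys() & p_test.keys():
        answers = p_train[qtype].keys() | p_test[qtype].keys()
        shift[qtype] = 0.5 * sum(
            abs(p_train[qtype].get(a, 0.0) - p_test[qtype].get(a, 0.0))
            for a in answers
        )
    return shift


# Hypothetical toy splits: the dominant answer for each question type differs.
train = [("what color", "white")] * 3 + [("what color", "red")] + [("is there", "yes")] * 2
test = [("what color", "red")] * 3 + [("what color", "white")] + [("is there", "no")] * 2

shift = prior_shift(train, test)
for qtype in sorted(shift):
    print(qtype, round(shift[qtype], 3))
```

    A model that only memorizes the most frequent training answer per question type would do poorly on a test split with a large shift, which is the failure mode the setting is meant to expose.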

    An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics

    Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021) and UMIC (Lee et al., 2021) have been proposed for automatic evaluation of image captions, demonstrating a high correlation with human judgment. In this work, our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that despite their high correlation with human judgment, both CLIPScore and UMIC struggle to identify fine-grained errors in captions. However, when comparing different types of fine-grained errors, both metrics exhibit limited sensitivity to implausibility of captions and strong sensitivity to lack of sufficient visual grounding. Probing further into the visual grounding aspect, we found that both CLIPScore and UMIC are impacted by the size of image-relevant objects mentioned in the caption, and that CLIPScore is also sensitive to the number of mentions of image-relevant objects in the caption. In terms of linguistic aspects of a caption, we found that both metrics lack the ability to comprehend negation, UMIC is sensitive to caption lengths, and CLIPScore is insensitive to the structure of the sentence. We hope our findings will serve as a valuable guide towards improving reference-free evaluation in image captioning
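
    For context, CLIPScore as defined by Hessel et al. (2021) is a rescaled, clipped cosine similarity between the CLIP embeddings of the caption and the image, w * max(cos(c, v), 0) with w = 2.5. The sketch below shows only that final scoring step; the embedding vectors are hypothetical stand-ins, and obtaining real embeddings from a CLIP model is outside its scope.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def clip_score(caption_emb, image_emb, w=2.5):
    """Scoring step of CLIPScore: w * max(cos(c, v), 0).

    caption_emb and image_emb are assumed to be CLIP text and image
    embeddings; here they are plain lists of floats for illustration.
    """
    return w * max(cosine(caption_emb, image_emb), 0.0)


# Hypothetical 2-d embeddings, chosen only to exercise the formula.
aligned = clip_score([1.0, 0.0], [1.0, 0.0])
opposed = clip_score([1.0, 0.0], [-1.0, 0.0])
partial = clip_score([1.0, 0.0], [1.0, 1.0])
```

    Because the score depends only on a single embedding similarity, two captions with high lexical overlap can plausibly land close together in embedding space, which is consistent with the fine-grained failures the abstract reports.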

    Teledentistry: A Boon in Indian Scenario

    The amalgamation of telecommunication and dentistry is known as ‘Teledentistry’, which involves the exchange of clinical information across remote areas for diagnosis, consultation, health education and treatment planning. Teledentistry has increased the accessibility of low-cost dental care for all people. It also has immense potential to overcome the disparities in oral healthcare between rural and urban populations. The aim of this review article is therefore to establish the essential role of Teledentistry in the Indian scenario. The literature for this review was obtained from published articles, online manuals and books.

    Multi Party Distributed Private Matching, Set Disjointness and Cardinality Set Intersection with Information Theoretic Security

    In this paper, we focus on the specific problems of Private Matching, Set Disjointness and Cardinality Set Intersection in information theoretic settings. Specifically, we give perfectly secure protocols for the above problems in n party settings, tolerating a computationally unbounded semi-honest adversary, who can passively corrupt at most t < n/2 parties. To the best of our knowledge, these are the first such information theoretically secure protocols in a multi-party setting for all three problems. Previous solutions for Distributed Private Matching and Cardinality Set Intersection were cryptographically secure and the previous Set Disjointness solution, though information theoretically secure, is in a two party setting. We also propose a new model for Distributed Private Matching which is relevant in a multi-party setting.