10,402 research outputs found

    Visual Entailment: A Novel Task for Fine-Grained Image Understanding

    Get PDF
    Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/ necla-ml/SNLI-VE

    When Causal Intervention Meets Adversarial Examples and Image Masking for Deep Neural Networks

    Full text link
    Discovering and exploiting the causality in deep neural networks (DNNs) are crucial challenges for understanding and reasoning causal effects (CE) on an explainable visual model. "Intervention" has been widely used for recognizing a causal relation ontologically. In this paper, we propose a causal inference framework for visual reasoning via do-calculus. To study the intervention effects on pixel-level features for causal reasoning, we introduce pixel-wise masking and adversarial perturbation. In our framework, CE is calculated using features in a latent space and perturbed prediction from a DNN-based model. We further provide the first look into the characteristics of discovered CE of adversarially perturbed images generated by gradient-based methods \footnote{~~https://github.com/jjaacckkyy63/Causal-Intervention-AE-wAdvImg}. Experimental results show that CE is a competitive and robust index for understanding DNNs when compared with conventional methods such as class-activation mappings (CAMs) on the Chest X-Ray-14 dataset for human-interpretable feature(s) (e.g., symptom) reasoning. Moreover, CE holds promises for detecting adversarial examples as it possesses distinct characteristics in the presence of adversarial perturbations.Comment: Noted our camera-ready version has changed the title. "When Causal Intervention Meets Adversarial Examples and Image Masking for Deep Neural Networks" as the v3 official paper title in IEEE Proceeding. Please use it in your formal reference. Accepted at IEEE ICIP 2019. Pytorch code has released on https://github.com/jjaacckkyy63/Causal-Intervention-AE-wAdvIm

    Visual Entailment Task for Visually-Grounded Language Learning

    Get PDF
    We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) tasks whereby a premise is defined by an image, rather than a natural language sentence as in TE tasks. A novel dataset SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE) is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA based models perform.Comment: 4 pages, accepted by Visually Grounded Interaction and Language (ViGIL) workshop in NeurIPS 201
    corecore