Learning Convolutional Text Representations for Visual Question Answering
Visual question answering is a recently proposed artificial intelligence task
that requires a deep understanding of both images and texts. In deep learning,
images are typically modeled through convolutional neural networks, and texts
are typically modeled through recurrent neural networks. While the requirement
for modeling images is similar to traditional computer vision tasks, such as
object recognition and image classification, visual question answering raises a
different need for textual representation as compared to other natural language
processing tasks. In this work, we perform a detailed analysis of the natural
language questions in visual question answering. Based on this analysis, we
propose to rely on convolutional neural networks for learning textual
representations. By exploring the various properties of convolutional neural
networks specialized for text data, such as width and depth, we present our
"CNN Inception + Gate" model. We show that our model improves question
representations and thus the overall accuracy of visual question answering
models. We also show that the text representation requirement in visual
question answering is more complicated and comprehensive than that in
conventional natural language processing tasks, making it a better task to
evaluate textual representation methods. Shallow models such as fastText, which
achieve results comparable to deep learning models on tasks like text
classification, are not well suited to visual question answering.
Comment: Conference paper at SDM 2018. https://github.com/divelab/sva
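To make the idea concrete, below is a minimal sketch of an inception-style convolutional text encoder with a gating layer, written in PyTorch. The layer sizes, kernel widths, and sigmoid gate are illustrative assumptions; the abstract does not specify the exact configuration of the authors' "CNN Inception + Gate" model.

```python
import torch
import torch.nn as nn

class GatedInceptionTextEncoder(nn.Module):
    """Sketch: parallel convolutions of several widths over word embeddings,
    max-pooled, concatenated, and modulated by a learned gate."""

    def __init__(self, vocab_size, embed_dim=300, num_filters=128,
                 kernel_sizes=(1, 2, 3)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Parallel convolutions of different widths ("inception" branches).
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, k, padding=k // 2)
            for k in kernel_sizes
        ])
        branch_dim = num_filters * len(kernel_sizes)
        # Gate that rescales the concatenated branch outputs.
        self.gate = nn.Linear(branch_dim, branch_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices.
        x = self.embed(token_ids).transpose(1, 2)        # (batch, embed, seq)
        branches = [torch.relu(conv(x)) for conv in self.convs]
        h = torch.cat([b.max(dim=2).values for b in branches], dim=1)
        return h * torch.sigmoid(self.gate(h))           # gated question feature


# Example: encode a batch of two padded questions of length 8.
encoder = GatedInceptionTextEncoder(vocab_size=10000)
questions = torch.randint(0, 10000, (2, 8))
features = encoder(questions)  # (2, 384) question representations
```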
Generating Natural Questions About an Image
There has been an explosion of work in the vision & language community during
the past few years, from image captioning to video transcription and answering
questions about images. These tasks have focused on literal descriptions of the
image. To move beyond the literal, we choose to explore how questions about an
image are often directed at commonsense inference and the abstract events
evoked by objects in the image. In this paper, we introduce the novel task of
Visual Question Generation (VQG), where the system is tasked with asking a
natural and engaging question when shown an image. We provide three datasets
which cover a variety of images from object-centric to event-centric, with
considerably more abstract training data than provided to state-of-the-art
captioning systems thus far. We train and test several generative and retrieval
models to tackle the task of VQG. Evaluation results show that while such
models ask reasonable questions for a variety of images, there is still a wide
gap with human performance which motivates further work on connecting images
with commonsense knowledge and pragmatics. Our proposed task offers a new
challenge to the community which we hope furthers interest in exploring deeper
connections between vision & language.
Comment: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics
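As a rough illustration of the generative side of this task, the sketch below conditions a recurrent decoder on pretrained image features and trains it to emit a question word by word. The GRU decoder, layer sizes, and teacher-forcing setup are assumptions for illustration, not the paper's exact models.

```python
import torch
import torch.nn as nn

class ImageToQuestionGenerator(nn.Module):
    """Sketch of a generative VQG model: image features initialize the hidden
    state of a GRU decoder that predicts the next question word at each step."""

    def __init__(self, vocab_size, image_dim=2048, hidden_dim=512, embed_dim=300):
        super().__init__()
        self.init_state = nn.Linear(image_dim, hidden_dim)  # image -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, question_tokens):
        # image_features: (batch, image_dim) from a pretrained CNN
        # question_tokens: (batch, seq_len) gold question for teacher forcing
        h0 = torch.tanh(self.init_state(image_features)).unsqueeze(0)
        outputs, _ = self.gru(self.embed(question_tokens), h0)
        return self.out(outputs)  # (batch, seq_len, vocab) next-word logits


# Training-step sketch: cross-entropy between predictions and shifted gold tokens.
model = ImageToQuestionGenerator(vocab_size=8000)
img = torch.randn(4, 2048)
toks = torch.randint(0, 8000, (4, 12))
logits = model(img, toks[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 8000), toks[:, 1:].reshape(-1))
```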
Flamingo: a visual language model for few-shot learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endowing them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question answering, where the model is prompted with a question it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
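The abstract describes conditioning on arbitrarily interleaved visual and textual inputs for few-shot prompting. The sketch below only illustrates how such an interleaved prompt might be assembled for few-shot VQA; the ImageToken type, the helper function, and the prompt wording are hypothetical and are not Flamingo's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ImageToken:
    path: str  # placeholder for an image or video input in the sequence

Prompt = List[Union[str, ImageToken]]

def build_few_shot_vqa_prompt(examples: List[Tuple[str, str, str]],
                              query_image: str, query_question: str) -> Prompt:
    """Interleave (image, question, answer) support examples with a final query.

    A model trained on interleaved image/text data is expected to continue the
    text after the last "A:", producing the answer for the query image.
    """
    prompt: Prompt = []
    for image_path, question, answer in examples:
        prompt += [ImageToken(image_path), f"Q: {question} A: {answer}\n"]
    prompt += [ImageToken(query_image), f"Q: {query_question} A:"]
    return prompt


# Two-shot VQA prompt; the model would generate the answer after the final "A:".
support = [
    ("cat.jpg", "What animal is this?", "A cat."),
    ("kitchen.jpg", "Where was this photo taken?", "In a kitchen."),
]
prompt = build_few_shot_vqa_prompt(support, "beach.jpg", "What is the weather like?")
```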