43,253 research outputs found

    Visual Question Answering: A Survey of Methods and Datasets

    Full text link
    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datatsets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.Comment: 25 page

    Reading Scene Text in Deep Convolutional Sequences

    Full text link
    We develop a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labelling problem. We leverage recent advances of deep convolutional neural networks to generate an ordered high-level sequence from a whole word image, avoiding the difficult character segmentation problem. Then a deep recurrent model, building on long short-term memory (LSTM), is developed to robustly recognize the generated CNN sequences, departing from most existing approaches recognising each character independently. Our model has a number of appealing properties in comparison to existing scene text recognition methods: (i) It can recognise highly ambiguous words by leveraging meaningful context information, allowing it to work reliably without either pre- or post-processing; (ii) the deep CNN feature is robust to various image distortions; (iii) it retains the explicit order information in word image, which is essential to discriminate word strings; (iv) the model does not depend on pre-defined dictionary, and it can process unknown words and arbitrary strings. Codes for the DTRN will be available.Comment: To appear in the 13th AAAI Conference on Artificial Intelligence (AAAI-16), 201

    Deep Interactive Region Segmentation and Captioning

    Full text link
    With recent innovations in dense image captioning, it is now possible to describe every object of the scene with a caption while objects are determined by bounding boxes. However, interpretation of such an output is not trivial due to the existence of many overlapping bounding boxes. Furthermore, in current captioning frameworks, the user is not able to involve personal preferences to exclude out of interest areas. In this paper, we propose a novel hybrid deep learning architecture for interactive region segmentation and captioning where the user is able to specify an arbitrary region of the image that should be processed. To this end, a dedicated Fully Convolutional Network (FCN) named Lyncean FCN (LFCN) is trained using our special training data to isolate the User Intention Region (UIR) as the output of an efficient segmentation. In parallel, a dense image captioning model is utilized to provide a wide variety of captions for that region. Then, the UIR will be explained with the caption of the best match bounding box. To the best of our knowledge, this is the first work that provides such a comprehensive output. Our experiments show the superiority of the proposed approach over state-of-the-art interactive segmentation methods on several well-known datasets. In addition, replacement of the bounding boxes with the result of the interactive segmentation leads to a better understanding of the dense image captioning output as well as accuracy enhancement for the object detection in terms of Intersection over Union (IoU).Comment: 17, pages, 9 figure

    Visual pathways from the perspective of cost functions and multi-task deep neural networks

    Get PDF
    Vision research has been shaped by the seminal insight that we can understand the higher-tier visual cortex from the perspective of multiple functional pathways with different goals. In this paper, we try to give a computational account of the functional organization of this system by reasoning from the perspective of multi-task deep neural networks. Machine learning has shown that tasks become easier to solve when they are decomposed into subtasks with their own cost function. We hypothesize that the visual system optimizes multiple cost functions of unrelated tasks and this causes the emergence of a ventral pathway dedicated to vision for perception, and a dorsal pathway dedicated to vision for action. To evaluate the functional organization in multi-task deep neural networks, we propose a method that measures the contribution of a unit towards each task, applying it to two networks that have been trained on either two related or two unrelated tasks, using an identical stimulus set. Results show that the network trained on the unrelated tasks shows a decreasing degree of feature representation sharing towards higher-tier layers while the network trained on related tasks uniformly shows high degree of sharing. We conjecture that the method we propose can be used to analyze the anatomical and functional organization of the visual system and beyond. We predict that the degree to which tasks are related is a good descriptor of the degree to which they share downstream cortical-units.Comment: 16 pages, 5 figure

    A model of emotional influence on memory processing.

    Get PDF
    To survive in a complex environment, agents must be able to encode information about the utility value of the objects they meet. We propose a neuroscience-based model aiming to explain how a new memory is associated to an emotional response. The same theoretical framework also explains the effects of emotion on memory recall. The originality of our approach is to postulate the presence of two central processing units (CPUs): one computing only emotional information, and the other mainly concerned with cognitive processing. The emotional CPU, which is phylogenetically older, is assumed to modulate the cognitive CPU, which is more recent. The article first deals with the cognitive part of the model by highlighting the set of processes underlying memory recognition and storage. Then, building on this theoretical background, the emotional part highlights how the emotional response is computed and stored. The last section describes the interplay between the cognitive and emotional systems

    Visual Entailment: A Novel Task for Fine-Grained Image Understanding

    Get PDF
    Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/ necla-ml/SNLI-VE

    Consciousness CLEARS the Mind

    Full text link
    A full understanding of consciouness requires that we identify the brain processes from which conscious experiences emerge. What are these processes, and what is their utility in supporting successful adaptive behaviors? Adaptive Resonance Theory (ART) predicted a functional link between processes of Consciousness, Learning, Expectation, Attention, Resonance, and Synchrony (CLEARS), includes the prediction that "all conscious states are resonant states." This connection clarifies how brain dynamics enable a behaving individual to autonomously adapt in real time to a rapidly changing world. The present article reviews theoretical considerations that predicted these functional links, how they work, and some of the rapidly growing body of behavioral and brain data that have provided support for these predictions. The article also summarizes ART models that predict functional roles for identified cells in laminar thalamocortical circuits, including the six layered neocortical circuits and their interactions with specific primary and higher-order specific thalamic nuclei and nonspecific nuclei. These prediction include explanations of how slow perceptual learning can occur more frequently in superficial cortical layers. ART traces these properties to the existence of intracortical feedback loops, and to reset mechanisms whereby thalamocortical mismatches use circuits such as the one from specific thalamic nuclei to nonspecific thalamic nuclei and then to layer 4 of neocortical areas via layers 1-to-5-to-6-to-4.National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624
    corecore