
    A Diagram Is Worth A Dozen Images

    Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation and reasoning: the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams, and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships, covering over 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.
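    The abstract describes a DPG as a graph whose nodes are diagram constituents and whose edges carry the relationships between them. A minimal illustrative sketch of such a structure follows; the class and field names are hypothetical, not the authors' implementation.

```python
# Minimal sketch of a diagram parse graph: nodes are diagram
# constituents (objects, text labels, arrows) and edges carry a
# relationship type between them. All names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Constituent:
    cid: int
    kind: str   # e.g. "object", "text", "arrow"
    text: str = ""


@dataclass
class DPG:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation)

    def add(self, c: Constituent):
        self.nodes[c.cid] = c

    def relate(self, src: int, dst: int, relation: str):
        self.edges.append((src, dst, relation))

    def neighbors(self, cid: int):
        return [(d, r) for s, d, r in self.edges if s == cid]


# A food-chain diagram fragment: an arrow connecting grass to rabbit.
g = DPG()
g.add(Constituent(0, "object", "grass"))
g.add(Constituent(1, "object", "rabbit"))
g.relate(0, 1, "arrowConnects")
print(g.neighbors(0))  # [(1, 'arrowConnects')]
```

    A question-answering model can then attend over such nodes and edges rather than over raw pixels, which is the role of the paper's DPG-based attention model.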

    Visual Question Answering: A Survey of Methods and Datasets

    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
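    The "common feature space" approach the survey highlights can be sketched in a few lines: project a CNN image feature and an RNN question encoding into a shared space, fuse them, and classify over an answer vocabulary. The dimensions, weights, and elementwise-product fusion below are illustrative assumptions, not any specific surveyed model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for backbone outputs: a 2048-d CNN image feature and a
# 512-d RNN question encoding (dimensions are illustrative).
img_feat = rng.standard_normal(2048)
q_feat = rng.standard_normal(512)

d = 256           # common embedding size (assumption)
n_answers = 1000  # answer vocabulary treated as classification targets

W_img = rng.standard_normal((d, 2048)) * 0.01
W_q = rng.standard_normal((d, 512)) * 0.01
W_out = rng.standard_normal((n_answers, d)) * 0.01

# Map both modalities into the common space, fuse by elementwise
# product (a common baseline choice), then classify over answers.
z = np.tanh(W_img @ img_feat) * np.tanh(W_q @ q_feat)
logits = W_out @ z
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (1000,)
```

    Memory-augmented and modular architectures replace the single fusion step with iterated reads over a memory or a composed program of modules, but the joint-embedding skeleton above is the baseline they are compared against.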

    Analyzing and Interpreting Neural Networks for NLP: A Report on the First BlackboxNLP Workshop

    The EMNLP 2018 workshop BlackboxNLP was dedicated to resources and techniques specifically developed for analyzing and understanding the inner workings and representations acquired by neural models of language. Approaches included: systematically manipulating the input to neural networks and investigating the impact on their performance; testing whether interpretable knowledge can be decoded from intermediate representations acquired by neural networks; proposing modifications to neural network architectures to make their knowledge state or generated output more explainable; and examining the performance of networks on simplified or formal languages. Here we review a number of representative studies in each category.
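    The second approach listed, decoding interpretable knowledge from intermediate representations, is commonly realized as a probing (diagnostic) classifier: a simple model trained on frozen hidden states to predict a linguistic property. The sketch below uses synthetic "hidden states" where one dimension encodes a binary property; the setup and names are illustrative, not any particular workshop study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for frozen hidden states from a trained model.
# Suppose one hidden dimension happens to encode a binary property
# (e.g. singular vs. plural); a linear probe should recover it.
n, d = 200, 16
H = rng.standard_normal((n, d))
labels = (H[:, 3] > 0).astype(float)  # "property" hidden in dim 3

# Logistic-regression probe trained by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
    grad = p - labels
    w -= lr * (H.T @ grad) / n
    b -= lr * grad.mean()

p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
acc = ((p > 0.5) == labels).mean()
print(f"probe accuracy: {acc:.2f}")  # high accuracy => property decodable
```

    The inference runs in one direction only: high probe accuracy shows the property is linearly decodable from the representation, not that the model actually uses it.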

    Visual question answering with modules and language modeling

    The primary focus of this thesis is to learn modularized representations for the task of Visual Question Answering (VQA). Learning such representations holds the potential to generalize to the higher-order reasoning that is prevalent in human beings. Chapter 1 discusses the literature related to VQA, modular networks and neural structure optimization. In particular, it first details the different datasets proposed to study this task. VQA models can be categorized into two groups based on the datasets they suit. The first concerns open-ended questions about natural images; these questions are mostly about a few objects or persons present in the image and do not require any significant reasoning capability to answer. The second comprises questions (mostly on synthetic images) that test the ability of models to perform compositional reasoning. We discuss the different architectural variants of Neural Module Networks (NMN). Finally, we discuss approaches to learning neural network structures or modules for tasks other than VQA.
    In Chapter 2, we describe a way to sparsely execute a CNN model (ResNeXt [110]) and save computation in the process. We use a mixture-of-experts formulation to execute only the top-K experts in each convolutional block. The most important experts are selected by a gate controller that applies question-guided attention followed by fully connected layers to assign weights to the set of experts. Our experiments show that it is possible to obtain large savings in FLOP count with only minimal degradation in performance.
    Chapter 3 is a prologue to Chapter 4: it states the key contributions and introduces the research problem the article addresses. Chapter 4 contains the article itself. Here, we are interested in learning the internal structure of the modules in Neural Module Networks (NMN) [3, 37]. We introduce a novel form of module structure built from elementary arithmetic operations; the task is then to learn the weights of these operations to form the module structure. We cast the problem as bi-level optimization, in which the model takes alternating gradient-descent steps in the architecture and weight spaces. Chapter 5 discusses additional experiments and ablation studies carried out in the context of that article.
    Most works in the literature use a recurrent neural network such as an LSTM [33] or GRU [13] to model the question features. However, LSTMs can fail to properly encode syntactic features of the question that may be vital to answering some VQA questions [87]. Recently, [76] showed the utility of language modeling for question answering. With this motivation, we try to learn a better language model that can be trained in an unsupervised manner. In Chapter 6, we describe a recursive language-modeling network whose structure aligns with natural language. More technically, we make use of an unsupervised parsing model (the Parsing-Reading-Predict Network, or PRPN [86]) and augment its prediction step with a TreeLSTM [99] that uses the intermediate tree structure produced by the PRPN to output a hidden state. The predict step of the PRPN then uses a hidden state that is a weighted combination of the TreeLSTM's hidden state and the one obtained from structured attention. This lets the model perform unsupervised parsing and also capture long-term dependencies, since the structure now exists explicitly in the model. Our experiments demonstrate that this model improves over the PRPN baseline on language modeling on the Penn Treebank dataset.
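    The question-gated top-K mixture-of-experts idea from Chapter 2 can be sketched compactly: a gate scores the experts from the question encoding, only the top-K are executed, and their outputs are combined with renormalized weights. Dimensions, the linear "experts", and the gate below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

n_experts, k = 8, 2           # run only the top-2 of 8 experts
d_in, d_out, d_q = 32, 32, 64

# Each "expert" is a small linear map standing in for a conv branch.
experts = [rng.standard_normal((d_out, d_in)) * 0.1 for _ in range(n_experts)]
W_gate = rng.standard_normal((n_experts, d_q)) * 0.1  # question-conditioned gate

x = rng.standard_normal(d_in)   # block input features
q = rng.standard_normal(d_q)    # question encoding

scores = W_gate @ q
top = np.argsort(scores)[-k:]   # indices of the top-k experts
weights = np.exp(scores[top])
weights /= weights.sum()        # renormalize over the selected experts

# Only the selected experts are executed; the FLOPs of the other
# n_experts - k branches are skipped entirely.
y = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
print(y.shape)  # (32,)
```

    The FLOP saving comes directly from the hard top-K selection: with k=2 of 8 experts, roughly three quarters of the per-block expert computation is never run.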

    Structure Learning for Neural Module Networks

    Neural Module Networks, originally proposed for the task of visual question answering, are a class of neural network architectures that involve human-specified neural modules, each designed for a specific form of reasoning. In current formulations of such networks, only the parameters of the neural modules and/or the order of their execution are learned. In this work, we further expand this approach and also learn the underlying internal structure of modules in terms of the ordering and combination of simple and elementary arithmetic operators. Our results show that one is indeed able to simultaneously learn both internal module structure and module sequencing without extra supervisory signals for module execution sequencing. With this approach, we report performance comparable to models using hand-designed modules.
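    A common way to make such a discrete structure learnable is to relax it: the module output becomes a softmax-weighted mixture of candidate elementary operations, so the architecture logits can take gradient steps alongside the ordinary weights. The candidate operations and names below are an illustrative sketch of this relaxation, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Candidate elementary operations a module may compose (illustrative).
ops = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": lambda a, b: np.maximum(a, b),
    "left": lambda a, b: a,
}

alpha = rng.standard_normal(len(ops))  # architecture logits, to be learned


def module(a, b, alpha):
    """Soft module output: softmax-weighted mix of the candidate ops."""
    w = np.exp(alpha - alpha.max())
    w /= w.sum()
    return sum(wi * op(a, b) for wi, op in zip(w, ops.values()))


a, b = rng.standard_normal(8), rng.standard_normal(8)
out = module(a, b, alpha)
print(out.shape)  # (8,)

# In a bi-level setup, alpha and the module weights are updated with
# alternating gradient steps; at the end, the argmax over alpha picks
# the discrete operation(s) that define the learned module structure.
```

    Because the mixture is differentiable in alpha, the search over module structures reduces to ordinary gradient descent, with the discrete structure read off only after training.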