359,277 research outputs found

    Learning Convolutional Text Representations for Visual Question Answering

    Full text link
    Visual question answering is a recently proposed artificial intelligence task that requires a deep understanding of both images and texts. In deep learning, images are typically modeled through convolutional neural networks, and texts are typically modeled through recurrent neural networks. While the requirement for modeling images is similar to traditional computer vision tasks, such as object recognition and image classification, visual question answering raises a different need for textual representation as compared to other natural language processing tasks. In this work, we perform a detailed analysis on natural language questions in visual question answering. Based on the analysis, we propose to rely on convolutional neural networks for learning textual representations. By exploring the various properties of convolutional neural networks specialized for text data, such as width and depth, we present our "CNN Inception + Gate" model. We show that our model improves question representations and thus the overall accuracy of visual question answering models. We also show that the text representation requirement in visual question answering is more complicated and comprehensive than that in conventional natural language processing tasks, making it a better task to evaluate textual representation methods. Shallow models like fastText, which can obtain comparable results with deep learning models in tasks like text classification, are not suitable in visual question answering.Comment: Conference paper at SDM 2018. https://github.com/divelab/sva

    Enhancing natural language understanding using meaning representation and deep learning

    Get PDF
    Natural Language Understanding (NLU) is one of the complex tasks in artificial intelligence. Machine learning was introduced to address the complex and dynamic nature of natural language. Deep learning gained popularity within the NLU community due to its capability of learning features directly from data, as well as learning from the dynamic nature of natural language. Furthermore, deep learning has shown to be able to learn the hidden feature(s) automatically and outperform most of the other machine learning approaches for NLU. Deep learning models require natural language inputs to be converted to vectors (word embedding). Word2Vec and GloVe are word embeddings which are designed to capture the analogy context-based statistics and provide lexical relations on words. Using the context-based statistical approach does not capture the prior knowledge required to understand language combined with words. Although a deep learning model receives word embedding, language understanding requires Reasoning, Attention and Memory (RAM). RAM are key factors in understanding language. Current deep learning models focus either on reasoning, attention or memory. In order to properly understand a language however, all three factors of RAM should be considered. Also, a language normally has a long sequence. This long sequence creates dependencies which are required in order to understand a language. However, current deep learning models, which are developed to hold longer sequences, either forget or get affected by the vanishing or exploding gradient descent. In this thesis, these three main areas are of focus. A word embedding technique, which integrates analogy context-based statistical and semantic relationships, as well as extracts from a knowledge base to hold enhanced meaning representation, is introduced. Also, a Long Short-Term Reinforced Memory (LSTRM) network is introduced. This addresses RAM and is validated by testing on question answering data sets which require RAM. Finally, a Long Term Memory Network (LTM) is introduced to address language modelling. Good language modelling requires learning from long sequences. Therefore, this thesis demonstrates that integrating semantic knowledge and a knowledge base generates enhanced meaning and deep learning models that are capable of achieving RAM and long-term dependencies so as to improve the capability of NLU
    • …
    corecore