178,178 research outputs found
Learning Convolutional Text Representations for Visual Question Answering
Visual question answering is a recently proposed artificial intelligence task
that requires a deep understanding of both images and texts. In deep learning,
images are typically modeled through convolutional neural networks, and texts
are typically modeled through recurrent neural networks. While the requirement
for modeling images is similar to traditional computer vision tasks, such as
object recognition and image classification, visual question answering raises a
different need for textual representation as compared to other natural language
processing tasks. In this work, we perform a detailed analysis on natural
language questions in visual question answering. Based on the analysis, we
propose to rely on convolutional neural networks for learning textual
representations. By exploring the various properties of convolutional neural
networks specialized for text data, such as width and depth, we present our
"CNN Inception + Gate" model. We show that our model improves question
representations and thus the overall accuracy of visual question answering
models. We also show that the text representation requirement in visual
question answering is more complicated and comprehensive than that in
conventional natural language processing tasks, making it a better task to
evaluate textual representation methods. Shallow models like fastText, which
can obtain comparable results with deep learning models in tasks like text
classification, are not suitable in visual question answering.Comment: Conference paper at SDM 2018. https://github.com/divelab/sva
What do Neural Machine Translation Models Learn about Morphology?
Neural machine translation (MT) models obtain state-of-the-art performance
while maintaining a simple, end-to-end architecture. However, little is known
about what these models learn about source and target languages during the
training process. In this work, we analyze the representations learned by
neural MT models at various levels of granularity and empirically evaluate the
quality of the representations for learning morphology through extrinsic
part-of-speech and morphological tagging tasks. We conduct a thorough
investigation along several parameters: word-based vs. character-based
representations, depth of the encoding layer, the identity of the target
language, and encoder vs. decoder representations. Our data-driven,
quantitative evaluation sheds light on important aspects in the neural MT
system and its ability to capture word structure.Comment: Updated decoder experiment
Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
We propose to directly map raw visual observations and text input to actions
for instruction execution. While existing approaches assume access to
structured environment representations or use a pipeline of separately trained
models, we learn a single model to jointly reason about linguistic and visual
input. We use reinforcement learning in a contextual bandit setting to train a
neural network agent. To guide the agent's exploration, we use reward shaping
with different forms of supervision. Our approach does not require intermediate
representations, planning procedures, or training different models. We evaluate
in a simulated environment, and show significant improvements over supervised
learning and common reinforcement learning variants.Comment: In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP), 201
- …