Search CORE

141,065 research outputs found

Empirically-based control of natural language generation

Author: Evans Roger
Paiva Daniel
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2005
Field of study

Crossref

University of Brighton Research Portal

Jointly Modeling Embedding and Translation to Bridge Video and Language

Author: Houqiang Li
Tao Mei
Ting Yao
Yingwei Pan
Yong Rui
†
Publication venue
Publication date: 04/06/2015
Field of study

Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNN), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while the latter is to create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and visual content. Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural networks for learning powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. The experiments on YouTube2Text dataset show that our proposed LSTM-E achieves to-date the best reported performance in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior in predicting Subject-Verb-Object (SVO) triplets to several state-of-the-art techniques

arXiv.org e-Print Archive

CiteSeerX

Crossref

Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

Author: Artzi Yoav
Langford John
Misra Dipendra
Publication venue
Publication date: 01/01/2017
Field of study

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.Comment: In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 201

arXiv.org e-Print Archive

Crossref