Towards an Indexical Model of Situated Language Comprehension for Cognitive Agents in Physical Worlds
We propose a computational model of situated language comprehension based on
the Indexical Hypothesis that generates meaning representations by translating
amodal linguistic symbols to modal representations of beliefs, knowledge, and
experience external to the linguistic system. This Indexical Model incorporates
multiple information sources, including perceptions, domain knowledge, and
short-term and long-term experiences during comprehension. We show that
exploiting diverse information sources can alleviate ambiguities that arise
from contextual use of underspecific referring expressions and unexpressed
argument alternations of verbs. The model is being used to support linguistic
interactions in Rosie, an agent implemented in Soar that learns from
instruction.
Comment: Advances in Cognitive Systems 3 (2014
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Understanding language goes hand in hand with the ability to integrate
complex contextual information obtained via perception. In this work, we
present a novel task for grounded language understanding: disambiguating a
sentence given a visual scene which depicts one of the possible interpretations
of that sentence. To this end, we introduce a new multimodal corpus containing
ambiguous sentences, representing a wide range of syntactic, semantic and
discourse ambiguities, coupled with videos that visualize the different
interpretations for each sentence. We address this task by extending a vision
model which determines if a sentence is depicted by a video. We demonstrate how
such a model can be adjusted to recognize different interpretations of the same
underlying sentence, making it possible to disambiguate sentences in a unified
fashion across the different ambiguity types.
Comment: EMNLP 201
Learning language through pictures
We propose Imaginet, a model of learning visually grounded representations of
language from coupled textual and visual input. The model consists of two Gated
Recurrent Unit networks with shared word embeddings, and uses a multi-task
objective by receiving a textual description of a scene and trying to
concurrently predict its visual representation and the next word in the
sentence. Mimicking an important aspect of human language learning, it acquires
meaning representations for individual words from descriptions of visual
scenes. Moreover, it learns to effectively use sequential structure in semantic
interpretation of multi-word phrases.
Comment: To appear at ACL 201
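The multi-task objective described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of mean squared error for the visual prediction, and the interpolation weight `alpha` are all assumptions made for illustration. The abstract states only that the model concurrently predicts the visual representation and the next word.

```python
import math


def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy(probabilities, target_index):
    """Negative log-probability assigned to the correct next word."""
    return -math.log(probabilities[target_index])


def multitask_loss(visual_pred, visual_target,
                   next_word_probs, next_word_index, alpha=0.5):
    """Combine the two task losses with a hypothetical weight `alpha`.

    alpha = 1.0 trains only next-word prediction;
    alpha = 0.0 trains only visual-representation prediction.
    """
    visual_loss = mean_squared_error(visual_pred, visual_target)
    textual_loss = cross_entropy(next_word_probs, next_word_index)
    return alpha * textual_loss + (1.0 - alpha) * visual_loss
```

In a full model, both losses would be computed from the outputs of the two recurrent pathways that share word embeddings. Here they are plain lists so the combination step can be inspected in isolation.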
OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts
Generating captions for images is a task that has recently received
considerable attention. In this work we focus on caption generation for
abstract scenes, or object layouts where the only information provided is a set
of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence
model that encodes a set of objects and their locations as an input sequence
using an LSTM network, and decodes this representation using an LSTM language
model. We show that our model, despite encoding object layouts as a sequence,
can represent spatial relationships between objects, and generate descriptions
that are globally coherent and semantically relevant. We test our approach in a
task of object-layout captioning by using only object annotations as inputs. We
additionally show that our model, combined with a state-of-the-art object
detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score)
in the test benchmark of the standard MS-COCO Captioning task.
Comment: Accepted at EMNLP 201
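To make the idea of encoding an object layout as a sequence concrete, the sketch below turns an unordered set of objects into an ordered list of (category, location) steps of the kind a sequence encoder could consume. The `LayoutObject` type, the normalized coordinates, and the left-to-right ordering rule are assumptions made for illustration. The abstract does not specify how objects are ordered or how locations are represented.

```python
from typing import List, NamedTuple, Tuple


class LayoutObject(NamedTuple):
    """Hypothetical object annotation: a category label and a normalized position."""
    category: str
    x: float
    y: float


def layout_to_sequence(objects: List[LayoutObject]) -> List[Tuple[str, Tuple[float, float]]]:
    """Order objects deterministically and emit one (category, location) step per object.

    The ordering (left to right, then top to bottom, then by name) is an
    assumed convention that only serves to make the sequence reproducible.
    """
    ordered = sorted(objects, key=lambda o: (o.x, o.y, o.category))
    return [(o.category, (o.x, o.y)) for o in ordered]


layout = [LayoutObject("dog", 0.7, 0.5), LayoutObject("person", 0.2, 0.4)]
sequence = layout_to_sequence(layout)
```

Each step of `sequence` would then be embedded and fed to the encoder. The decoder side, which generates the caption, is not sketched here.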
Grounded Semantic Composition for Visual Scenes
We present a visually-grounded language understanding model based on a study
of how people verbally describe objects in scenes. The emphasis of the model is
on the combination of individual word meanings to produce meanings for complex
referring expressions. The model has been implemented, and it is able to
understand a broad range of spatial referring expressions. We describe our
implementation of word level visually-grounded semantics and their embedding in
a compositional parsing framework. The implemented system selects the correct
referents in response to natural language expressions for a large percentage of
test cases. In an analysis of the system's successes and failures we reveal how
visual context influences the semantics of utterances and propose future
extensions to the model that take such context into account.
Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
We present a multi-modal dialogue system for interactive learning of
perceptually grounded word meanings from a human tutor. The system integrates
an incremental, semantic parsing/generation framework - Dynamic Syntax and Type
Theory with Records (DS-TTR) - with a set of visual classifiers that are
learned throughout the interaction and which ground the meaning representations
that it produces. We use this system in interaction with a simulated human
tutor to study the effects of different dialogue policies and capabilities on
the accuracy of learned meanings, learning rates, and efforts/costs to the
tutor. We show that the overall performance of the learning agent is affected
by (1) who takes initiative in the dialogues; (2) the ability to express/use
their confidence level about visual attributes; and (3) the ability to process
elliptical and incrementally constructed dialogue turns. Ultimately, we train
an adaptive dialogue policy which optimises the trade-off between classifier
accuracy and tutoring costs.
Comment: 11 pages, SIGDIAL 2016 Conferenc