Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments
Referring to objects in a natural and unambiguous manner is crucial for
effective human-robot interaction. Previous research on learning-based
referring expressions has focused primarily on comprehension tasks, while
generating referring expressions is still mostly limited to rule-based methods.
In this work, we propose a two-stage approach that relies on deep learning for
estimating spatial relations to describe an object naturally and unambiguously
with a referring expression. We compare our method to the state-of-the-art
algorithm in ambiguous environments (e.g., environments that include very
similar objects with similar relationships). We show that our method generates
referring expressions that people find to be more accurate (30% better)
and would prefer to use (32% more often).
Comment: International Conference on Intelligent Robots and Systems (IROS
2019), Demo 1: Finding the described object (https://youtu.be/BE6-F6chW0w),
Demo 2: Referring to the pointed object (https://youtu.be/nmmv6JUpy8M),
Supplementary Video (https://youtu.be/sFjBa_MHS98)
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
Referring expressions are natural language constructions used to identify
particular objects within a scene. In this paper, we propose a unified
framework for the tasks of referring expression comprehension and generation.
Our model is composed of three modules: speaker, listener, and reinforcer. The
speaker generates referring expressions, the listener comprehends referring
expressions, and the reinforcer introduces a reward function to guide sampling
of more discriminative expressions. The listener-speaker modules are trained
jointly in an end-to-end learning framework, allowing the modules to be aware
of one another during learning while also benefiting from the discriminative
reinforcer's feedback. We demonstrate that this unified framework and training
achieves state-of-the-art results for both comprehension and generation on
three referring expression datasets. Project and demo page:
https://vision.cs.unc.edu/refer
Comment: Some typos fixed; comprehension results on RefCOCOg updated; more
human evaluation results added
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Recognising objects according to a pre-defined fixed set of class labels has
been well studied in computer vision. However, there are a great many practical
applications where the subjects that may be of interest are not known
beforehand, or are not so easily delineated. In many of these cases natural
language dialog is a natural way to specify the subject of interest, and the
task of achieving this capability (a.k.a. Referring Expression Comprehension) has
recently attracted attention. To this end we propose a unified framework, the
ParalleL AttentioN (PLAN) network, to discover the object in an image that is
being referred to in variable-length natural expression descriptions, from
short phrase queries to long multi-round dialogs. The PLAN network has two
attention mechanisms that relate parts of the expressions to both the global
visual content and also directly to object candidates. Furthermore, the
attention mechanisms are recurrent, making the referring process visualizable
and explainable. The attended information from these dual sources is combined
to reason about the referred object. These two attention mechanisms can be
trained in parallel and we find the combined system outperforms the
state of the art on several benchmark datasets with language input of
different lengths, such as RefCOCO, RefCOCO+ and GuessWhat?!.
Comment: 11 pages
Towards an Indexical Model of Situated Language Comprehension for Cognitive Agents in Physical Worlds
We propose a computational model of situated language comprehension based on
the Indexical Hypothesis that generates meaning representations by translating
amodal linguistic symbols to modal representations of beliefs, knowledge, and
experience external to the linguistic system. This Indexical Model incorporates
multiple information sources, including perceptions, domain knowledge, and
short-term and long-term experiences during comprehension. We show that
exploiting diverse information sources can alleviate ambiguities that arise
from contextual use of underspecific referring expressions and unexpressed
argument alternations of verbs. The model is being used to support linguistic
interactions in Rosie, an agent implemented in Soar that learns from
instruction.
Comment: Advances in Cognitive Systems 3 (2014)
Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions
Comprehension of spoken natural language is an essential component for robots
to communicate with humans effectively. However, handling unconstrained spoken
instructions is challenging due to (1) complex structures including a wide
variety of expressions used in spoken language and (2) inherent ambiguity in
interpretation of human instructions. In this paper, we propose the first
comprehensive system that can handle unconstrained spoken language and is able
to effectively resolve ambiguity in spoken instructions. Specifically, we
integrate deep-learning-based object detection together with natural language
processing technologies to handle unconstrained spoken instructions, and
propose a method for robots to resolve instruction ambiguity through dialogue.
Through our experiments both in a simulated environment and on a physical
industrial robot arm, we demonstrate the ability of our system to understand
natural instructions from human operators effectively, and how higher success
rates of the object picking task can be achieved through an interactive
clarification process.
Comment: 9 pages. International Conference on Robotics and Automation (ICRA)
2018. Accompanying videos are available at the following links:
https://youtu.be/_Uyv1XIUqhk (the system submitted to ICRA-2018) and
http://youtu.be/DGJazkyw0Ws (with improvements after ICRA-2018 submission)