Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive the
world more as continuous video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion. Humans also gaze at the object when they issue a referring
expression. Existing work on OR focuses mostly on static images, which fall
short of providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
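The integration of appearance, motion, gaze, and spatio-temporal context can be illustrated with a minimal late-fusion sketch. This is not the authors' network; the cue names, weights, and scoring function below are illustrative assumptions only, standing in for learned feature branches.

```python
# Hypothetical sketch (not the paper's model): late fusion of per-cue
# scores for object referring. Each candidate object carries a score in
# [0, 1] per cue; a weighted sum picks the referred object.

def fuse_scores(candidates, weights):
    """candidates: list of dicts mapping cue name -> score in [0, 1].
    Returns the index of the highest-scoring candidate."""
    def fused(c):
        return sum(weights[cue] * c[cue] for cue in weights)
    return max(range(len(candidates)), key=lambda i: fused(candidates[i]))

# Assumed cue weights; in the paper these roles are played by a learned network.
weights = {"appearance": 0.4, "motion": 0.2, "gaze": 0.3, "context": 0.1}
candidates = [
    {"appearance": 0.9, "motion": 0.1, "gaze": 0.2, "context": 0.5},
    {"appearance": 0.6, "motion": 0.8, "gaze": 0.9, "context": 0.7},
]
best = fuse_scores(candidates, weights)  # -> 1: motion and gaze cues win
```

The point of the sketch is the abstract's claim: an object that is weak on appearance alone can still be selected once motion and gaze evidence is fused in.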
PROGrasp: Pragmatic Human-Robot Communication for Object Grasping
Interactive Object Grasping (IOG) is the task of identifying and grasping the
desired object via human-robot natural language interaction. Current IOG
systems assume that a human user initially specifies the target object's
category (e.g., bottle). Inspired by pragmatics, where humans often convey
their intentions by relying on context to achieve goals, we introduce a new IOG
task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented
Multi-modal Dialogue (IM-Dial). In our proposed task scenario, an
intention-oriented utterance (e.g., "I am thirsty") is initially given to the
robot. The robot should then identify the target object by interacting with a
human user. Based on the task setup, we propose a new robotic system that can
interpret the user's intention and pick up the target object, Pragmatic Object
Grasping (PROGrasp). PROGrasp performs Pragmatic-IOG by incorporating modules
for visual grounding, question asking, object grasping, and most importantly,
answer interpretation for pragmatic inference. Experimental results show that
PROGrasp is effective in offline (i.e., target object discovery) and online
(i.e., IOG with a physical robot arm) settings.
Comment: 7 pages, 6 figures
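The module pipeline described above (visual grounding, question asking, answer interpretation, grasping) can be sketched as a simple disambiguation loop. This is not the PROGrasp implementation; the function names and the simulated "thirsty" scenario are illustrative assumptions.

```python
# Hypothetical sketch (not PROGrasp's code): the Pragmatic-IOG loop
# alternates question asking and answer interpretation until a single
# grounded candidate remains, then grasps it.

def pragmatic_iog(candidates, ask, interpret, grasp):
    """candidates: set of object names; ask/interpret/grasp are stubs
    standing in for the dialogue and manipulation modules."""
    while len(candidates) > 1:
        question = ask(candidates)        # question-asking module
        keep = interpret(question)        # pragmatic inference on the reply
        candidates = {c for c in candidates if keep(c)}
    target = next(iter(candidates))
    grasp(target)                         # object-grasping module
    return target

# Simulated run: the user said "I am thirsty", so pragmatic inference
# narrows the visually grounded candidates to a drinkable object.
objects = {"bottle", "cup", "book"}

def ask(cands):
    return "Which of these do you want: %s?" % ", ".join(sorted(cands))

def interpret(question):
    # Stub: pretend the user's answer singles out the bottle.
    return lambda c: c == "bottle"

grasped = []
target = pragmatic_iog(objects, ask, interpret, grasped.append)  # -> "bottle"
```

The design choice the sketch mirrors is the abstract's: the robot never receives the object category up front; it recovers the target purely through interaction and answer interpretation.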