Object Referring in Visual Scene with Spoken Language
Object referring has important applications, especially for human-machine
interaction. While having received great attention, the task is mainly attacked
with written language (text) as input rather than spoken language (speech),
which is more natural. This paper investigates Object Referring with Spoken
Language (ORSpoken) by presenting two datasets and one novel approach. Objects
are annotated with their locations in images, text descriptions and speech
descriptions. This makes the datasets ideal for multi-modality learning. The
approach is developed by carefully decomposing the ORSpoken problem into three
sub-problems and introducing task-specific vision-language interactions at the
corresponding levels. Experiments show that our method outperforms competing
methods consistently and significantly. The approach is also evaluated in the
presence of audio noise, showing the efficacy of the proposed vision-language
interaction methods in counteracting background noise.
Comment: 10 pages, Submitted to WACV 201
QUESTION ANSWERING, GROUNDING, AND GENERATION FOR VISION AND LANGUAGE
One ultimate goal of AI is to develop an artificial intelligence (AI) system that can communicate with people in a natural way. Such communication includes, but is not limited to, asking us humans questions, answering our questions, conducting dialogue with human beings, and performing actions to better serve people. Imagine a future where service robots are everywhere, and we could ask our home robot to “grab me the red cup on the table.” To perform this command, the AI system needs to understand this spoken English sentence, perceive the visual world, navigate to the right place (the table), recognize the right object (the red cup), then grab it and finally return it to the commander. This single command already involves many techniques, such as speech recognition, language understanding, scene understanding, embodied navigation, object recognition, pose estimation, and robot manipulation. None of these techniques is fully solved yet, but progress toward them is rapid. This thesis advances our knowledge of the connections between vision, language, and beyond, pushing toward this ultimate goal. We study three popular vision-and-language tasks: visual question answering, language grounding, and image-to-text language generation. For each, we introduce a proposed novel task, accompanied by a high-quality dataset and well-performing data-driven approaches. Specifically, we first introduce Visual Madlibs for image-based and region-based question answering. Then we introduce referring expressions, where we study both referring expression comprehension and generation, covering both language grounding and generation. Next, we study album summarization, which not only selects the key photos in an album but also generates a natural language story describing the whole album.
Last but not least, we describe multi-target embodied question answering, a task even closer to our ultimate goal, requiring both language understanding and navigation ability from the AI system.
Doctor of Philosoph
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continuous video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
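As a rough illustration of the fusion idea described in this abstract (not the paper's actual network architecture; all names, dimensions, and the late-fusion-by-concatenation design are assumptions), scoring candidate objects from appearance, motion, gaze, and spatio-temporal context features could be sketched as:

```python
def fuse_and_score(appearance, motion, gaze, context, weights):
    """Fuse per-candidate modality features by concatenation and score
    them with a linear layer (here just a dot product).

    Each argument holds one plain list of floats per candidate object.
    This is a hypothetical sketch of multimodal late fusion, not the
    network proposed in the paper.
    """
    scores = []
    for a, m, g, c in zip(appearance, motion, gaze, context):
        fused = a + m + g + c  # list concatenation: one long feature vector
        scores.append(sum(f * w for f, w in zip(fused, weights)))
    return scores

# Toy usage: two candidate objects, 2-dim features per modality.
appearance = [[1.0, 0.0], [0.0, 1.0]]
motion = [[0.5, 0.5], [0.0, 0.0]]
gaze = [[1.0, 1.0], [0.1, 0.1]]
context = [[0.0, 0.2], [0.3, 0.0]]
scores = fuse_and_score(appearance, motion, gaze, context, weights=[1.0] * 8)
best = scores.index(max(scores))  # candidate the expression most likely refers to
```

In a learned system the `weights` would come from training; here a uniform weight vector simply makes the fusion and scoring mechanics concrete.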
Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions
Comprehension of spoken natural language is an essential component for robots
to communicate with humans effectively. However, handling unconstrained spoken
instructions is challenging due to (1) complex structures including a wide
variety of expressions used in spoken language and (2) inherent ambiguity in
interpretation of human instructions. In this paper, we propose the first
comprehensive system that can handle unconstrained spoken language and is able
to effectively resolve ambiguity in spoken instructions. Specifically, we
integrate deep-learning-based object detection together with natural language
processing technologies to handle unconstrained spoken instructions, and
propose a method for robots to resolve instruction ambiguity through dialogue.
Through our experiments on both a simulated environment as well as a physical
industrial robot arm, we demonstrate the ability of our system to understand
natural instructions from human operators effectively, and how higher success
rates of the object picking task can be achieved through an interactive
clarification process.
Comment: 9 pages. International Conference on Robotics and Automation (ICRA)
2018. Accompanying videos are available at the following links:
https://youtu.be/_Uyv1XIUqhk (the system submitted to ICRA-2018) and
http://youtu.be/DGJazkyw0Ws (with improvements after the ICRA-2018 submission)
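The clarification-through-dialogue idea in this abstract can be sketched as a simple loop: detect every object matching the instruction, and ask a disambiguating question whenever more than one matches. This is purely illustrative (the paper's system uses deep object detection and NLP; the dictionary fields and the color-based question below are assumptions):

```python
def resolve_instruction(instruction, detections, ask):
    """Return the single object referred to, asking for clarification
    when the instruction is ambiguous.

    `detections` is a list of dicts like {"label": "cup", "color": "red"};
    `ask` is a callback that poses a question and returns the user's answer.
    Hypothetical sketch, not the system described in the paper.
    """
    matches = [d for d in detections if d["label"] in instruction]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        return None
    # Ambiguous: distinguish the candidates by an attribute such as color.
    colors = sorted({d["color"] for d in matches})
    answer = ask(f"Which one: {' or '.join(colors)}?")
    chosen = [d for d in matches if d["color"] == answer]
    return chosen[0] if chosen else None

# Toy usage: two cups in the scene, so the system must ask before picking.
scene = [{"label": "cup", "color": "red"}, {"label": "cup", "color": "blue"}]
picked = resolve_instruction("grab the cup", scene, ask=lambda q: "red")
```

Passing `ask` as a callback keeps the dialogue channel abstract: the same logic works whether the answer arrives by speech recognition or a keyboard.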
Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments
Referring to objects in a natural and unambiguous manner is crucial for
effective human-robot interaction. Previous research on learning-based
referring expressions has focused primarily on comprehension tasks, while
generating referring expressions is still mostly limited to rule-based methods.
In this work, we propose a two-stage approach that relies on deep learning for
estimating spatial relations to describe an object naturally and unambiguously
with a referring expression. We compare our method to the state of the art
algorithm in ambiguous environments (e.g., environments that include very
similar objects with similar relationships). We show that our method generates
referring expressions that people find to be more accurate (30% better)
and would prefer to use (32% more often).
Comment: International Conference on Intelligent Robots and Systems (IROS
2019). Demo 1: Finding the described object (https://youtu.be/BE6-F6chW0w),
Demo 2: Referring to the pointed object (https://youtu.be/nmmv6JUpy8M),
Supplementary Video (https://youtu.be/sFjBa_MHS98)
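A minimal rule-based sketch conveys what "unambiguous" means here: use the bare label when it already identifies the object, and add a spatial qualifier only when needed. Note the paper replaces hand-coded rules like these with learned spatial relations; the object fields and the leftmost/rightmost qualifier below are illustrative assumptions:

```python
def refer(target, objects):
    """Generate a minimal distinguishing expression for `target`.

    Objects are dicts with a "label" and a horizontal position "x".
    If the label alone is unambiguous, use it; otherwise add a spatial
    qualifier. Hypothetical sketch, not the paper's learned generator.
    """
    same = [o for o in objects if o["label"] == target["label"]]
    if len(same) == 1:
        return f"the {target['label']}"
    side = "leftmost" if target["x"] == min(o["x"] for o in same) else "rightmost"
    return f"the {side} {target['label']}"

# Toy scene: two mugs (ambiguous by label) and one bowl (unambiguous).
objs = [{"label": "mug", "x": 0}, {"label": "mug", "x": 5}, {"label": "bowl", "x": 2}]
expr = refer(objs[1], objs)
```

Emitting the shortest expression that still singles out the target mirrors the goal the abstract describes: listeners find over-specified or under-specified references harder to resolve.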
The iconicity advantage in sign production: The case of bimodal bilinguals
Recent evidence demonstrates that pictures corresponding to iconic signs are named faster
than pictures corresponding to non-iconic signs. The present study investigates the locus of
the iconicity advantage in hearing bimodal bilinguals. A naming experiment with iconic and non-iconic
pictures in Italian Sign Language (LIS) was conducted. Bimodal bilinguals named the pictures
either using a noun construction that involved the production of the sign corresponding to the
picture or using a marked demonstrative pronoun construction replacing the picture sign. In this
last condition, the pictures were colored and participants were instructed to name the pronoun
together with the color. The iconicity advantage was reliable in the noun utterance but not in
the marked demonstrative pronoun utterance. In a third condition, the colored pictures were
presented as distractor stimuli and participants were required to name the color. In this condition,
distractor pictures with iconic signs elicited faster naming latencies than non-iconic signs. The
results suggest that the advantage of iconic signs in production arises at the level of semantic-to-phonological
links. In addition, we conclude that bimodal bilinguals and native signers do not differ
in terms of the activation flow within the sign production system.