16 research outputs found
Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments
Referring to objects in a natural and unambiguous manner is crucial for
effective human-robot interaction. Previous research on learning-based
referring expressions has focused primarily on comprehension tasks, while
generating referring expressions is still mostly limited to rule-based methods.
In this work, we propose a two-stage approach that relies on deep learning for
estimating spatial relations to describe an object naturally and unambiguously
with a referring expression. We compare our method to the state-of-the-art
algorithm in ambiguous environments (e.g., environments that include very
similar objects with similar relationships). We show that our method generates
referring expressions that people find to be more accurate (30% better)
and would prefer to use (32% more often).
Comment: International Conference on Intelligent Robots and Systems (IROS
2019), Demo 1: Finding the described object (https://youtu.be/BE6-F6chW0w),
Demo 2: Referring to the pointed object (https://youtu.be/nmmv6JUpy8M),
Supplementary Video (https://youtu.be/sFjBa_MHS98)
Searching for Ambiguous Objects in Videos using Relational Referring Expressions
Humans frequently use referring (identifying) expressions to refer to
objects. Especially in ambiguous settings, humans prefer expressions (called
relational referring expressions) that describe an object with respect to a
distinguishing, unique object. Unlike studies on video object search using
referring expressions, in this paper, our focus is on (i) relational referring
expressions in highly ambiguous settings, and (ii) methods that can both
generate and comprehend a referring expression. For this goal, we first
introduce a new dataset for video object search with referring expressions that
includes numerous copies of the objects, making it difficult to use
non-relational expressions. Moreover, we train two baseline deep networks on
this dataset, which show promising results. Finally, we propose a deep
attention network that significantly outperforms the baselines on our dataset.
The dataset and code are available at
https://github.com/hazananayurt/viref.
Comment: BMVC 2019 camera ready
Talking about Relations: Factors Influencing the Production of Relational Descriptions
In a production experiment (Experiment 1) and an acceptability rating experiment (Experiment 2), we assessed two factors, spatial position and salience, which may influence the production of relational descriptions (such as "the ball between the man and the drawer"). In Experiment 1, speakers were asked to refer unambiguously to a target object (a ball). In Experiment 1a, we addressed the role of spatial position, specifically whether speakers mention the entity positioned leftmost in the scene as the (first) relatum. The results showed a preference to start with the left entity, but only as a trend, which leaves room for other factors that could influence spatial reference. Thus, in the following studies, we varied salience systematically by making one of the relatum candidates animate (Experiment 1b) and by adding attention capture cues, first subliminally by priming one relatum candidate with a flash (Experiment 1c), then explicitly by using salient colors for objects (Experiment 1d). Results indicate that spatial position played a dominant role: entities on the left were mentioned more often as the (first) relatum than those on the right (Experiments 1a, 1b, 1c, 1d). Animacy affected reference production in one out of three studies (in Experiment 1d). When salience was manipulated by priming visual attention or by using salient colors, there were no significant effects (Experiments 1c, 1d). In the acceptability rating study (Experiment 2), participants expressed their preference for specific relata by ranking descriptions according to how well they thought the descriptions fitted the scene. Results show that participants most preferred the description that had an animate entity as the first-mentioned relatum. The relevance of these results for models of reference production is discussed.
QUESTION ANSWERING, GROUNDING, AND GENERATION FOR VISION AND LANGUAGE
One ultimate goal of AI is to develop an artificial intelligence (AI) system that can communicate with people in a natural way. Such communication includes, but is not limited to, asking humans questions, answering our questions, conducting dialogue with human beings, and performing actions to better serve people. Imagine a future where service robots are everywhere, and we could ask our home robot to “grab me the red cup on the table.” To perform this command, the AI system needs to understand the spoken English sentence, perceive the visual world, navigate to the right place (“table”), recognize the right object (“the red cup”), then grab it and finally bring it back to the commander. This single command already involves many techniques, such as speech recognition, language understanding, scene understanding, embodied navigation, object recognition, pose estimation, and robot manipulation. None of these techniques is well solved yet, but progress toward that goal is rapid. This thesis advances our knowledge by exploring various connections between vision, language, and beyond, in order to push toward this ultimate goal. We study three popular vision and language tasks: visual question answering, language grounding, and image-to-text language generation. Within each, we introduce our proposed novel task, accompanied by a high-quality dataset and well-performing data-driven approaches. Specifically, we first introduce Visual Madlibs for image-based and region-based question answering. Then we introduce referring expressions, where we study both referring expression comprehension and generation, covering both language grounding and generation. Next, we study album summarization, which not only selects the key photos inside an album but also generates a natural language story describing the whole album.
Last but not least, we describe multi-target embodied question answering, a task even closer to our ultimate goal, which requires both language understanding and navigation ability from the AI system.
Doctor of Philosophy
The effectiveness of dynamically processed incremental descriptions in human robot interaction
We explore the effectiveness of a dynamically processed incremental referring description system that starts from under-specified, ambiguous descriptions and then builds upon them using linguistic repair statements; we refer to this as a dynamic system. We build a dynamically processed incremental referring description generation system that is able to provide contextual navigational statements to describe an object in a potential real-world situation of nuclear waste sorting and maintenance. In a study of 31 participants, we test the dynamic system in a case where a user is remotely operating a robot to sort nuclear waste, with the robot assisting them in identifying the correct barrels to be removed. We compare it against a static non-ambiguous description given in the same scenario. As well as measuring efficiency in terms of time and distance, we also look at user preference. Results show that our dynamic system was a much more efficient method for finding the correct barrel, taking only 62% of the time on average. Participants also favoured our dynamic system.