2,151 research outputs found
Visual world studies of conversational perspective taking: similar findings, diverging interpretations
Visual-world eyetracking greatly expanded the potential for insight into how listeners access and use common ground during situated language comprehension. Past reviews of visual world studies on perspective taking have largely taken the diverging findings of the various studies at face value, and attributed these apparently different findings to differences in the extent to which the paradigms used by different labs afford collaborative interaction. Researchers are asking questions about perspective taking of an increasingly nuanced and sophisticated nature, a clear indicator of progress. But this research has the potential not only to improve our understanding of conversational perspective taking. Grappling with problems of data interpretation in such a complex domain has the unique potential to drive visual world researchers to a deeper understanding of how to best map visual world data onto psycholinguistic theory. I will argue against this interactional affordances explanation, on two counts. First, it implies that interactivity affects the overall ability to form common ground, and thus provides no straightforward explanation of why, within a single noninteractive study, common ground can have very large effects on some aspects of processing (referential anticipation) while having negligible effects on others (lexical processing). Second, and more importantly, the explanation accepts the divergence in published findings at face value. However, a closer look at several key studies shows that the divergences are more likely to reflect inconsistent practices of analysis and interpretation that have been applied to an underlying body of data that is, in fact, surprisingly consistent. The diverging interpretations, I will argue, are the result of differences in the handling of anticipatory baseline effects (ABEs) in the analysis of visual world data. ABEs arise in perspective-taking studies because listeners have earlier access to constraining information about who knows what than they have to referential speech, and thus can already show biases in visual attention even before the processing of any referential speech has begun. To be sure, these ABEs clearly indicate early access to common ground; however, access does not imply integration, since it is possible that this information is not used later to modulate the processing of incoming speech. Failing to account for these biases using statistical or experimental controls leads to over-optimistic assessments of listeners’ ability to integrate this information with incoming speech. I will show that several key studies with varying degrees of interactional affordances all show similar temporal profiles of common ground use during the interpretive process: early anticipatory effects, followed by bottom-up effects of lexical processing that are not modulated by common ground, followed (optionally) by further late effects that are likely to be post-lexical. Furthermore, this temporal profile for common ground radically differs from the profile of contextual effects related to verb semantics. Together, these findings are consistent with the proposal that lexical processes are encapsulated from common ground, but cannot be straightforwardly accounted for by probabilistic constraint-based approaches
The positive side of a negative reference: the delay between linguistic processing and common ground
Interlocutors converge on names to refer to entities. For example, a speaker might refer to a novel looking object as the jellyfish and, once identified, the listener will too. The hypothesized mechanism behind such referential precedents is a subject of debate. The common ground view claims that listeners register the object as well as the identity of the speaker who coined the label. The linguistic view claims that, once established, precedents are treated by listeners like any other linguistic unit, i.e. without needing to keep track of the speaker. To test predictions from each account, we used visual-world eyetracking, which allows observations in real time, during a standard referential communication task. Participants had to select objects based on instructions from two speakers. In the critical condition, listeners sought an object with a negative reference such as not the jellyfish. We aimed to determine the extent to which listeners rely on the linguistic input, common ground or both. We found that initial interpretations were based on linguistic processing only and that common ground considerations do emerge but only after 1000 ms. Our findings support the idea that-at least temporally-linguistic processing can be isolated from common ground
A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction
Picking up objects requested by a human user is a common task in human-robot
interaction. When multiple objects match the user's verbal description, the
robot needs to clarify which object the user is referring to before executing
the action. Previous research has focused on perceiving user's multimodal
behaviour to complement verbal commands or minimising the number of follow up
questions to reduce task time. In this paper, we propose a system for reference
disambiguation based on visualisation and compare three methods to disambiguate
natural language instructions. In a controlled experiment with a YuMi robot, we
investigated real-time augmentations of the workspace in three conditions --
mixed reality, augmented reality, and a monitor as the baseline -- using
objective measures such as time and accuracy, and subjective measures like
engagement, immersion, and display interference. Significant differences were
found in accuracy and engagement between the conditions, but no differences
were found in task time. Despite the higher error rates in the mixed reality
condition, participants found that modality more engaging than the other two,
but overall showed preference for the augmented reality condition over the
monitor and mixed reality conditions
A Review of Verbal and Non-Verbal Human-Robot Interactive Communication
In this paper, an overview of human-robot interactive communication is
presented, covering verbal as well as non-verbal aspects of human-robot
interaction. Following a historical introduction, and motivation towards fluid
human-robot communication, ten desiderata are proposed, which provide an
organizational axis both of recent as well as of future research on human-robot
communication. Then, the ten desiderata are examined in detail, culminating to
a unifying discussion, and a forward-looking conclusion
Referring Expression Comprehension: A Survey of Methods and Datasets
Referring expression comprehension (REC) aims to localize a target object in
an image described by a referring expression phrased in natural language.
Different from the object detection task that queried object labels have been
pre-defined, the REC problem only can observe the queries during the test. It
thus more challenging than a conventional computer vision problem. This task
has attracted a lot of attention from both computer vision and natural language
processing community, and several lines of work have been proposed, from
CNN-RNN model, modular network to complex graph-based model. In this survey, we
first examine the state of the art by comparing modern approaches to the
problem. We classify methods by their mechanism to encode the visual and
textual modalities. In particular, we examine the common approach of joint
embedding images and expressions to a common feature space. We also discuss
modular architectures and graph-based models that interface with structured
graph representation. In the second part of this survey, we review the datasets
available for training and evaluating REC systems. We then group results
according to the datasets, backbone models, settings so that they can be fairly
compared. Finally, we discuss promising future directions for the field, in
particular the compositional referring expression comprehension that requires
longer reasoning chain to address.Comment: Accepted to IEEE TM
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
Understanding human social interactions and developing a robot that can
smoothly communicate with human users in the long term, requires an
understanding of the dynamics of symbol systems and is crucially important. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe some state-of-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and a
double articulation analysis, that enable a robot to obtain words and their
embodied meanings from raw sensory--motor information, including visual
information, haptic information, auditory information, and acoustic speech
signals, in a totally unsupervised manner. Finally, we suggest future
directions of research in SER.Comment: submitted to Advanced Robotic
- …