A corpus-based analysis of route instructions in human-robot interaction
This paper investigates how users employ spatial descriptions to navigate a speech-enabled robot. We created a simulated environment in which users gave route instructions in a dialogic real-time interaction with a robot, which was operated by naïve participants. The robot's monitoring ability was also manipulated across two experimental conditions. The results provide evidence that the content of the instructions and the strategies of the users vary depending on the conditions and demands of the interaction. As expected, the route instructions were frequently underspecified and arbitrary. The findings of this study elucidate the complexity of interpreting spatial language in HRI. However, they also point to the need to endow mobile robots with richer dialogue resources to compensate for the uncertainties arising from language as well as from the environment.
Visual Complexity and Its Effects on Referring Expression Generation
Speakers' perception of a visual scene influences the language they use to describe it: which objects they choose to mention and how they characterize the relationships between them. We show that visual complexity can either delay or facilitate description generation, depending on how much disambiguating information is required and how useful the scene's complexity can be in providing, for example, helpful landmarks. To do so, we measure speech onset times, eye gaze, and utterance content in a reference production experiment in which the target object is either unique or non-unique in a visual scene of varying size and complexity. Speakers delay speech onset if the target object is non-unique and requires disambiguation, and we argue that this reflects the cost of deciding on a high-level strategy for describing it. The eye-tracking data demonstrate that these delays increase when the speaker is able to conduct an extensive early visual search, implying that when a speaker scans too little of the scene early on, they may decide to begin speaking before becoming aware that their description is underspecified. Speakers' content choices reflect the visual makeup of the scene: the number of distractors present and the availability of useful landmarks. Our results highlight the complex role of visual perception in reference production, showing that speakers can make good use of complexity in ways that reflect their visual processing of the scene.
Zoom: a corpus of natural language descriptions of map locations
This paper describes an experiment to elicit referring expressions from human subjects for research in natural language generation and related fields, and preliminary results of a computational model for the generation of these expressions. Unlike existing resources of this kind, the resulting data set, the Zoom corpus of natural language descriptions of map locations, takes into account a domain that is significantly closer to real-world applications than what has been considered in previous work, and addresses more complex situations of reference, including contexts with different levels of detail, and instances of singular and plural reference produced by speakers of Spanish and Portuguese.
Fil: Altamirano, Ivana Romina. Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía y Física; Argentina.
Fil: Ferreira, Thiago. Universidade de São Paulo, Escola de Artes, Ciências e Humanidades; Brasil.
Fil: Paraboni, Ivandré. Universidade de São Paulo, Escola de Artes, Ciências e Humanidades; Brasil.
Fil: Benotti, Luciana. Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía y Física; Argentina.
Ciencias de la Computación
Spatial Relations and Natural-Language Semantics for Indoor Scenes
Over the past 15 years, there have been increased efforts to represent and communicate spatial information about entities within indoor environments. Automated annotation of information about indoor environments is needed for natural-language processing tasks such as spatially anchoring events, tracking objects in motion, describing scenes, and interpreting thematic places in relation to confirmed locations. Descriptions of indoor scenes often require a fine granularity of spatial information about the meaning of natural-language spatial utterances to improve human-computer interactions and applications for the retrieval of spatial information. The development needs of these systems provide a rationale as to why, despite an extensive body of research in spatial cognition and spatial linguistics, it is still necessary to investigate basic understandings of how humans conceptualize and communicate about objects and structures in indoor space. This thesis investigates the alignment of conceptual spatial relations and natural-language (NL) semantics in the representation of indoor space. The foundation of this work is grounded in spatial information theory as well as spatial cognition and spatial linguistics. In order to better understand how to align computational models and NL expressions about indoor space, this dissertation used an existing dataset of indoor scene descriptions to investigate patterns in entity identification, spatial relations, and spatial preposition use within vista-scale indoor settings. Three human-subject experiments were designed and conducted within virtual indoor environments. These experiments investigate the alignment of human-subject NL expressions for a subset of conceptual spatial relations (contact, disjoint, and part-of) within a controlled virtual environment.
Each scene was designed to focus participant attention on a single relation depicted in the scene and to elicit a spatial preposition term(s) describing the focal relationship. The major results of this study are the identification of object and structure categories, spatial relationships, and patterns of spatial preposition use in the indoor scene descriptions that were consistent across open-response, closed-response, and ranking-type items. There appeared to be a strong preference for describing scene objects in relation to the structural objects that bound the room depicted in the indoor scenes. Furthermore, for each of the three relations (contact, disjoint, and part-of), a small set of spatial prepositions emerged that were strongly preferred by participants at statistically significant levels, based on the overall frequency of response, image sorting, and ranking judgments. The use of certain spatial prepositions to describe relations between room structures suggests there may be differences in how indoor vista-scale space is understood relative to tabletop and geographic scales. Finally, an indoor scene description corpus was developed as a product of this work, which should provide researchers with new human-subject-based datasets for training NL algorithms used to generate more accurate and intuitive NL descriptions of indoor scenes.
The Role of Perception in Situated Spatial Reference
This position paper sets out the argument that an interesting avenue for the exploration and study of universals and variation in spatial reference is to address this topic in terms of the universals in human perception and attention, and to explore how these universals shape spatial reference across cultures and languages.
Augmenting Situated Spoken Language Interaction with Listener Gaze
Collaborative task solving in a shared environment requires referential success. Human speakers follow the listener's behavior in order to monitor language comprehension (Clark, 1996). Furthermore, a natural language generation (NLG) system can exploit listener gaze to realize an effective interaction strategy by responding to it with verbal feedback in virtual environments (Garoufi, Staudte, Koller, & Crocker, 2016). We augment situated spoken language interaction with listener gaze and investigate its role in human-human and human-machine interactions. Firstly, we evaluate its impact on the prediction of reference resolution using a multimodal corpus collected in virtual environments. Secondly, we explore if and how a human speaker uses listener gaze in an indoor guidance task while spontaneously referring to real-world objects in a real environment. Thirdly, we consider an object identification task for assembly under system instruction. We developed a multimodal interactive system and two NLG systems that integrate listener gaze into their generation mechanisms. The NLG system "Feedback" reacts to gaze with verbal feedback, either underspecified or contrastive. The NLG system "Installments" uses gaze to refer to an object incrementally, in the form of installments. Our results showed that gaze features improved the accuracy of automatic prediction of reference resolution. Further, we found that human speakers are very good at producing referring expressions, and showing listener gaze did not improve performance but elicited more negative feedback. In contrast, we showed that an NLG system that exploits listener gaze benefits the listener's understanding. Specifically, combining a short, ambiguous instruction with contrastive feedback resulted in faster interactions compared to underspecified feedback, and even outperformed following long, unambiguous instructions.
Moreover, alternating the underspecified and contrastive responses in an interleaved manner led to better engagement with the system and efficient information uptake, and resulted in equally good performance. Somewhat surprisingly, when gaze was incorporated more indirectly in the generation procedure and used to trigger installments, the non-interactive approach that outputs an instruction all at once was more effective. However, if the spatial expression was mentioned first, referring in gaze-driven installments was as efficient as following an exhaustive instruction. In sum, we provide a proof of concept that listener gaze can be used effectively in situated human-machine interaction. An assistance system using gaze cues is more attentive and adapts to listener behavior to ensure communicative success.
Reference Production as Search: The Impact of Domain Size on the Production of Distinguishing Descriptions
When producing a description of a target referent in a visual context, speakers need to choose a set of properties that distinguish it from its distractors. Computational models of language production/generation usually model this as a search process and predict that the time taken will increase both with the number of distractors in a scene and with the number of properties required to distinguish the target. These predictions are reminiscent of classic findings in visual search; however, unlike models of reference production, visual search models also predict that search can become very efficient under certain conditions, something that reference production models do not consider. This paper investigates the predictions of these models empirically. In two experiments, we show that the time taken to plan a referring expression, as reflected by speech onset latencies, is influenced by distractor set size and by the number of properties required, but that this crucially depends on the discriminability of the properties under consideration. We discuss the implications for current models of reference production and for recent work on the role of salience in visual search.
What is not where: the challenge of integrating spatial representations into deep learning architectures
This paper examines to what degree current deep learning architectures for image caption generation capture spatial language. On the basis of the evaluation of examples of generated captions from the literature, we argue that systems capture what objects are in the image data but not where these objects are located: the captions generated by these systems are the output of a language model conditioned on the output of an object detector that cannot capture fine-grained location information. Although language models provide useful knowledge for image captions, we argue that deep learning image captioning architectures should also model geometric relations between objects.
Comment: 15 pages, 10 figures. Appears in CLASP Papers in Computational Linguistics Vol 1: Proceedings of the Conference on Logic and Machine Learning in Natural Language (LaML 2017), pp. 41-5