2,810 research outputs found

    Spoken Language Interaction with Robots: Recommendations for Future Research

    Get PDF
    With robotics rapidly advancing, more effective human–robot interaction is increasingly needed to realize the full potential of robots for society. While spoken language must be part of the solution, our ability to provide spoken language interaction capabilities is still very limited. In this article, based on the report of an interdisciplinary workshop convened by the National Science Foundation, we identify key scientific and engineering advances needed to enable effective spoken language interaction with robotics. We make 25 recommendations, involving eight general themes: putting human needs first, better modeling the social and interactive aspects of language, improving robustness, creating new methods for rapid adaptation, better integrating speech and language with other communication modalities, giving speech and language components access to rich representations of the robot’s current knowledge and state, making all components operate in real time, and improving research infrastructure and resources. Research and development that prioritizes these topics will, we believe, provide a solid foundation for the creation of speech-capable robots that are easy and effective for humans to work with

    Early Turn-taking Prediction with Spiking Neural Networks for Human Robot Collaboration

    Full text link
    Turn-taking is essential to the structure of human teamwork. Humans are typically aware of team members' intention to keep or relinquish their turn before a turn switch, where the responsibility of working on a shared task is shifted. Future co-robots are also expected to provide such competence. To that end, this paper proposes the Cognitive Turn-taking Model (CTTM), which leverages cognitive models (i.e., Spiking Neural Network) to achieve early turn-taking prediction. The CTTM framework can process multimodal human communication cues (both implicit and explicit) and predict human turn-taking intentions in an early stage. The proposed framework is tested on a simulated surgical procedure, where a robotic scrub nurse predicts the surgeon's turn-taking intention. It was found that the proposed CTTM framework outperforms the state-of-the-art turn-taking prediction algorithms by a large margin. It also outperforms humans when presented with partial observations of communication cues (i.e., less than 40% of full actions). This early prediction capability enables robots to initiate turn-taking actions at an early stage, which facilitates collaboration and increases overall efficiency.Comment: Submitted to IEEE International Conference on Robotics and Automation (ICRA) 201

    Building Embodied Conversational Agents:Observations on human nonverbal behaviour as a resource for the development of artificial characters

    Get PDF
    "Wow this is so cool!" This is what I most probably yelled, back in the 90s, when my first computer program on our MSX computer turned out to do exactly what I wanted it to do. The program contained the following instruction: COLOR 10(1.1) After hitting enter, it would change the screen color from light blue to dark yellow. A few years after that experience, Microsoft Windows was introduced. Windows came with an intuitive graphical user interface that was designed to allow all people, so also those who would not consider themselves to be experienced computer addicts, to interact with the computer. This was a major step forward in human-computer interaction, as from that point forward no complex programming skills were required anymore to perform such actions as adapting the screen color. Changing the background was just a matter of pointing the mouse to the desired color on a color palette. "Wow this is so cool!". This is what I shouted, again, 20 years later. This time my new smartphone successfully skipped to the next song on Spotify because I literally told my smartphone, with my voice, to do so. Being able to operate your smartphone with natural language through voice-control can be extremely handy, for instance when listening to music while showering. Again, the option to handle a computer with voice instructions turned out to be a significant optimization in human-computer interaction. From now on, computers could be instructed without the use of a screen, mouse or keyboard, and instead could operate successfully simply by telling the machine what to do. In other words, I have personally witnessed how, within only a few decades, the way people interact with computers has changed drastically, starting as a rather technical and abstract enterprise to becoming something that was both natural and intuitive, and did not require any advanced computer background. Accordingly, while computers used to be machines that could only be operated by technically-oriented individuals, they had gradually changed into devices that are part of many people’s household, just as much as a television, a vacuum cleaner or a microwave oven. The introduction of voice control is a significant feature of the newer generation of interfaces in the sense that these have become more "antropomorphic" and try to mimic the way people interact in daily life, where indeed the voice is a universally used device that humans exploit in their exchanges with others. The question then arises whether it would be possible to go even one step further, where people, like in science-fiction movies, interact with avatars or humanoid robots, whereby users can have a proper conversation with a computer-simulated human that is indistinguishable from a real human. An interaction with a human-like representation of a computer that behaves, talks and reacts like a real person would imply that the computer is able to not only produce and understand messages transmitted auditorily through the voice, but also could rely on the perception and generation of different forms of body language, such as facial expressions, gestures or body posture. At the time of writing, developments of this next step in human-computer interaction are in full swing, but the type of such interactions is still rather constrained when compared to the way humans have their exchanges with other humans. It is interesting to reflect on how such future humanmachine interactions may look like. When we consider other products that have been created in history, it sometimes is striking to see that some of these have been inspired by things that can be observed in our environment, yet at the same do not have to be exact copies of those phenomena. For instance, an airplane has wings just as birds, yet the wings of an airplane do not make those typical movements a bird would produce to fly. Moreover, an airplane has wheels, whereas a bird has legs. At the same time, an airplane has made it possible for a humans to cover long distances in a fast and smooth manner in a way that was unthinkable before it was invented. The example of the airplane shows how new technologies can have "unnatural" properties, but can nonetheless be very beneficial and impactful for human beings. This dissertation centers on this practical question of how virtual humans can be programmed to act more human-like. The four studies presented in this dissertation all have the equivalent underlying question of how parts of human behavior can be captured, such that computers can use it to become more human-like. Each study differs in method, perspective and specific questions, but they are all aimed to gain insights and directions that would help further push the computer developments of human-like behavior and investigate (the simulation of) human conversational behavior. The rest of this introductory chapter gives a general overview of virtual humans (also known as embodied conversational agents), their potential uses and the engineering challenges, followed by an overview of the four studies

    Building Embodied Conversational Agents:Observations on human nonverbal behaviour as a resource for the development of artificial characters

    Get PDF
    "Wow this is so cool!" This is what I most probably yelled, back in the 90s, when my first computer program on our MSX computer turned out to do exactly what I wanted it to do. The program contained the following instruction: COLOR 10(1.1) After hitting enter, it would change the screen color from light blue to dark yellow. A few years after that experience, Microsoft Windows was introduced. Windows came with an intuitive graphical user interface that was designed to allow all people, so also those who would not consider themselves to be experienced computer addicts, to interact with the computer. This was a major step forward in human-computer interaction, as from that point forward no complex programming skills were required anymore to perform such actions as adapting the screen color. Changing the background was just a matter of pointing the mouse to the desired color on a color palette. "Wow this is so cool!". This is what I shouted, again, 20 years later. This time my new smartphone successfully skipped to the next song on Spotify because I literally told my smartphone, with my voice, to do so. Being able to operate your smartphone with natural language through voice-control can be extremely handy, for instance when listening to music while showering. Again, the option to handle a computer with voice instructions turned out to be a significant optimization in human-computer interaction. From now on, computers could be instructed without the use of a screen, mouse or keyboard, and instead could operate successfully simply by telling the machine what to do. In other words, I have personally witnessed how, within only a few decades, the way people interact with computers has changed drastically, starting as a rather technical and abstract enterprise to becoming something that was both natural and intuitive, and did not require any advanced computer background. Accordingly, while computers used to be machines that could only be operated by technically-oriented individuals, they had gradually changed into devices that are part of many people’s household, just as much as a television, a vacuum cleaner or a microwave oven. The introduction of voice control is a significant feature of the newer generation of interfaces in the sense that these have become more "antropomorphic" and try to mimic the way people interact in daily life, where indeed the voice is a universally used device that humans exploit in their exchanges with others. The question then arises whether it would be possible to go even one step further, where people, like in science-fiction movies, interact with avatars or humanoid robots, whereby users can have a proper conversation with a computer-simulated human that is indistinguishable from a real human. An interaction with a human-like representation of a computer that behaves, talks and reacts like a real person would imply that the computer is able to not only produce and understand messages transmitted auditorily through the voice, but also could rely on the perception and generation of different forms of body language, such as facial expressions, gestures or body posture. At the time of writing, developments of this next step in human-computer interaction are in full swing, but the type of such interactions is still rather constrained when compared to the way humans have their exchanges with other humans. It is interesting to reflect on how such future humanmachine interactions may look like. When we consider other products that have been created in history, it sometimes is striking to see that some of these have been inspired by things that can be observed in our environment, yet at the same do not have to be exact copies of those phenomena. For instance, an airplane has wings just as birds, yet the wings of an airplane do not make those typical movements a bird would produce to fly. Moreover, an airplane has wheels, whereas a bird has legs. At the same time, an airplane has made it possible for a humans to cover long distances in a fast and smooth manner in a way that was unthinkable before it was invented. The example of the airplane shows how new technologies can have "unnatural" properties, but can nonetheless be very beneficial and impactful for human beings. This dissertation centers on this practical question of how virtual humans can be programmed to act more human-like. The four studies presented in this dissertation all have the equivalent underlying question of how parts of human behavior can be captured, such that computers can use it to become more human-like. Each study differs in method, perspective and specific questions, but they are all aimed to gain insights and directions that would help further push the computer developments of human-like behavior and investigate (the simulation of) human conversational behavior. The rest of this introductory chapter gives a general overview of virtual humans (also known as embodied conversational agents), their potential uses and the engineering challenges, followed by an overview of the four studies

    Modelling the relationship between gesture motion and meaning

    Get PDF
    There are many ways to say “Hello,” be it a wave, a nod, or a bow. We greet others not only with words, but also with our bodies. Embodied communication permeates our interactions. A fist bump, thumbs-up, or pat on the back can be even more meaningful than hearing “good job!” A friend crossing their arms with a scowl, turning away from you, or stiffening up can feel like a harsh rejection. Social communication is not exclusively linguistic, but is a multi-sensory affair. It’s not that communication without these bodily cues is impossible, but it is impoverished. Embodiment is a fundamental human experience. Expressing ourselves through our bodies provides a powerful channel through which we express a plethora of meta-social information. And integral to communication, expression, and social engagement is our utilization of conversational gesture. We use gestures to express extra-linguistic information, to emphasize our point, and to embody mental and linguistic metaphors that add depth and color to social interaction. The gesture behaviour of virtual humans when compared to human-human conversation is limited, depending on the approach taken to automate performances of these characters. The generation of nonverbal behaviour for virtual humans can be approximately classified as either: 1) data-driven approaches that learn a mapping from aspects of the verbal channel, such as prosody, to gestures; or 2) rule bases approaches that are often tailored by designers for specific applications. This thesis is an interdisciplinary exploration that bridges these two approaches, and brings data-driven analyses to observational gesture research. By marrying a rich history of gesture research in behavioral psychology with data-driven techniques, this body of work brings rigorous computational methods to gesture classification, analysis, and generation. It addresses how researchers can exploit computational methods to make virtual humans gesture with the same richness, complexity, and apparent effortlessness as you and I. Throughout this work the central focus is on metaphoric gestures. These gestures are capable of conveying rich, nuanced, multi-dimensional meaning, and raise several challenges in their generation, including establishing and interpreting a gesture’s communicative meaning, and selecting a performance to convey it. As such, effectively utilizing these gestures remains an open challenge in virtual agent research. This thesis explores how metaphoric gestures are interpreted by an observer, how one can generate such rich gestures using a mapping between utterance meaning and gesture, as well as how one can use data driven techniques to explore the mapping between utterance and metaphoric gestures. The thesis begins in Chapter 1 by outlining the interdisciplinary space of gesture research in psychology and generation in virtual agents. It then presents several studies that address presupposed assumptions raised about the need for rich, metaphoric gestures and the risk of false implicature when gestural meaning is ignored in gesture generation. In Chapter 2, two studies on metaphoric gestures that embody multiple metaphors argue three critical points that inform the rest of the thesis: that people form rich inferences from metaphoric gestures, these inferences are informed by cultural context and, more importantly, that any approach to analyzing the relation between utterance and metaphoric gesture needs to take into account that multiple metaphors may be conveyed by a single gesture. A third study presented in Chapter 3 highlights the risk of false implicature and discusses this in the context of current subjective evaluations of the qualitative influence of gesture on viewers. Chapters 4 and 5 then present a data-driven analysis approach to recovering an interpretable explicit mapping from utterance to metaphor. The approach described in detail in Chapter 4 clusters gestural motion and relates those clusters to the semantic analysis of associated utterance. Then, Chapter 5 demonstrates how this approach can be used both as a framework for data-driven techniques in the study of gesture as well as form the basis of a gesture generation approach for virtual humans. The framework used in the last two chapters ties together the main themes of this thesis: how we can use observational behavioral gesture research to inform data-driven analysis methods, how embodied metaphor relates to fine-grained gestural motion, and how to exploit this relationship to generate rich, communicatively nuanced gestures on virtual agents. While gestures show huge variation, the goal of this thesis is to start to characterize and codify that variation using modern data-driven techniques. The final chapter of this thesis reflects on the many challenges and obstacles the field of gesture generation continues to face. The potential for applications of Virtual Agents to have broad impacts on our daily lives increases with the growing pervasiveness of digital interfaces, technical breakthroughs, and collaborative interdisciplinary research efforts. It concludes with an optimistic vision of applications for virtual agents with deep models of non-verbal social behaviour and their potential to encourage multi-disciplinary collaboration

    ICMI'12:Proceedings of the ACM SIGCHI 14th International Conference on Multimodal Interaction

    Get PDF
    corecore