
    A transparent framework towards the context-sensitive recognition of conversational engagement

    Modelling and recognising affective and mental user states is a pressing topic in multiple research fields. This work proposes an approach to the adequate recognition of such states by combining state-of-the-art behaviour recognition classifiers in a transparent and explainable modelling framework that also allows contextual aspects to be considered in the inference process. More precisely, in this paper we exemplify the idea of our framework with the recognition of conversational engagement in bi-directional conversations. We introduce a multi-modal annotation scheme for conversational engagement. We further introduce our hybrid approach that combines the accuracy of state-of-the-art machine learning techniques, such as deep learning, with the capabilities of Bayesian Networks, which are inherently interpretable and offer an important capability that modern approaches lack: causal inference. In an evaluation on a large multi-modal corpus of bi-directional conversations, we show that this hybrid approach can even outperform state-of-the-art black-box approaches by considering context information and causal relations.
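
    The fusion idea can be illustrated with a minimal sketch: a black-box classifier supplies a per-frame engagement score, which is then combined with a context-dependent prior through Bayes' rule. The context labels, prior values, and function names below are assumptions made for illustration, not the paper's actual model or data.

        # Minimal sketch (assumed names and priors): fuse a neural classifier's
        # engagement probability with a context-dependent prior via Bayes' rule.

        PRIOR_ENGAGED = {            # hypothetical P(engaged | context)
            "partner_speaking": 0.35,
            "partner_silent": 0.65,
        }

        def fuse_engagement(classifier_prob_engaged: float, context: str) -> float:
            """Posterior P(engaged | classifier output, context) under this toy model."""
            prior = PRIOR_ENGAGED[context]
            joint_engaged = classifier_prob_engaged * prior
            joint_not_engaged = (1.0 - classifier_prob_engaged) * (1.0 - prior)
            return joint_engaged / (joint_engaged + joint_not_engaged)

        # A weakly confident classifier output is pushed up by a favourable context.
        print(fuse_engagement(0.55, "partner_silent"))  # ~0.69

    A full Bayesian network would add further context and behaviour nodes, but the same conditioning principle applies.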

    Modeling Speaker-Listener Interaction for Backchannel Prediction

    We present our latest findings on backchannel modeling, newly motivated by the canonical use of the minimal responses Yeah and Uh-huh in English and their corresponding tokens in German, and by the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, the listener's effect on the speaker's subsequent talk, and the resulting dynamic speaker-listener interaction. We therefore propose a neural acoustic backchannel classifier for minimal responses that processes acoustic features from the speaker's speech, captures and imitates listeners' backchanneling behavior, and encodes the speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a proper interaction encoding strategy, i.e., combining the speaker and listener embeddings, leads to the best performance on both datasets in terms of F1 score.
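
    A rough sketch of the interaction-encoding idea is shown below: acoustic features from the speaker's speech are concatenated with learned speaker and listener embeddings before classification. The layer sizes, feature dimensionality, and class layout are assumptions for illustration; this is not the authors' published architecture.

        import torch
        import torch.nn as nn

        class BackchannelClassifier(nn.Module):
            """Toy classifier: acoustic features + speaker/listener embeddings -> backchannel or not."""

            def __init__(self, n_acoustic=40, n_participants=100, emb_dim=16, hidden=64):
                super().__init__()
                self.speaker_emb = nn.Embedding(n_participants, emb_dim)
                self.listener_emb = nn.Embedding(n_participants, emb_dim)
                self.net = nn.Sequential(
                    nn.Linear(n_acoustic + 2 * emb_dim, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, 2),  # classes: backchannel, no backchannel
                )

            def forward(self, acoustic, speaker_id, listener_id):
                # Interaction encoding: concatenate both participants' embeddings with the acoustics.
                x = torch.cat([acoustic, self.speaker_emb(speaker_id), self.listener_emb(listener_id)], dim=-1)
                return self.net(x)

        model = BackchannelClassifier()
        logits = model(torch.randn(8, 40), torch.randint(0, 100, (8,)), torch.randint(0, 100, (8,)))
        print(logits.shape)  # torch.Size([8, 2])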

    What Makes Conversation Good? How Responsivity, Topics, and Insider Language Predict Feelings of Connection

    We spend our lives having conversations, and some inevitably go better than others. What happens in conversation that makes people feel connected? To explore this question, I recorded pairs of strangers and friends having unstructured conversations. In Chapter 1, I show that people who feel connected tend to respond quickly, creating short gaps between turns. However, long gaps are not necessarily bad. Although long gaps signal moments of disconnection and awkwardness for strangers, they mark moments of heightened connection for friends by providing space for enjoyment and mutual reflection. In Chapter 2, I examine how people start their conversations. Specifically, how do people who have never met before initiate their first interaction? And how do these approaches differ from those of people who are already robustly connected? I find that strangers start their conversations more similarly to one another than friends do. In particular, strangers tend to open with topics that can easily branch into many other topics, increasing the likelihood of finding common ground. Friends do not need to rely on this strategy and can instead immediately start their conversations with topics idiosyncratic to their relationship. In Chapter 3, I highlight another fundamental difference in how friends and strangers communicate by exploring the use of insider language, or words that carry specific meaning for some people but not others. I find that friends use insider language more than strangers, and when they do, they feel more connected. When people know each other well, communication can move from spoken words to shared thoughts. Together, these findings reveal that people feel closer when they can respond quickly in conversation and feel comfortable not speaking, and that being able to jump right into a conversation and communicate using shorthand are hallmarks of friendship.
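
    The turn-gap measure in Chapter 1 can be sketched as follows: given speaker-labelled turns with start and end times, the gap is the silence (or overlap, if negative) between one speaker's turn ending and the other's beginning. The turn representation and example values here are illustrative assumptions, not the dissertation's exact operationalisation.

        from typing import List, Tuple

        Turn = Tuple[str, float, float]  # (speaker, start_time_s, end_time_s)

        def turn_gaps(turns: List[Turn]) -> List[float]:
            """Gaps in seconds between consecutive turns by different speakers (negative = overlap)."""
            gaps = []
            for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
                if spk_a != spk_b:
                    gaps.append(start_b - end_a)
            return gaps

        # A quick response (0.25 s) followed by a long pause (1.5 s).
        print(turn_gaps([("A", 0.0, 2.0), ("B", 2.25, 4.0), ("A", 5.5, 7.0)]))  # [0.25, 1.5]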

    Head movement in conversation

    This work explores the function and form of head movement, and specifically head nods, in free conversation. It opens with a comparison of three theories that are often considered triggers for head nods: mimicry, backchannel responses, and responses to speakers' trouble. Early in this work it is assumed that head nods are well defined in terms of movement, and that they can be directly attributed to, or at least better explained by, one theory than the others. To test this, comparisons between the theories are conducted following two different approaches. In one set of experiments, a novel virtual reality method enables the analysis of the perceived plausibility of head nods generated by models inspired by these theories. The results suggest that participants could not consciously assess differences between the predictions of the different theories. In part, this is due to a mixture of gamification and study design challenges. In addition, these experiments raise the question of whether it is reasonable to expect people to consciously process and report issues with the non-verbal behaviour of their conversational partners. In a second set of experiments, the predictions of the theories are compared directly to head nods that are automatically detected from motion capture data. Matching the predictions with automatically detected head nods showed not only that most predictions are wrong, but also that most of the detected head nods are not accounted for by any of the theories under question. While these experiments do not adequately answer which theory best describes head nods in conversation, they suggest new avenues to explore: are head nods well defined, in the sense that multiple people will agree that a specific motion is a head nod? And if so, what are their movement characteristics and how much do they depend on conversational context? Exploring these questions revealed a complex picture of what people consider to be head nods and their reliance on context. First, agreement on what constitutes a head nod is moderate, even when annotators are presented with video snippets that include only automatically detected nods. Second, head nods share movement characteristics with other behaviours, specifically laughter. Lastly, head nods are more accurately defined by their semantic characteristics than by their movement properties, suggesting that future detectors should incorporate more contextual features than movement alone. Overall, this thesis questions the coherence of our intuitive notion of a head nod and the adequacy of current approaches to describing the movements involved. It shows how some of the common theories that describe head movement and nods fail to explain most head movement in free conversation. In addition, it highlights subtleties in head movement and nods that are often overlooked. The findings from this work can inform the development of future head nod detection approaches and provide a better understanding of non-verbal communication in general.
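
    A movement-only nod detector of the kind the thesis evaluates against can be sketched as a peak picker on head pitch from motion capture. The sign convention (downward rotation as negative pitch), thresholds, and frame rate below are assumptions for illustration, not the detector actually used in the thesis.

        import numpy as np
        from scipy.signal import find_peaks

        def detect_nod_candidates(pitch_deg: np.ndarray, fps: float = 60.0,
                                  min_amplitude_deg: float = 3.0) -> np.ndarray:
            """Return frame indices of candidate nods: prominent downward pitch excursions."""
            # Downward rotation is assumed to be negative, so nods appear as peaks in -pitch.
            peaks, _ = find_peaks(-pitch_deg, prominence=min_amplitude_deg,
                                  distance=int(0.2 * fps))  # at most ~5 nod peaks per second
            return peaks

        # Synthetic head-pitch track with two quick downward dips at 0.5 s and 1.2 s.
        t = np.arange(0.0, 2.0, 1.0 / 60.0)
        pitch = -5 * np.exp(-((t - 0.5) ** 2) / 0.005) - 4 * np.exp(-((t - 1.2) ** 2) / 0.005)
        print(detect_nod_candidates(pitch) / 60.0)  # approximately [0.5, 1.2] seconds

    The thesis' finding that movement alone is insufficient suggests such a detector would flag many non-nod movements (e.g. laughter) and miss context-dependent nods.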

    Towards an architectural framework for intelligent virtual agents using probabilistic programming

    We present a new framework called KorraAI for conceiving and building embodied conversational agents (ECAs). Our framework models ECA behavior while taking into account contextual information, for example about the environment and interaction time, as well as uncertain information provided by the human interaction partner. Moreover, agents built with KorraAI can show proactive behavior, as they can initiate interactions with human partners. For these purposes, KorraAI exploits probabilistic programming. Probabilistic models in KorraAI are used to model the agent's behavior and its interactions with the user. They enable adaptation to the user's preferences and a certain degree of indeterminism in the ECAs to achieve more natural behavior. Human-like internal states, such as moods, preferences, and emotions (e.g., surprise), can be modeled in KorraAI with distributions and Bayesian networks. These models can evolve over time, even without interaction with the user. ECA models are implemented as plugins and share a common interface. This enables ECA designers to focus more on the character they are modeling and less on the technical details, as well as to store and exchange ECA models. Several applications of KorraAI ECAs are possible, such as virtual sales agents, customer service agents, virtual companions, entertainers, or tutors.
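
    The flavour of the probabilistic-behaviour idea can be sketched as follows: a latent mood variable biases which action the agent samples, and the mood itself drifts over time even without user input. The mood labels, action probabilities, and drift rate are invented for illustration and are not part of the KorraAI API.

        import numpy as np

        rng = np.random.default_rng(0)

        MOODS = ["cheerful", "neutral", "tired"]
        ACTION_PROBS = {  # hypothetical P(action | mood)
            "cheerful": {"tell_joke": 0.5, "ask_question": 0.4, "stay_quiet": 0.1},
            "neutral":  {"tell_joke": 0.2, "ask_question": 0.5, "stay_quiet": 0.3},
            "tired":    {"tell_joke": 0.1, "ask_question": 0.3, "stay_quiet": 0.6},
        }

        def step(mood):
            """Sample the agent's next proactive action, then let the mood drift slightly."""
            actions, probs = zip(*ACTION_PROBS[mood].items())
            action = str(rng.choice(actions, p=probs))
            if rng.random() < 0.1:  # the internal state evolves even without user interaction
                mood = str(rng.choice(MOODS))
            return mood, action

        mood = "neutral"
        for _ in range(3):
            mood, action = step(mood)
            print(mood, action)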

    Building Embodied Conversational Agents: Observations on human nonverbal behaviour as a resource for the development of artificial characters

    "Wow this is so cool!" This is what I most probably yelled, back in the 90s, when my first computer program on our MSX computer turned out to do exactly what I wanted it to do. The program contained the following instruction: COLOR 10(1.1) After hitting enter, it would change the screen color from light blue to dark yellow. A few years after that experience, Microsoft Windows was introduced. Windows came with an intuitive graphical user interface that was designed to allow all people, so also those who would not consider themselves to be experienced computer addicts, to interact with the computer. This was a major step forward in human-computer interaction, as from that point forward no complex programming skills were required anymore to perform such actions as adapting the screen color. Changing the background was just a matter of pointing the mouse to the desired color on a color palette. "Wow this is so cool!". This is what I shouted, again, 20 years later. This time my new smartphone successfully skipped to the next song on Spotify because I literally told my smartphone, with my voice, to do so. Being able to operate your smartphone with natural language through voice-control can be extremely handy, for instance when listening to music while showering. Again, the option to handle a computer with voice instructions turned out to be a significant optimization in human-computer interaction. From now on, computers could be instructed without the use of a screen, mouse or keyboard, and instead could operate successfully simply by telling the machine what to do. In other words, I have personally witnessed how, within only a few decades, the way people interact with computers has changed drastically, starting as a rather technical and abstract enterprise to becoming something that was both natural and intuitive, and did not require any advanced computer background. Accordingly, while computers used to be machines that could only be operated by technically-oriented individuals, they had gradually changed into devices that are part of many people’s household, just as much as a television, a vacuum cleaner or a microwave oven. The introduction of voice control is a significant feature of the newer generation of interfaces in the sense that these have become more "antropomorphic" and try to mimic the way people interact in daily life, where indeed the voice is a universally used device that humans exploit in their exchanges with others. The question then arises whether it would be possible to go even one step further, where people, like in science-fiction movies, interact with avatars or humanoid robots, whereby users can have a proper conversation with a computer-simulated human that is indistinguishable from a real human. An interaction with a human-like representation of a computer that behaves, talks and reacts like a real person would imply that the computer is able to not only produce and understand messages transmitted auditorily through the voice, but also could rely on the perception and generation of different forms of body language, such as facial expressions, gestures or body posture. At the time of writing, developments of this next step in human-computer interaction are in full swing, but the type of such interactions is still rather constrained when compared to the way humans have their exchanges with other humans. It is interesting to reflect on how such future humanmachine interactions may look like. 
When we consider other products that have been created in history, it sometimes is striking to see that some of these have been inspired by things that can be observed in our environment, yet at the same do not have to be exact copies of those phenomena. For instance, an airplane has wings just as birds, yet the wings of an airplane do not make those typical movements a bird would produce to fly. Moreover, an airplane has wheels, whereas a bird has legs. At the same time, an airplane has made it possible for a humans to cover long distances in a fast and smooth manner in a way that was unthinkable before it was invented. The example of the airplane shows how new technologies can have "unnatural" properties, but can nonetheless be very beneficial and impactful for human beings. This dissertation centers on this practical question of how virtual humans can be programmed to act more human-like. The four studies presented in this dissertation all have the equivalent underlying question of how parts of human behavior can be captured, such that computers can use it to become more human-like. Each study differs in method, perspective and specific questions, but they are all aimed to gain insights and directions that would help further push the computer developments of human-like behavior and investigate (the simulation of) human conversational behavior. The rest of this introductory chapter gives a general overview of virtual humans (also known as embodied conversational agents), their potential uses and the engineering challenges, followed by an overview of the four studies

    Simulating realistic multiparty speech data: for the development of distant microphone ASR systems

    Automatic speech recognition has become a ubiquitous technology integrated into our daily lives. However, the problem remains challenging when the speaker is far away from the microphone. In such scenarios, the speech is degraded both by reverberation and by the presence of additive noise. The situation is particularly challenging when competing speakers are present (i.e. multi-party scenarios). Acoustic scene simulation has been a major tool for training and developing distant microphone speech recognition systems, and is now being used to develop solutions for multi-party scenarios. It has been used both in training, as it allows cheap generation of limitless amounts of data, and for evaluation, because it can provide easy access to a ground truth (i.e. a noise-free target signal). However, whilst much work has been conducted to produce realistic artificial scene simulators, the signals produced by such simulators are only as good as the 'metadata' used to define the setups, i.e. the data describing, for example, the number of speakers and their distribution relative to the microphones. This thesis looks at how realistic metadata can be derived by analysing how speakers behave in real domestic environments. In particular, it examines how to produce scenes that provide a realistic distribution for various factors that are known to influence the 'difficulty' of the scene, including the separation angle between speakers, the absolute and relative distances of speakers to microphones, and the pattern of temporal overlap of speech. Using an existing audio-visual multi-party conversational dataset, CHiME-5, each of these aspects has been studied in turn. First, producing a realistic angular separation between speakers allows algorithms that enhance signals based on the direction of arrival to be fairly evaluated, reducing the mismatch between real and simulated data. This separation was estimated using automatic people detection techniques in video recordings from CHiME-5. Results show that commonly used datasets of simulated signals do not follow a realistic distribution, and that when a realistic distribution is enforced, a significant drop in performance is observed. Second, using multiple cameras made it possible to estimate the 2-D positions of people inside each scene. This has allowed the estimation of realistic distributions for the absolute distance to the microphone and the relative distance to the competing speaker. The results show grouping behaviour among participants when located in a room, and the impact this has on performance depends on the room size considered. Finally, the amount of overlap and the points in the mixture that contain overlap were explored using finite-state models. These models allowed mixtures to be generated that approach the overlap patterns observed in the real data. Features derived from these models were also shown to be a predictor of the difficulty of the mixture. At each stage of the project, simulated datasets derived using the realistic metadata distributions have been compared to existing standard datasets that use naive or uninformed metadata distributions, and the implications for speech recognition performance are observed and discussed. This work has demonstrated how unrealistic approaches can produce over-promising results and can bias research towards techniques that might not work well in practice. The results will also be valuable in informing the design of future simulated datasets.
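
    One of the metadata statistics discussed, the angular separation between two speakers as seen from a microphone, can be computed from 2-D room positions as sketched below. The coordinates and function name are illustrative assumptions; the thesis estimates positions automatically from CHiME-5 video rather than assuming them.

        import numpy as np

        def separation_angle(mic_xy, spk1_xy, spk2_xy):
            """Angular separation (degrees) between two speakers, viewed from a microphone at mic_xy."""
            v1 = np.asarray(spk1_xy, dtype=float) - np.asarray(mic_xy, dtype=float)
            v2 = np.asarray(spk2_xy, dtype=float) - np.asarray(mic_xy, dtype=float)
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

        # Two speakers roughly 45 degrees apart as seen from a microphone at the origin (metres).
        print(separation_angle((0.0, 0.0), (1.0, 0.0), (1.5, 1.5)))  # approximately 45.0

    Sampling speaker positions so that this angle follows the distribution observed in real recordings, rather than a naive uniform one, is the kind of correction the thesis argues for.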

    A review of theories and methods in the science of face-to-face social interaction

    For most of human history, face-to-face interactions have been the primary and most fundamental way to build social relationships, and even in the digital era they remain the basis of our closest bonds. These interactions are built on the dynamic integration and coordination of verbal and non-verbal information between multiple people. However, the psychological processes underlying face-to-face interaction remain difficult to study. In this Review, we discuss three ways the multimodal phenomena underlying face-to-face social interaction can be organized to provide a solid basis for theory development. Next, we review three types of theory of social interaction: theories that focus on the social meaning of actions, theories that explain actions in terms of simple behaviour rules and theories that rely on rich cognitive models of the internal states of others. Finally, we address how different methods can be used to distinguish between theories, showcasing new approaches and outlining important directions for future research. Advances in how face-to-face social interaction can be studied, combined with a renewed focus on cognitive theories, could lead to a renaissance in social interaction research and advance scientific understanding of face-to-face interaction and its underlying cognitive foundations