55 research outputs found
An End-to-End Conversational Style Matching Agent
We present an end-to-end voice-based conversational agent that is able to
engage in naturalistic multi-turn dialogue and align with the interlocutor's
conversational style. The system uses a series of deep neural network
components for speech recognition, dialogue generation, prosodic analysis and
speech synthesis to generate language and prosodic expression with qualities
that match those of the user. We conducted a user study (N=30) in which
participants talked with the agent for 15 to 20 minutes, resulting in over 8
hours of natural interaction data. Users with a high-consideration conversational style reported the agent to be more trustworthy when it matched their conversational style, whereas users with a high-involvement style were indifferent. Finally, we provide design guidelines for multi-turn dialogue interactions using conversational style adaptation.
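The component chain the abstract describes (speech recognition, dialogue generation, prosodic analysis, speech synthesis) can be sketched as a minimal pipeline. This is an illustrative sketch only: every function name and the stubbed behaviour below are hypothetical stand-ins, not the authors' actual neural components.

```python
# Illustrative sketch of the described pipeline: ASR -> dialogue generation ->
# prosodic analysis -> style-matched speech synthesis. All names are
# hypothetical; the real system uses deep neural networks for each stage.

def recognize_speech(audio: bytes) -> str:
    """Stub ASR: a real system would run a neural acoustic model."""
    return "hello there"

def generate_reply(user_text: str) -> str:
    """Stub dialogue model: a real system conditions on dialogue history."""
    return f"You said: {user_text}"

def analyze_prosody(audio: bytes) -> dict:
    """Stub prosodic analysis: estimate the user's speaking rate and pitch."""
    return {"rate": 1.0, "pitch_hz": 180.0}

def synthesize(text: str, style: dict) -> bytes:
    """Stub TTS: render the reply with prosody matched to the user's style."""
    return f"[rate={style['rate']}, pitch={style['pitch_hz']}] {text}".encode()

def respond(audio: bytes) -> bytes:
    text = recognize_speech(audio)
    reply = generate_reply(text)
    style = analyze_prosody(audio)   # match the interlocutor's prosody
    return synthesize(reply, style)

print(respond(b"...").decode())
```

The key design point the abstract makes is that prosodic analysis of the *user's* turn feeds the synthesis of the *agent's* turn, so style matching happens per exchange rather than per session.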
Building Embodied Conversational Agents: Observations on human nonverbal behaviour as a resource for the development of artificial characters
"Wow this is so cool!" This is what I most probably yelled, back in the 90s, when my first computer program on our MSX computer turned out to do exactly what I wanted it to do. The program contained the following instruction: COLOR 10(1.1) After hitting enter, it would change the screen color from light blue to dark yellow. A few years after that experience, Microsoft Windows was introduced. Windows came with an intuitive graphical user interface designed to allow all people, including those who would not consider themselves experienced computer users, to interact with the computer. This was a major step forward in human-computer interaction, as from that point forward no complex programming skills were required anymore to perform such actions as adapting the screen color. Changing the background was just a matter of pointing the mouse to the desired color on a color palette. "Wow this is so cool!" This is what I shouted, again, 20 years later. This time my new smartphone successfully skipped to the next song on Spotify because I literally told my smartphone, with my voice, to do so. Being able to operate your smartphone with natural language through voice control can be extremely handy, for instance when listening to music while showering. Again, the option to operate a computer with voice instructions turned out to be a significant optimization in human-computer interaction. From then on, computers could be instructed without the use of a screen, mouse or keyboard, and instead could operate successfully simply by being told what to do. In other words, I have personally witnessed how, within only a few decades, the way people interact with computers has changed drastically, from a rather technical and abstract enterprise to something both natural and intuitive that did not require any advanced computer background.
Accordingly, while computers used to be machines that could only be operated by technically oriented individuals, they have gradually changed into devices that are part of many people's households, just as much as a television, a vacuum cleaner or a microwave oven. The introduction of voice control is a significant feature of the newer generation of interfaces in the sense that these have become more "anthropomorphic" and try to mimic the way people interact in daily life, where indeed the voice is a universally used device that humans exploit in their exchanges with others. The question then arises whether it would be possible to go even one step further, where people, as in science-fiction movies, interact with avatars or humanoid robots, and where users can have a proper conversation with a computer-simulated human that is indistinguishable from a real human. An interaction with a human-like representation of a computer that behaves, talks and reacts like a real person would imply that the computer is able not only to produce and understand messages transmitted auditorily through the voice, but also to rely on the perception and generation of different forms of body language, such as facial expressions, gestures or body posture. At the time of writing, developments of this next step in human-computer interaction are in full swing, but such interactions are still rather constrained when compared to the way humans have their exchanges with other humans. It is interesting to reflect on what such future human-machine interactions may look like. When we consider other products that have been created in history, it is sometimes striking to see that some of them have been inspired by things that can be observed in our environment, yet at the same time do not have to be exact copies of those phenomena. For instance, an airplane has wings just as birds do, yet the wings of an airplane do not make those typical movements a bird would produce to fly.
Moreover, an airplane has wheels, whereas a bird has legs. At the same time, the airplane has made it possible for humans to cover long distances in a fast and smooth manner in a way that was unthinkable before it was invented. The example of the airplane shows how new technologies can have "unnatural" properties, but can nonetheless be very beneficial and impactful for human beings. This dissertation centers on the practical question of how virtual humans can be programmed to act more human-like. The four studies presented in this dissertation share the same underlying question of how parts of human behavior can be captured, such that computers can use them to become more human-like. Each study differs in method, perspective and specific questions, but all aim to provide insights and directions that would help further push the development of computer-generated human-like behavior and investigate (the simulation of) human conversational behavior. The rest of this introductory chapter gives a general overview of virtual humans (also known as embodied conversational agents), their potential uses and the engineering challenges, followed by an overview of the four studies.
Designing talk in social networks: What Facebook teaches about conversation
The easy accessibility, ubiquity, and plurilingualism of popular SNSs such as Facebook have inspired many scholars and practitioners of second language teaching and learning to integrate networked forms of communication into educational contexts such as language classrooms and study abroad programs (e.g., Blattner & Fiori, 2011; Lamy & Zourou, 2013; Mills, 2011; Reinhardt & Ryu, 2013; Reinhardt & Zander, 2011). At the same time, the complex and dynamic patterns of interaction that emerge in these spaces quickly push back upon standard ways of describing conversational genres and communicative competence (Kern, 2014; Lotherington & Ronda, 2014). Drawing from an ecological interactional analysis (Goffman, 1964, 1981a, 1981b, 1986; Kramsch & Whiteside, 2008) of the Facebook communications of three German-speaking academics whose social and professional lives are largely led in English, the authors consider the kinds of symbolic maneuvers required to participate in the translingual conversational flows of SNS-mediated communication. Based on this analysis, this article argues that texts generated through SNS-mediated communication can provide classroom opportunities for critical, stylistically sensitive reflection on the nature of talk in line with multiliteracies approaches.
What really happens during study abroad? An in-depth analysis of learners' interactions during a short-term sojourn in Spain
We often hear assumptions about students who study abroad: Students learn a foreign language faster, students interact more with native speakers, students are immersed in the target language, and host families provide endless target-language practice. Universities and private companies in the United States promote this scenario as the best way to learn a language, and this preference is reflected in the growing number of U.S. students studying abroad, with over 60 percent taking part in programs that last for eight or fewer weeks.
The goal of this study is to investigate these assumptions by examining the interactional practices of study-abroad students using the target language. We posed three research questions: (1) What types of interactions do students have outside of the classroom? (2) How do learners interact with their host families? (3) How do technology and social media affect students' immersion experience?
The results show that short-term study-abroad programs, as viable language-learning contexts, are in jeopardy. Students in this program had few meaningful interactions with people in the target community. Conversations with host families proved to be deficient in many areas (e.g., students' erroneous utterances were not corrected). The output of students during mealtime conversations proved to be linguistically poor, as students used few communication strategies. Furthermore, students used their electronic devices to communicate with friends and family, and to keep up with television shows and music from the U.S. They did not use social media websites and mobile applications to communicate with people in the target community, nor did they use them to interact with the target language.
Based on the results of this study, we call for a revamping of study-abroad promotion. If 100 percent immersion no longer exists, study-abroad promoters should change the way in which short-term programs are marketed. Additionally, if we want short-term study abroad to actively foster language learning, foreign language acquisition researchers, study-abroad program designers, and language instructors need to design programs centered on meaningful student-native speaker interactions. Study abroad is a tool that, if used correctly, can be the catalyst that changes the trajectory of students' language-learning lives.
Mining Behavior of Citizen Sensor Communities to Improve Cooperation with Organizational Actors
Web 2.0 (social media) provides a natural platform for the dynamic emergence of citizen (as) sensor communities, where citizens generate content to share information and engage in discussions. Such a citizen sensor community (CSC) has stated or implied goals that are helpful in the work of formal organizations, such as an emergency management unit, for prioritizing their response needs. This research addresses questions related to the design of a cooperative system of organizations and citizens in CSC. Prior research by social scientists in limited offline and online environments has provided a foundation for research on cooperative behavior challenges, including 'articulation' and 'awareness', but Web 2.0-supported CSC offers new challenges as well as opportunities. A CSC presents information overload for the organizational actors, especially in finding reliable information providers (for awareness) and finding actionable information in the data generated by citizens (for articulation). We also note three data-level challenges: ambiguity in interpreting unconstrained natural language text, sparsity of user behaviors, and diversity of user demographics. Interdisciplinary research involving the social and computer sciences is essential to address these socio-technical issues. I present a novel web information-processing framework, called the Identify-Match-Engage (IME) framework. IME allows operationalizing computation in the design problems of awareness and articulation of the cooperative system between citizens and organizations, by addressing the data problems of group engagement modeling and intent mining. The IME framework includes: a) identification of cooperation-assistive intent (seeking-offering) from short, unstructured messages using a classification model with declarative, social, and contrast pattern knowledge; b) facilitation of coordination modeling using bipartite matching of complementary intent (seeking-offering); and c) identification of user groups to prioritize for engagement by defining a content-driven measure of 'group discussion divergence'. The use of prior knowledge and the interplay of features of users, content, and network structures efficiently captures context for computing cooperation-assistive behavior (intent and engagement) from unstructured social data in online socio-technical systems. Our evaluation on a use case in the crisis response domain shows improved performance for both intent classification and group engagement prioritization. Real-world applications of this work include the use of the engagement interface tool during recent crises, including the 2014 Jammu and Kashmir floods, and intent classification as a service integrated by the crisis mapping pioneer Ushahidi's CrisisNET project for broader impact.
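The coordination step described above, pairing complementary seeking and offering intents, can be sketched as a toy bipartite matching. The pairing rule (an exact shared resource keyword) and the message IDs below are illustrative assumptions, not the framework's actual matching criteria.

```python
# Toy sketch of matching complementary intents (resource seekers paired with
# resource offerers), in the spirit of the bipartite-matching step described
# above. Matching on an exact resource keyword is an illustrative assumption.

def match_intents(seeking, offering):
    """Greedily pair each seeking message with an unused offering message
    that mentions the same resource; each offer is consumed at most once."""
    matched, used = [], set()
    for s_id, s_resource in seeking:
        for o_id, o_resource in offering:
            if o_id not in used and s_resource == o_resource:
                matched.append((s_id, o_id))
                used.add(o_id)
                break
    return matched

seeking = [("msg1", "water"), ("msg2", "shelter"), ("msg3", "water")]
offering = [("msg4", "water"), ("msg5", "shelter")]
print(match_intents(seeking, offering))
# [('msg1', 'msg4'), ('msg2', 'msg5')]
```

A greedy pass like this can leave matchable pairs unpaired in harder instances; a production system would more plausibly use maximum bipartite matching (e.g., an augmenting-path or assignment algorithm) over softer similarity scores.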
Data-Driven Policy Optimisation for Multi-Domain Task-Oriented Dialogue
Recent developments in machine learning, along with a general shift in public attitudes towards digital personal assistants, have opened new frontiers for conversational systems. Nevertheless, building data-driven multi-domain conversational agents that act optimally given a dialogue context is an open challenge. The first step towards that goal is developing an efficient way of learning a dialogue policy in new domains. Secondly, it is important to have the ability to collect and utilise human-human conversational data to bootstrap an agent's knowledge. The work presented in this thesis demonstrates how a neural dialogue manager fine-tuned with reinforcement learning presents a viable approach for learning a dialogue policy efficiently and across many domains.
The thesis starts by introducing a dialogue management module that learns through interactions to act optimally given a current context of a conversation. The current shift towards neural, parameter-rich systems does not fully address the problem of error noise coming from speech recognition or natural language understanding components. A Bayesian approach is therefore proposed to learn more robust and effective policy management in direct interactions without any prior data. By putting a distribution over model weights, the learning agent is less prone to overfit to particular dialogue realizations and a more efficient exploration policy can be therefore employed. The results show that deep reinforcement learning performs on par with non-parametric models even in a low data regime while significantly reducing the computational complexity compared with the previous state-of-the-art.
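The core idea above, placing a distribution over model weights so that exploration follows the policy's uncertainty, can be illustrated with a Thompson-sampling-style toy. The two-action bandit setting, the Gaussian posteriors, and all numbers below are a deliberate simplification for illustration, not the thesis's actual dialogue manager.

```python
import random

# Minimal illustration of exploration driven by a distribution over weights:
# instead of acting on point estimates, sample a weight per action from its
# posterior and act greedily on the sample (Thompson sampling). The two
# actions are a toy stand-in for a dialogue policy choosing a system act.

random.seed(0)

# Posterior over each action's value, as (mean, std) of a Gaussian.
posterior = {"confirm": (0.6, 0.3), "request": (0.5, 0.3)}

def choose_action():
    """Sample one value per action from its posterior; pick the argmax."""
    samples = {a: random.gauss(mu, sd) for a, (mu, sd) in posterior.items()}
    return max(samples, key=samples.get)

counts = {"confirm": 0, "request": 0}
for _ in range(1000):
    counts[choose_action()] += 1

# Each action is chosen roughly in proportion to its probability of being
# best, so the uncertain runner-up still gets explored instead of being
# starved, which is the overfitting failure mode the paragraph describes.
print(counts)
```

With overlapping posteriors like these, "confirm" wins most draws but "request" is still sampled hundreds of times, which is exactly the less-greedy exploration behaviour a distribution over weights buys.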
The deployment of a dialogue manager without any pre-training on human conversations is not a viable option from an industry perspective. However, the progress in building statistical systems, particularly dialogue managers, is hindered by the scale of data available. To address this fundamental obstacle, a novel data-collection pipeline entirely based on crowdsourcing without the need for hiring professional annotators is introduced. The validation of the approach results in the collection of the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully labeled collection of human-human written conversations spanning over multiple domains and topics. The proposed dataset creates a set of new benchmarks (belief tracking, policy optimisation, and response generation) significantly raising the complexity of analysed dialogues.
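A fully labeled turn of the kind the dataset provides can be pictured as a small record tying together user text, belief state, system acts, and the system response. The field names and values below are a simplified illustration, not the dataset's exact annotation schema.

```python
# A simplified, illustrative record of one fully labeled dialogue turn in the
# style of the dataset described above. Field names and values are invented
# for illustration; the real annotation format is richer.

turn = {
    "user": "I need a cheap restaurant in the centre.",
    "belief_state": {                      # what the system believes so far
        "restaurant": {"pricerange": "cheap", "area": "centre"},
    },
    "system_acts": [                       # dialogue acts behind the reply
        ("Restaurant-Inform", "choice", "15"),
        ("Restaurant-Request", "food", "?"),
    ],
    "system": "There are 15 cheap places in the centre. What food type?",
}

# Each benchmark reads a different slice of such a record: belief tracking
# predicts `belief_state` from the dialogue history, policy optimisation
# predicts `system_acts`, and response generation produces `system`.
print(sorted(turn))
```

Keeping all three annotation layers on the same turn is what lets one corpus back three separate benchmarks.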
The collected dataset serves as a foundation for a novel reinforcement learning (RL)-based approach for training a multi-domain dialogue manager. A Multi-Action and Slot Dialogue Agent (MASDA) is proposed to address two limitations of prior approaches: 1) handling complex multi-domain dialogues with multiple concurrent actions present in a single turn; and 2) a lack of interpretability, which consequently impedes the use of intermediate signals (e.g., dialogue turn annotations) where such signals are available. MASDA explicitly models system acts and slots using intermediate signals, resulting in an improved task-based end-to-end framework. The model can also select concurrent actions in a single turn, thus enriching the representation of the generated responses. The proposed framework allows for RL training on dialogue task completion metrics when dealing with concurrent actions. The results demonstrate the advantages of both 1) handling concurrent actions and 2) exploiting intermediate signals: MASDA outperforms previous end-to-end frameworks while also offering improved scalability.
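Selecting multiple concurrent acts in one turn, as opposed to a softmax over single actions, can be sketched as thresholded multi-label classification. The scores, act names, and threshold below are invented for illustration and are not MASDA's actual architecture.

```python
# Sketch of concurrent-action selection in a single turn: treat system-act
# selection as multi-label classification and keep every act whose score
# clears a threshold, instead of a softmax that forces exactly one act.
# Scores and act names are invented for illustration.

def select_acts(scores, threshold=0.5):
    """Return all acts scoring above the threshold, highest first."""
    chosen = [act for act, s in scores.items() if s > threshold]
    return sorted(chosen, key=lambda act: -scores[act])

scores = {
    "Hotel-Inform[price]": 0.91,
    "Hotel-Request[area]": 0.73,
    "Train-Offer[ticket]": 0.12,
    "general-bye": 0.05,
}
print(select_acts(scores))
# ['Hotel-Inform[price]', 'Hotel-Request[area]']
```

The payoff is that one system turn can both inform and request, mirroring how human operators in the collected dialogues routinely bundle several acts into a single response.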
Don't interrupt me while I'm speaking: Interruption in Everyday and Institutional Settings in Chinese
Interruption is a common phenomenon in conversation. Previous research on interruption has focused on three main aspects: the identification of interruption in relation to overlaps or overlapping speech, the categorisation of cooperative and disruptive interruptions, and the relationship between interruption and certain social factors, for instance, power asymmetry and gender differences. However, little attention has been paid to the degree of intrusiveness. Likewise, not much has been done to explore interactional factors that may intersect with interruptions. With these important research gaps in mind, I aim in this study to explore the relationship between intrusiveness and interactional dimensions of interruptions in the Chinese context. Two sets of conversational data were collected: telephone conversations and TV talk show conversations. The conversation analytic method was used to examine the fine-grained details of speakers' conversational interaction (Haugh, 2012). Statistical methods were used to test the relationship between factors related to interruptions. Results from a linear regression model indicate that, in both settings, speakers tend to heed and boost the current information flow (e.g., supplementing further details) when expressing affiliative stances. More specifically, in the institutional conversations, speakers orient their interruption utterances towards their assigned institutional role and task (Goffman, 1981; Heritage & Greatbatch, 1991). In the telephone conversations, there are frequent early interruptions, affiliative interruptions, and unexpected cases where interrupters align their opinions with the other speaker whilst disrupting the current information flow. Based on what emerged from these analyses, I argue that the Chinese speakers in the two corpora feature a high involvement (Tannen, 2005) conversational style, which means they prioritise the relationship over the task under discussion.
In other words, speakers tend to distinctively emphasise their enthusiasm and engagement with the other speaker, but pay less attention to the one-speaker-at-a-time turn-taking rule (Sacks, Schegloff, & Jefferson, 1974). The finding that Chinese talk-in-interaction is relationship-focused supports the argument that Chinese society largely adheres to a polychronic time orientation (Hall, 1984). This study contributes to CA methodology by combining rigorous quantification methods with close examination of the sequential organisation of interruptions. It is innovative in measuring intrusiveness by incorporating two aspects of interruptions: the interrupter's stance-taking and the interrupter's sequential alignment with the information flow of the prior utterance. In so doing, this study contributes to the understanding of interruption by demonstrating that intrusiveness is a gradient concept on a measurable continuum rather than a binary concept that is either cooperative or intrusive. This study also contributes to the investigation of Chinese talk-in-interaction, particularly speakers' conversational style, by proposing a novel perspective: interruption. Keywords: interruption, intrusiveness, affiliation, information flow, interruption marker, interruption timing, Chinese talk-in-interaction
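The claim that intrusiveness sits on a measurable continuum rather than a cooperative/intrusive binary can be illustrated with a toy score over the two aspects the study measures: the interrupter's stance-taking and their alignment with the prior speaker's information flow. The functional form and weights below are invented for illustration; they are not the study's regression model.

```python
# Toy illustration of intrusiveness as a point on a continuum rather than a
# cooperative/intrusive binary. It combines the two measured aspects:
# affiliative stance and alignment with the prior information flow.
# The equal weighting is an invented assumption, not the study's model.

def intrusiveness(affiliative_stance: float, flow_alignment: float) -> float:
    """Score in [0, 1]: higher means more intrusive.
    Both inputs are in [0, 1] (1 = fully affiliative / fully aligned)."""
    return round(1.0 - 0.5 * (affiliative_stance + flow_alignment), 2)

# An affiliative, flow-boosting interruption sits low on the continuum...
print(intrusiveness(0.9, 0.8))   # 0.15
# ...while a disaffiliative, flow-disrupting one sits high.
print(intrusiveness(0.1, 0.2))   # 0.85
```

Any interruption then lands somewhere between those poles, which is the gradient picture the paragraph argues for.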
My watch begins: identification and procedural rhetoric on second screens and social networks.
Digital rhetoric creates opportunities for examining rhetoric as it evolves daily. This evolution may be described in terms of network circulation and immediate opportunities for publishing and creating. This project analyzes mobile applications and live feeds used during television broadcasts, where rhetoric is closely tied to the work of identifying with another point of view. Producers and designers of dual-screen applications prompt us to answer how we would act if we assumed the role of protagonist and saw the world through her or his eyes. These questions support the idea that identification is not just a relative of empathy or a way to engage emotionally with the text but also a way to approach problems and sharpen observation. From this dissertation's findings we may reconsider the work of seeing and perspectival shifting as part of a sophisticated procedure of reflexive role play and public intellectualism. In addition, the analysis provides information about how mobile devices and second screens work to support consensus and a preferred reading (viewing) of popular narratives and group performances, thereby calling into careful consideration how we use such devices to influence others. Finally, the dissertation's work helps us understand new forms of viral communication and the velocity (Ridolfo and DeVoss) at which they are transmitted. Consequently, we may approach textual artifacts as "living documents" and consider how such "living" properties may change our perceptions of authorship and composing. In Chapter One, "My Watch Begins: Complex Narrative, Transmedia, and Point of View," I begin by offering an overview of my methodological approach to these applications. I situate the work of identification on mobile devices within the larger conversation surrounding transmedia and how it encourages viewers to participate in contemporary television narratives.
This section provides explanations of how the terms procedural rhetoric (as introduced first by Ian Bogost), prosopopoiea (from ancient rhetoric), and point of view (from narrative theory) will function in this project, with most of the attention given to procedural and rhetorical studies of the various programs and websites associated with audience writings. This chapter also calls attention to the difference between empathy and perspective shifting. An example from contemporary culture that helps illustrate this difference and provides space for conversation is the viral blog post "I Am Adam Lanza's Mother." This editorial, written in the aftermath of the Sandy Hook shooting in 2012, features identification techniques used as persuasive tools but does so in a problematic way that might be better handled with a nuanced and careful study of how identification operates in other settings. Central to this project are questions addressing how we discuss and document the acts of viewing/seeing/looking, and in what ways the process of seeing from multiple perspectives is currently being lauded in society and the academy. In Chapter Two, "If You See Something, Say Something: Syncing Audience Viewing and Response," I present two opening examples that illustrate these premises: one from a Walking Dead advertisement that features the protagonist's eye and one from a Department of Homeland Security ad, "If You See Something, Say Something." These examples dovetail into a specific analysis of syncing devices, or dual-screen viewing experiences, and the actual rhetoric accompanying the requests to see from multiple perspectives ("If you were Rick, you would ___"). I also call attention to shows where the act of identifying with the protagonist raises questions about the limitations of perspectives. To be specific, I suggest that the white middle-class male is the paradigm of identification exercises for shows that encourage participation from viewers.
Examples from television suggest that women and minorities are less likely to be the characters with whom we align our interests; therefore, I argue we should interrogate this trend and think reflexively about the act of identifying. In Chapter Three, "Choreographing Conversation through Tagging, Tokens, and Reblogs," I argue that analysis of audience reactions via live feeds and blogging platforms shows that textual artifacts, through increased circulation, promote a certain form of identification through consensus. This consensus reveals the tendency of viewers to gravitate toward preferred readings (viewings) of narratives and to identify with characters closely resembling themselves. By constituting viewers in a rhetoric specific to each fictional world, producers encourage identification and help secure appropriate and largely positive viewer behaviors through conversations online. Specifically, digital activities like "checking in" to a show and writing with specific hashtags become markers of narrative involvement. Producers, in turn, engage in reciprocal action by promoting or displaying fan activity on their own feeds, thereby sponsoring the work of the audience. While such activity often leads to conformity, I argue that these moments of group consensus may act as springboards for future conversation about other perspectives and narrative outcomes. In Chapter Four, "Texts as Bodies, Bodies as Texts: Tumblr Role Play and the Rhetorical Practices of Identification," the rhetorical analysis of these online sites and mobile applications then leads to questions of how we perceive embodiment during identification. In this section I look closely at the writing found on the microblogging site Tumblr, where viewers of television narrative engage in role playing their favorite protagonists and creating dialogue with fellow role players. This practice, operating outside the jurisdiction of producer-designed apps, reveals new patterns of the work of identification.
With attention to the ideas of Katherine Hayles and Deleuze and Guattari, we may reconsider how text, once circulated, acts as an extension of and a replacement for the physical body. Still, the work of these bloggers demonstrates that identification remains a personal investment that refers to and gives credit to the person behind the computer screen. This chapter reveals a productive tension between the embodied author's work and the nature of writing as it moves through networks. In my conclusion I explain how these applications and online tools have implications for the writing classroom. Students are frequently told that good writers and thinkers must see a problem or an issue from multiple perspectives. This project focuses intensely on the work of shifting perspectives and how those perspectives are represented in writing. Its implications for teaching productive source integration and research may be applied to the first-year writing classroom but also to the graduate curriculum, where novice scholars learn to extend, oppose, and ally themselves with the scholars who have come before them.