15 research outputs found

    An End-to-End Conversational Style Matching Agent

    We present an end-to-end voice-based conversational agent that can engage in naturalistic multi-turn dialogue and align with the interlocutor's conversational style. The system uses a series of deep neural network components for speech recognition, dialogue generation, prosodic analysis and speech synthesis to generate language and prosodic expression with qualities that match those of the user. We conducted a user study (N=30) in which participants talked with the agent for 15 to 20 minutes, resulting in over 8 hours of natural interaction data. Users with high-consideration conversational styles reported the agent to be more trustworthy when it matched their conversational style, whereas users with high-involvement conversational styles were indifferent. Finally, we provide design guidelines for multi-turn dialogue interactions using conversational style adaptation.
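
    The abstract describes a pipeline of components (speech recognition, prosodic analysis, dialogue generation, prosody-matched synthesis) but not their interfaces. The sketch below illustrates how such a turn-level loop could be wired together; every class and function name is a hypothetical stand-in, not the authors' implementation.

```python
# Hypothetical sketch of the component chain described in the abstract:
# ASR -> prosodic analysis -> dialogue generation -> prosody-matched TTS.
# All names are illustrative stand-ins, not the authors' actual system.

from dataclasses import dataclass


@dataclass
class ProsodyProfile:
    speech_rate: float   # e.g. syllables per second
    pitch_mean: float    # Hz
    energy: float        # relative loudness


def transcribe(audio: bytes) -> str:
    """Stand-in for the speech-recognition component."""
    return "example user utterance"


def analyze_prosody(audio: bytes) -> ProsodyProfile:
    """Stand-in for the prosodic-analysis component."""
    return ProsodyProfile(speech_rate=4.2, pitch_mean=180.0, energy=0.6)


def generate_reply(history: list[str], user_text: str) -> str:
    """Stand-in for the neural dialogue-generation component."""
    return "example agent reply"


def synthesize(text: str, target: ProsodyProfile) -> bytes:
    """Stand-in for speech synthesis conditioned on the user's prosody."""
    return b"\x00"  # placeholder waveform


def turn(history: list[str], user_audio: bytes) -> bytes:
    """One dialogue turn: match the reply's prosodic style to the user's."""
    user_text = transcribe(user_audio)
    style = analyze_prosody(user_audio)
    reply = generate_reply(history, user_text)
    history.extend([user_text, reply])
    return synthesize(reply, style)


if __name__ == "__main__":
    audio_out = turn([], b"\x00")
```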

    A High-Fidelity Open Embodied Avatar with Lip Syncing and Expression Capabilities

    Embodied avatars as virtual agents have many applications and provide benefits over disembodied agents, allowing non-verbal social and interactional cues to be leveraged, in a similar manner to how humans interact with each other. We present an open embodied avatar built upon the Unreal Engine that can be controlled via a simple Python programming interface. The avatar has lip syncing (phoneme control), head gesture and facial expression (using either facial action units or cardinal emotion categories) capabilities. We release code and models to illustrate how the avatar can be controlled like a puppet or used to create a simple conversational agent using public application programming interfaces (APIs). GitHub link: https://github.com/danmcduff/AvatarSim. Comment: International Conference on Multimodal Interaction (ICMI 2019).
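
    The abstract states the avatar is driven from a simple Python interface with phoneme, head-gesture, and facial-action-unit controls, but does not reproduce that API. Below is a minimal hypothetical controller sketch illustrating that kind of control surface; the message format, class, and method names are assumptions and do not reflect the actual AvatarSim API (see the repository for the real one).

```python
# Hypothetical controller sketch for an Unreal-based avatar of the kind the
# abstract describes (phoneme-driven lip sync, head gestures, facial action
# units). The protocol and names below are illustrative assumptions only.

import json
import socket


class AvatarController:
    """Sends simple JSON control messages to a running avatar process."""

    def __init__(self, host: str = "127.0.0.1", port: int = 5005):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, message: dict) -> None:
        self.sock.sendto(json.dumps(message).encode("utf-8"), self.addr)

    def set_phoneme(self, phoneme: str, weight: float = 1.0) -> None:
        """Drive lip sync with a single phoneme/viseme weight."""
        self._send({"type": "phoneme", "value": phoneme, "weight": weight})

    def set_action_unit(self, au: int, intensity: float) -> None:
        """Set a facial action unit (e.g. AU12, lip-corner puller) intensity."""
        self._send({"type": "au", "id": au, "intensity": intensity})

    def nod(self, amplitude: float = 0.3) -> None:
        """Trigger a simple head-nod gesture."""
        self._send({"type": "head_gesture", "name": "nod", "amplitude": amplitude})


if __name__ == "__main__":
    avatar = AvatarController()
    avatar.set_action_unit(12, 0.8)   # smile
    avatar.set_phoneme("AA", 0.9)     # open-mouth viseme
    avatar.nod()
```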

    Affect between Humans and Conversational Agents: A Review and Organizing Frameworks

    Conversational agents (CAs), which communicate naturally with humans, are being developed and employed for a variety of tasks. Interactions between humans and CAs induce affect, which is vital to the adoption and performance of CAs. Yet, there is a lack of cumulative understanding of existing research on affect in human-CA interaction. Motivated by this gap, this article presents a systematic review of empirical IS and HCI studies on such affect, its antecedents and consequences. Besides conducting a descriptive analysis of the studies, we also divide them into two broad categories: emotion-related studies and those related to other (more persistent) affective responses. We present organizing frameworks for both categories, which complement each other. Through the review and frameworks, we contribute towards attaining a holistic understanding of extant research on human-CA interaction, identifying gaps in prior knowledge, and outlining future research directions. Last, we describe our plan for extending this work to gain additional insights.

    Affect between Humans and Conversational Agents: A Review and Future Research Directions

    Conversational agents (CAs) are digital artifacts which communicate with humans through natural language to perform a variety of tasks. When humans interact with CAs, affective responses (e.g., emotions) are seen to influence both CA adoption and performance. Despite the vital role of affect in human-CA interaction, understanding of its antecedents, affective outcomes, and their relationships from prior research is nascent and fragmented. Motivated by this gap, we review empirical studies on affect in human-CA interactions, classify them in terms of affective outcomes, and identify corresponding antecedents. We further analyze the relationships between antecedents and outcomes and highlight significant relationships as well as inconsistent findings. Drawing on the review, we propose future research directions for this area. We contribute by developing a deeper understanding of research on affect between humans and CAs, identifying gaps in prior knowledge, and outlining future research directions. Additionally, we lay out our plans for extending this work.

    ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

    Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF. Comment: ACM Multimedia 2023 Accepted, Repo: https://github.com/zhangbo-nlp/ZRIGF.
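
    The contrastive pre-training stage is described as a text-image matching module that maps images and texts into a unified encoded vector space. A generic CLIP-style symmetric contrastive loss is one common way to realize such a module; the sketch below illustrates that idea under this assumption and is not taken from the ZRIGF codebase.

```python
# Minimal sketch of a text-image matching objective of the kind the abstract
# describes: paired text/image embeddings are pulled together in a shared
# space with a symmetric InfoNCE loss. Illustrative only, not ZRIGF's code.

import torch
import torch.nn.functional as F


def text_image_matching_loss(text_emb: torch.Tensor,
                             image_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired text/image embeddings.

    text_emb, image_emb: (batch, dim) projections from the two encoders.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature           # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)               # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)           # image -> text
    return (loss_t2i + loss_i2t) / 2


# Example with random embeddings in place of real encoder outputs:
loss = text_image_matching_loss(torch.randn(8, 256), torch.randn(8, 256))
```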