15 research outputs found
An End-to-End Conversational Style Matching Agent
We present an end-to-end voice-based conversational agent that is able to
engage in naturalistic multi-turn dialogue and align with the interlocutor's
conversational style. The system uses a series of deep neural network
components for speech recognition, dialogue generation, prosodic analysis and
speech synthesis to generate language and prosodic expression with qualities
that match those of the user. We conducted a user study (N=30) in which
participants talked with the agent for 15 to 20 minutes, resulting in over 8
hours of natural interaction data. Users with high-consideration conversational
styles reported the agent to be more trustworthy when it matched their
conversational style, whereas users with high-involvement conversational
styles were indifferent. Finally, we provide design guidelines for multi-turn
dialogue interactions using conversational style adaptation.
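The pipeline described above (speech recognition, dialogue generation, prosodic analysis, and prosody-matched synthesis) can be sketched as function composition. This is a minimal illustrative sketch, not the paper's implementation; every function name and return value here is a placeholder assumption.

```python
# Hypothetical sketch of the described agent pipeline; all stage names
# and return values are illustrative placeholders, not the paper's API.

def transcribe(audio: bytes) -> str:
    """Speech recognition stage (placeholder)."""
    return "hello there"

def analyze_prosody(audio: bytes) -> dict:
    """Extract prosodic style features, e.g., pitch and speech rate (placeholder)."""
    return {"pitch_hz": 180.0, "rate_syl_per_s": 4.2}

def generate_reply(text: str) -> str:
    """Dialogue generation stage (placeholder)."""
    return "hi, how can I help?"

def synthesize(text: str, prosody: dict) -> bytes:
    """Speech synthesis conditioned on the user's prosody (placeholder)."""
    return text.encode()

def respond(audio: bytes) -> bytes:
    """One turn: transcribe, measure the user's style, reply in a matched style."""
    text = transcribe(audio)
    prosody = analyze_prosody(audio)   # style features to match
    reply = generate_reply(text)
    return synthesize(reply, prosody)
```

The key design point the abstract implies is that prosodic analysis runs on the *user's* audio and conditions the agent's synthesis, which is what distinguishes style matching from a plain ASR-to-TTS loop.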
A High-Fidelity Open Embodied Avatar with Lip Syncing and Expression Capabilities
Embodied avatars as virtual agents have many applications and provide
benefits over disembodied agents, allowing non-verbal social and interactional
cues to be leveraged, in a similar manner to how humans interact with each
other. We present an open embodied avatar built upon the Unreal Engine that can
be controlled via a simple Python programming interface. The avatar has lip
syncing (phoneme control), head gesture and facial expression (using either
facial action units or cardinal emotion categories) capabilities. We release
code and models to illustrate how the avatar can be controlled like a puppet or
used to create a simple conversational agent using public application
programming interfaces (APIs). GitHub link:
https://github.com/danmcduff/AvatarSim
Comment: International Conference on Multimodal Interaction (ICMI 2019)
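The puppet-style control the abstract describes (phoneme-driven lip syncing, head gestures, and FACS action-unit expressions) might look roughly like the sketch below. This is an assumption-laden illustration: the class and method names are invented for exposition and are not the actual AvatarSim API.

```python
# Hypothetical puppet-control sketch; class and method names are
# illustrative assumptions, NOT the real AvatarSim interface.
from dataclasses import dataclass, field

@dataclass
class AvatarPuppet:
    """Queues control commands that a bridge process would forward to the engine."""
    commands: list = field(default_factory=list)

    def speak_phoneme(self, phoneme: str, weight: float = 1.0) -> None:
        # lip syncing: drive the mouth shape for one phoneme
        self.commands.append(("phoneme", phoneme, weight))

    def set_action_unit(self, au: int, intensity: float) -> None:
        # facial expression via FACS action units (AU12 is the lip-corner puller)
        self.commands.append(("au", au, intensity))

    def nod(self, times: int = 1) -> None:
        # head gesture
        self.commands.append(("head", "nod", times))

avatar = AvatarPuppet()
avatar.speak_phoneme("AA")
avatar.set_action_unit(12, 0.8)   # smile-like expression
avatar.nod()
```

Supporting both action-unit control and cardinal emotion categories, as the abstract notes, lets low-level puppeteering and high-level emotion commands coexist behind one interface.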
Affect between Humans and Conversational Agents: A Review and Organizing Frameworks
Conversational agents (CAs), which communicate naturally with humans, are being developed and employed for a variety of tasks. Interactions between humans and CAs induce affect, which is vital to the adoption and performance of CAs. Yet, there is a lack of cumulative understanding of existing research on affect in human-CA interaction. Motivated thus, this article presents a systematic review of empirical IS and HCI studies on such affect, its antecedents, and its consequences. Besides conducting a descriptive analysis of the studies, we also divide them into two broad categories: those that are emotion-related, and those related to other (more persistent) affective responses. We present organizing frameworks for both categories, which complement each other. Through the review and frameworks, we contribute towards attaining a holistic understanding of extant research on human-CA interaction, identifying gaps in prior knowledge, and outlining future research directions. Lastly, we describe our plan for extending this work to gain additional insights.
Affect between Humans and Conversational Agents: A Review and Future Research Directions
Conversational agents (CAs) are digital artifacts which communicate with humans through natural language to perform a variety of tasks. When humans interact with CAs, affective responses, e.g., emotions, are seen to influence both CA adoption and performance. Despite the vital role of affect in human-CA interaction, understanding of its antecedents, affective outcomes, and their relationships from prior research is nascent and segregated. Motivated thus, we review empirical studies on affect in human-CA interactions, classify them in terms of affective outcomes, and identify corresponding antecedents. We further analyze the relationships between antecedents and outcomes and highlight significant relationships as well as inconsistent findings. Drawing on the review, we propose future research directions for this area. We contribute by developing a deeper understanding of research on affect between humans and CAs, identifying gaps in prior knowledge, and outlining future research directions. Additionally, we lay out our plans for extending this work.
ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation
Image-grounded dialogue systems benefit greatly from integrating visual
information, resulting in high-quality response generation. However, current
models struggle to effectively utilize such information in zero-resource
scenarios, mainly due to the disparity between image and text modalities. To
overcome this challenge, we propose an innovative multimodal framework, called
ZRIGF, which assimilates image-grounded information for dialogue generation in
zero-resource situations. ZRIGF implements a two-stage learning strategy,
comprising contrastive pre-training and generative pre-training. Contrastive
pre-training includes a text-image matching module that maps images and texts
into a unified encoded vector space, along with a text-assisted masked image
modeling module that preserves pre-training visual features and fosters further
multimodal feature alignment. Generative pre-training employs a multimodal
fusion module and an information transfer module to produce insightful
responses based on harmonized multimodal representations. Comprehensive
experiments conducted on both text-based and image-grounded dialogue datasets
demonstrate ZRIGF's efficacy in generating contextually pertinent and
informative responses. Furthermore, we adopt a fully zero-resource scenario in
the image-grounded dialogue dataset to demonstrate our framework's robust
generalization capabilities in novel domains. The code is available at
https://github.com/zhangbo-nlp/ZRIGF.
Comment: ACM Multimedia 2023 Accepted, Repo: https://github.com/zhangbo-nlp/ZRIGF
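The contrastive pre-training stage described above, a text-image matching module that maps both modalities into a unified encoded vector space, is commonly realized with a CLIP-style symmetric contrastive objective. The sketch below shows that objective over pre-computed embeddings; it is a generic illustration under that assumption, not ZRIGF's actual loss, and the shapes and temperature value are assumptions.

```python
# Generic CLIP-style text-image matching loss (illustrative, not ZRIGF's code).
# img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
import numpy as np

def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N); matched pairs on the diagonal
    labels = np.arange(len(logits))

    def xent(l: np.ndarray) -> float:
        # cross-entropy with the diagonal as the target class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return float(-np.log(p[labels, labels]).mean())

    # symmetric: image-to-text (rows) plus text-to-image (columns)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart in the shared space, which is what enables zero-resource grounding: at inference, unseen images can be matched against text via nearest neighbors in that space.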