Character expression for spoken dialogue systems with semi-supervised learning using Variational Auto-Encoder
The character of spoken dialogue systems is important not only for giving a positive impression of the system but also for gaining rapport from users. We have proposed a character expression model for spoken dialogue systems. The model expresses three character traits (extroversion, emotional instability, and politeness) of spoken dialogue systems by controlling spoken dialogue behaviors: utterance amount, backchannel, filler, and switching pause length. One major problem in training this model is that it is costly and time-consuming to collect a large amount of paired data of character traits and behaviors. To address this problem, semi-supervised learning is proposed based on a variational auto-encoder that exploits both the limited amount of labeled pair data and unlabeled corpus data. It was confirmed that the proposed model can express given characters more accurately than a baseline model with only supervised learning. We also implemented the character expression model in a spoken dialogue system for an autonomous android robot, and then conducted a subjective experiment with 75 university students to confirm the effectiveness of the character expression for specific dialogue scenarios. The results showed that expressing a character in accordance with the dialogue task by the proposed model improves the user’s impression of the appropriateness in formal dialogue such as job interviews.
Can a robot laugh with you?: Shared laughter generation for empathetic spoken dialogue
Development of a conversational robot that laughs together with people: toward conversational AI that empathizes and coexists with people. Kyoto University press release. 2022-09-29.
Spoken dialogue systems must be able to express empathy to achieve natural interaction with human users. However, laughter generation requires a high level of dialogue understanding. Thus, implementing laughter in existing systems, such as in conversational robots, has been challenging. As a first step toward solving this problem, rather than generating laughter from user dialogue, we focus on “shared laughter,” where a user laughs using either solo or speech laughs (initial laugh), and the system laughs in turn (response laugh). The proposed system consists of three models: 1) initial laugh detection, 2) shared laughter prediction, and 3) laugh type selection. We trained each model using a human-robot speed-dating dialogue corpus. For the first model, a recurrent neural network was applied, and the detection performance achieved an F1 score of 82.6%. The second model used the acoustic and prosodic features of the initial laugh and achieved a prediction accuracy above that of the random prediction. The third model selects the type of the system’s response laugh as either a social or a mirthful laugh based on the same features of the initial laugh. We then implemented the full shared laughter generation system in an attentive listening dialogue system and conducted a dialogue listening experiment. The proposed system improved the impression of the dialogue system, such as empathy perception, compared to a naive baseline without laughter and a reactive system that always responded with only social laughs. We propose that our system can be used for situated robot interaction and also emphasize the need for integrating proper empathetic laughs into conversational robots and agents.
Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors
This paper tackles the challenging task of evaluating socially situated
conversational robots and presents a novel objective evaluation approach that
relies on multimodal user behaviors. In this study, our main focus is on
assessing the human-likeness of the robot as the primary evaluation metric.
While previous research often relied on subjective evaluations from users, our
approach aims to evaluate the robot's human-likeness based on observable user
behaviors indirectly, thus enhancing objectivity and reproducibility. To begin,
we created an annotated dataset of human-likeness scores, utilizing user
behaviors found in an attentive listening dialogue corpus. We then conducted an
analysis to determine the correlation between multimodal user behaviors and
human-likeness scores, demonstrating the feasibility of our proposed
behavior-based evaluation method.
Comment: Accepted by 25th ACM International Conference on Multimodal
Interaction (ICMI '23), Late-Breaking Result
Conversational AI and Knowledge Graphs for Social Robot Interaction
The paper describes an approach that combines work from three fields with previously separate research communities: social robotics, conversational AI, and graph databases. The aim is to develop a generic framework in which a variety of social robots can provide high-quality information to users by accessing semantically-rich knowledge graphs about multiple different domains. An example implementation uses a Furhat robot with Rasa open source conversational AI and knowledge graphs in Neo4j graph databases. Peer reviewed
Task and User Adaptation Based on Character Expression for Spoken Dialogue Systems (音声対話システムのためのキャラクタ表現に基づくタスク・ユーザ適応)
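Kyoto University, new-system course doctorate. Doctor of Informatics. Degree No. Kō 24728; Informatics Doctorate No. 816; shelf mark 新制||情||137 (Main Library). Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. Examiners: Prof. Tatsuya Kawahara (chief examiner), Prof. Takatsune Kumada, Prof. Sadao Kurohashi. Conferred under Article 4, Paragraph 1 of the Degree Regulations. Doctor of Informatics, Kyoto University. DFA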
Building and Designing Expressive Speech Synthesis
We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact, support and mediate our social relationships 1) with each other, 2) with digital information, and, increasingly, 3) with AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future “spoken language will provide a natural conversational interface between human beings and so-called intelligent systems.” [Moore 2017, p. 283]. A considerable amount of previous research work has tested this assumption with mixed results. However, as has been pointed out, “voice interfaces have become notorious for fostering frustration and failure” [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech.
Furthermore, when designing an interaction with a SIA, we must not only consider how SIAs should speak but how much, and whether they should even speak at all. These considerations cannot be ignored. Any speech synthesis that is used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters) as well as vocal style, emotion and intonation during an interaction (within-speaker parameters) need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus, expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and are fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development.
Yet, before these are expanded upon we must first try and define what we actually mean by expressive speech
- …