221 research outputs found

Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor

Author: A Stolcke
A Waibel
J Niehues
K Laskowski
M Schroder
N Srivastava
N Ward
X Glorot
Publication date: 02/06/2017

Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide feedback from the listener to the speaker, indicating that the speaker is still being listened to. BCs can be expressed in different ways depending on the modality of the interaction, for example as gestures or acoustic cues. In this work, we considered only acoustic cues. We propose an approach to detecting BC opportunities based on acoustic input features such as power and pitch. While other works in the field rely on a hand-written rule set or specialized features, we used artificial neural networks, which are capable of deriving higher-order features from the input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline in comparison to our previously proposed setup. We then extended this setup with Long Short-Term Memory (LSTM) networks, which have been shown to outperform feed-forward setups on various tasks. Our best system achieved an F1-score of 0.37 using power and pitch features. Adding linguistic information using word2vec increased the score to 0.39.
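For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score is derived from prediction counts. The function names and the example counts are hypothetical and are not taken from the paper; the abstract does not state how predicted backchannels were matched to reference backchannels, so the counts are simply assumed to be given.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) from raw counts, using 0.0 when a denominator is zero."""
    predicted = true_pos + false_pos   # everything the system predicted
    actual = true_pos + false_neg      # everything present in the reference
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper).
p, r = precision_recall(true_pos=3, false_pos=1, false_neg=2)
example_f1 = f1_score(p, r)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.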

arXiv.org e-Print Archive

Crossref

Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions

Author: Li Chia-Yu
Ortega Daniel
Vu Ngoc Thang
Publication date: 10/04/2023

This paper presents our latest investigation on modeling backchannels in conversations. Motivated by a proactive backchanneling theory, we aim to develop a system that acts as a proactive listener by inserting backchannels, such as continuers and assessments, to influence speakers. Our model not only takes lexical and acoustic cues into account, but also introduces the simple and novel idea of using listener embeddings to mimic different backchanneling behaviours. Our experimental results on the Switchboard benchmark dataset reveal that acoustic cues are more important than lexical cues in this task, and that their combination with listener embeddings works best on both manual transcriptions and automatically generated transcriptions. Comment: Published in ICASSP 202

arXiv.org e-Print Archive

Modeling Speaker-Listener Interaction for Backchannel Prediction

Author: Meyer Sarina
Ortega Daniel
Schweitzer Antje
Vu Ngoc Thang
Publication date: 10/04/2023

We present our latest findings on backchannel modeling, motivated in a novel way by the canonical use of the minimal responses Yeah and Uh-huh in English and their corresponding tokens in German, and by the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, the listener's effects on the speaker's subsequent talk, and the resulting dynamic speaker-listener interaction. Therefore, we propose a neural acoustic backchannel classifier for minimal responses that processes acoustic features from the speaker's speech, captures and imitates listeners' backchanneling behavior, and encodes the speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets reveal that, in almost all tested scenarios, the speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a proper interaction encoding strategy, i.e., combining the speaker and listener embeddings, leads to the best performance on both datasets in terms of F1-score. Comment: Published in IWSDS 202

arXiv.org e-Print Archive

Character expression for spoken dialogue systems with semi-supervised learning using Variational Auto-Encoder

Author: Inoue Koji
Kawahara Tatsuya
Yamamoto Kenta
Publication venue: Elsevier BV
Publication date: 01/04/2023

The character of a spoken dialogue system is important not only for giving a positive impression of the system but also for building rapport with users. We have proposed a character expression model for spoken dialogue systems. The model expresses three character traits (extroversion, emotional instability, and politeness) by controlling four spoken dialogue behaviors: utterance amount, backchannels, fillers, and switching pause length. One major problem in training this model is that collecting large amounts of paired data of character traits and behaviors is costly and time-consuming. To address this problem, we propose semi-supervised learning based on a variational auto-encoder that exploits both the limited amount of labeled paired data and unlabeled corpus data. We confirmed that the proposed model can express given characters more accurately than a baseline model trained with supervised learning only. We also implemented the character expression model in a spoken dialogue system for an autonomous android robot and conducted a subjective experiment with 75 university students to confirm the effectiveness of the character expression for specific dialogue scenarios. The results showed that expressing a character suited to the dialogue task with the proposed model improves the user’s impression of appropriateness in formal dialogue such as job interviews.

Kyoto University Research Information Repository

Iterative Perceptual Learning for Social Behavior Synthesis

Author: de Kok I.A.
Heylen Dirk K.J.
Poppe Ronald Walter
Publication venue: Centre for Telematics and Information Technology (CTIT)
Publication date: 01/02/2012

We introduce Iterative Perceptual Learning (IPL), a novel approach for learning computational models for social behavior synthesis from corpora of human-human interactions. The IPL approach combines perceptual evaluation with iterative model refinement. Human observers rate the appropriateness of synthesized individual behaviors in the context of a conversation. These ratings are in turn used to refine the machine learning models. As the ratings correspond to those moments in the conversation where the production of a specific social behavior is inappropriate, we can regard features extracted at these moments as negative samples for the training of a machine learning classifier. This is an advantage over traditional corpus-based approaches, in which negative samples are extracted at random from moments in the conversation where the specific social behavior does not occur. We perform a comparison between the IPL approach and the traditional corpus-based approach on the timing of backchannels for a listener in speaker-listener dialogs. While both models perform similarly in terms of precision and recall scores, the results of the IPL model are rated as more appropriate in the perceptual evaluation. We additionally investigate the effect of the amount of available training data and the variation of training data on the outcome of the models.
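The refinement step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `TrainingSet` and `add_rated_negatives`, the numeric rating scale, and the threshold are hypothetical, and the abstract does not specify the classifier, the features, or how ratings are turned into negative samples.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingSet:
    """Feature vectors for moments with (positives) and without (negatives) the behavior."""
    positives: list = field(default_factory=list)
    negatives: list = field(default_factory=list)


def add_rated_negatives(training: TrainingSet,
                        synthesized_features: list,
                        ratings: list,
                        threshold: float) -> int:
    """Add the features of every synthesized behavior rated below `threshold`
    to the negative samples, and return how many were added.

    Assumption: a lower rating means observers judged the behavior less appropriate.
    """
    if len(synthesized_features) != len(ratings):
        raise ValueError("each synthesized behavior needs exactly one rating")
    added = 0
    for features, rating in zip(synthesized_features, ratings):
        if rating < threshold:
            training.negatives.append(features)
            added += 1
    return added


# Hypothetical data: positives come from the corpus, negatives from observer ratings.
training = TrainingSet(positives=[[0.9, 0.1], [0.8, 0.3]])
n_added = add_rated_negatives(
    training,
    synthesized_features=[[0.2, 0.7], [0.85, 0.2], [0.1, 0.9]],
    ratings=[1.0, 4.5, 2.0],
    threshold=3.0,
)
```

In a full loop, a classifier would be retrained on `training` after each round and its new output rated again; that part is omitted here because the abstract gives no details about it.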

University of Twente Research Information

The MultiLis Corpus - Dealing with Individual Differences in Nonverbal Listening Behavior

Author: de Kok I.A.
Heylen Dirk K.J.
Publication venue: Springer
Publication date: 01/07/2011

University of Twente Research Information

Backchannel relevance spaces

Author: Anna Hjalmarsson
Jens Edlund
Mattias Heldner
Publication date: 01/01/2013

This contribution introduces backchannel relevance spaces – intervals where it is relevant for a listener in a conversation to produce a backchannel. By annotating and comparing actual visual and vocal backchannels with potential backchannels established using a group of subjects acting as third-party listeners, we show (i) that visual-only backchannels represent a substantial proportion of all backchannels; and (ii) that there are more opportunities for backchannels (i.e. potential backchannels or backchannel relevance spaces) than there are actual vocal and visual backchannels. These findings indicate that backchannel relevance spaces enable more accurate acoustic, prosodic, lexical (et cetera) descriptions of backchannel-inviting cues than descriptions based on the context of actual vocal backchannels only.

CiteSeerX

Publikationer från KTH

Publikationer från Stockholms universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Backchannel Strategies for Artificial Listeners

Author: Heylen Dirk K.J.
Poppe Ronald Walter
Reidsma Dennis
Truong Khiet Phuong
Publication venue: Springer
Publication date: 01/01/2010

We evaluate multimodal rule-based strategies for backchannel (BC) generation in face-to-face conversations. Such strategies can be used by artificial listeners to determine when to produce a BC in dialogs with human speakers. In this research, we consider features from the speaker’s speech and gaze. We used six rule-based strategies to determine the placement of BCs. The BCs were performed by an intelligent virtual agent using nods and vocalizations. In a user perception experiment, participants were shown video fragments of a human speaker together with an artificial listener who produced BC behavior according to one of the strategies. Participants were asked to rate how likely they thought it was that the BC behavior had been performed by a human listener. We found that the number, timing and type of BCs had a significant effect on how human-like the BC behavior was perceived to be.

University of Twente Research Information