221 research outputs found
Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor
Using supporting backchannel (BC) cues can make human-computer interaction
more social. BCs provide feedback from the listener to the speaker, indicating
that the speaker is still being listened to. BCs can be expressed in different
ways, depending on the modality of the interaction, for example as gestures or
acoustic cues. In this work, we only considered acoustic cues. We are proposing
an approach towards detecting BC opportunities based on acoustic input features
like power and pitch. While other works in the field rely on the use of a
hand-written rule set or specialized features, we made use of artificial neural
networks. They are capable of deriving higher order features from input
features themselves. In our setup, we first used a fully connected feed-forward
network to establish an updated baseline in comparison to our previously
proposed setup. We also extended this setup by the use of Long Short-Term
Memory (LSTM) networks, which have been shown to outperform feed-forward setups
on various tasks. Our best system achieved an F1-Score of 0.37 using power and
pitch features. Adding linguistic information using word2vec, the score
increased to 0.39.
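For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of that standard formula, not the authors' evaluation code; the function name `f1_score` and the example counts are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall (0.0 when undefined)."""
    predicted = true_positives + false_positives   # all predicted backchannels
    actual = true_positives + false_negatives      # all reference backchannels
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# precision = 3/4, recall = 3/5, so F1 = 2/3.
example = f1_score(true_positives=3, false_positives=1, false_negatives=2)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a system cannot reach a high F1-score by maximizing precision or recall alone.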
Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions
This paper presents our latest investigation on modeling backchannel in
conversations. Motivated by a proactive backchanneling theory, we aim at
developing a system which acts as a proactive listener by inserting
backchannels, such as continuers and assessments, to influence speakers. Our
model takes into account not only lexical and acoustic cues, but also
introduces the simple and novel idea of using listener embeddings to mimic
different backchanneling behaviours. Our experimental results on the
Switchboard benchmark dataset reveal that acoustic cues are more important than
lexical cues in this task and their combination with listener embeddings works
best on both manual transcriptions and automatically generated transcriptions. Comment: Published in ICASSP 202
Modeling Speaker-Listener Interaction for Backchannel Prediction
We present our latest findings on backchannel modeling, motivated in a novel
way by the canonical use of the minimal responses Yeah and Uh-huh in English
and their corresponding tokens in German, and by the effect of encoding the speaker-listener
interaction. Backchanneling theories emphasize the active and continuous role
of the listener in the course of the conversation, their effects on the
speaker's subsequent talk, and the consequent dynamic speaker-listener
interaction. Therefore, we propose a neural-based acoustic backchannel
classifier on minimal responses by processing acoustic features from the
speaker's speech, capturing and imitating listeners' backchanneling behavior, and
encoding speaker-listener interaction. Our experimental results on the
Switchboard and GECO datasets reveal that in almost all tested scenarios the
speaker or listener behavior embeddings help the model make more accurate
backchannel predictions. More importantly, a proper interaction encoding
strategy, i.e., combining the speaker and listener embeddings, leads to the
best performance on both datasets in terms of F1-score. Comment: Published in IWSDS 202
Character expression for spoken dialogue systems with semi-supervised learning using Variational Auto-Encoder
The character of a spoken dialogue system is important not only for giving a positive impression of the system but also for building rapport with users. We have proposed a character expression model for spoken dialogue systems. The model expresses three character traits (extroversion, emotional instability, and politeness) by controlling four spoken dialogue behaviors: utterance amount, backchannel, filler, and switching pause length. One major problem in training this model is that it is costly and time-consuming to collect many paired samples of character traits and behaviors. To address this problem, semi-supervised learning is proposed based on a variational auto-encoder that exploits both the limited amount of labeled pair data and unlabeled corpus data. It was confirmed that the proposed model can express given characters more accurately than a baseline model with only supervised learning. We also implemented the character expression model in a spoken dialogue system for an autonomous android robot, and then conducted a subjective experiment with 75 university students to confirm the effectiveness of the character expression for specific dialogue scenarios. The results showed that expressing a character in accordance with the dialogue task by the proposed model improves the user’s impression of appropriateness in formal dialogue such as a job interview.
Iterative Perceptual Learning for Social Behavior Synthesis
We introduce Iterative Perceptual Learning (IPL), a novel approach for learning computational models for social behavior synthesis from corpora of human-human interactions. The IPL approach combines perceptual evaluation with iterative model refinement. Human observers rate the appropriateness of synthesized individual behaviors in the context of a conversation. These ratings are in turn used to refine the machine learning models. As the ratings correspond to those moments in the conversation where the production of a specific social behavior is inappropriate, we can regard features extracted at these moments as negative samples for the training of a machine learning classifier. This is an advantage over traditional corpus-based approaches, in which negative samples are extracted at random from moments in the conversation where the specific social behavior does not occur. We perform a comparison between the IPL approach and the traditional corpus-based approach on the timing of backchannels for a listener in speaker-listener dialogs. While both models perform similarly in terms of precision and recall scores, the results of the IPL model are rated as more appropriate in the perceptual evaluation. We additionally investigate the effect of the amount of available training data and the variation of training data on the outcome of the models.
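The difference between the two negative-sampling strategies described above can be sketched as follows. This is an illustrative sketch only, assuming the conversation is indexed by integer frames; the function names, the `ratings` dictionary layout, and the frame indices are hypothetical and do not come from the paper.

```python
import random


def random_negatives(num_frames, positive_frames, k, seed=0):
    """Corpus-based baseline: draw k frames at random from moments
    where the behavior (e.g. a backchannel) did not occur."""
    rng = random.Random(seed)
    candidates = [i for i in range(num_frames) if i not in positive_frames]
    return sorted(rng.sample(candidates, min(k, len(candidates))))


def rated_negatives(ratings):
    """IPL-style selection: keep the frames where observers judged a
    synthesized behavior to be inappropriate.

    ratings maps frame index -> True if rated inappropriate (assumed layout).
    """
    return sorted(frame for frame, inappropriate in ratings.items() if inappropriate)


# Hypothetical data for illustration only.
baseline = random_negatives(num_frames=10, positive_frames={2, 3}, k=4)
perceptual = rated_negatives({5: True, 9: False, 12: True})
```

In the random baseline, a sampled frame may in fact be a moment where the behavior would have been acceptable; the rated selection avoids this by using only moments that observers explicitly marked as inappropriate.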
Backchannel relevance spaces
This contribution introduces backchannel relevance spaces – intervals where it is relevant for a listener in a conversation to produce a backchannel. By annotating and comparing actual visual and vocal backchannels with potential backchannels established using a group of subjects acting as third-party listeners, we show (i) that visual-only backchannels represent a substantial proportion of all backchannels; and (ii) that there are more opportunities for backchannels (i.e. potential backchannels or backchannel relevance spaces) than there are actual vocal and visual backchannels. These findings indicate that backchannel relevance spaces enable more accurate acoustic, prosodic, lexical (et cetera) descriptions of backchannel-inviting cues than descriptions based on the context of actual vocal backchannels only.
Backchannel Strategies for Artificial Listeners
We evaluate multimodal rule-based strategies for backchannel (BC) generation in face-to-face conversations. Such strategies can be used by artificial listeners to determine when to produce a BC in dialogs with human speakers. In this research, we consider features from the speaker’s speech and gaze. We used six rule-based strategies to determine the placement of BCs. The BCs were performed by an intelligent virtual agent using nods and vocalizations. In a user perception experiment, participants were shown video fragments of a human speaker together with an artificial listener who produced BC behavior according to one of the strategies. Participants were asked to rate how likely they thought the BC behavior had been performed by a human listener. We found that the number, timing and type of BCs had a significant effect on how human-like the BC behavior was perceived to be.