5,483 research outputs found

    Automatically Predicting User Ratings for Conversational Systems

    Automatic evaluation models for open-domain conversational agents either correlate poorly with human judgment or require expensive annotations on top of conversation scores. In this work we investigate the feasibility of learning evaluation models without relying on any further annotations besides conversation-level human ratings. We use a dataset of open-domain spoken conversations, rated 1-5, between the conversational agent Roving Mind (competing in the Amazon Alexa Prize Challenge 2017) and Amazon Alexa users. First, we assess the complexity of the task by asking two experts to re-annotate a sample of the dataset and observe that the subjectivity of user ratings yields a low upper bound. Second, through an analysis of the entire dataset we show that automatically extracted features such as user sentiment, Dialogue Acts and conversation length have significant but low correlation with user ratings. Finally, we report the results of our experiments exploring different combinations of these features to train automatic dialogue evaluation models. Our work suggests that predicting subjective user ratings in open-domain conversations is a challenging task.
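    A minimal sketch of the kind of analysis described above, assuming a per-conversation feature table: rank-correlating automatically extracted features with the 1-5 user ratings, then combining them in a simple regression-based evaluation model. The feature names and the synthetic data are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Hypothetical automatically extracted features, one row per conversation.
features = {
    "user_sentiment": rng.normal(0.0, 1.0, n),
    "num_turns": rng.integers(4, 40, n).astype(float),
    "question_act_ratio": rng.uniform(0.0, 1.0, n),
}
ratings = rng.integers(1, 6, n).astype(float)  # 1-5 conversation-level user ratings

# Feature-by-feature rank correlation with the ratings.
for name, values in features.items():
    rho, p = spearmanr(values, ratings)
    print(f"{name}: rho={rho:.3f}, p={p:.3f}")

# Combine the features into a simple regression-based evaluation model.
X = np.column_stack(list(features.values()))
scores = cross_val_score(Ridge(alpha=1.0), X, ratings,
                         scoring="neg_mean_absolute_error", cv=5)
print("cross-validated MAE:", -scores.mean())
```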

    Survey on Evaluation Methods for Dialogue Systems

    In this paper we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part of the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires; however, this tends to be very cost- and time-intensive. Thus, much work has been put into finding methods that reduce the involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented dialogue systems, conversational dialogue systems, and question-answering dialogue systems). We cover each class by introducing its main technologies and then presenting the evaluation methods used for that class.

    Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction

    Recognition of social signals, from human facial expressions or prosody of speech, is a popular research topic in human-robot interaction studies. There is also a long line of research in the spoken dialogue community that investigates user satisfaction in relation to dialogue characteristics. However, very little research relates a combination of multimodal social signals and language features detected during spoken face-to-face human-robot interaction to the resulting user perception of a robot. In this paper we show how different emotional facial expressions of human users, in combination with prosodic characteristics of human speech and features of human-robot dialogue, correlate with users' impressions of the robot after a conversation. We find that happiness in the user's recognised facial expression strongly correlates with likeability of a robot, while dialogue-related features (such as number of human turns or number of sentences per robot utterance) correlate with perceiving a robot as intelligent. In addition, we show that facial expression, emotional features, and prosody are better predictors of human ratings related to perceived robot likeability and anthropomorphism, while linguistic and non-linguistic features more often predict perceived robot intelligence and interpretability. As such, these characteristics may in future be used as an online reward signal for in-situ Reinforcement Learning-based adaptive human-robot dialogue systems.
    Comment: Robo-NLP workshop at ACL 2017. 9 pages, 5 figures, 6 tables.
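    An illustrative sketch, not the paper's actual pipeline: comparing how well two hypothetical feature groups (affect-related vs. dialogue-structure) predict two post-interaction ratings (likeability, intelligence) with cross-validated linear regression. All names and values below are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 120
affect_feats = rng.normal(size=(n, 3))    # e.g. happiness score, valence, pitch range
dialogue_feats = rng.normal(size=(n, 3))  # e.g. human turns, sentences per robot utterance
likeability = rng.uniform(1, 7, n)        # post-conversation questionnaire scores
intelligence = rng.uniform(1, 7, n)

def fit_score(X, y, label):
    # 5-fold cross-validated R^2 of a plain linear model.
    r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
    print(f"{label}: mean R^2 = {r2:.3f}")

fit_score(affect_feats, likeability, "affect features -> likeability")
fit_score(dialogue_feats, intelligence, "dialogue features -> intelligence")
```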

    Modelling Participant Affect in Meetings with Turn-Taking Features

    This paper explores the relationship between turn-taking and meeting affect. To investigate this, we model post-meeting ratings of satisfaction, cohesion and leadership from participants of AMI corpus meetings using group and individual turn-taking features. The results indicate that participants gave higher satisfaction and cohesiveness ratings to meetings with greater group turn-taking freedom and higher individual very-short-utterance rates, while lower ratings were associated with more silence and speaker overlap. Besides its broad applicability to satisfaction ratings, turn-taking freedom was found to be a better predictor than equality of speaking time when considering whether participants felt that everyone had a chance to contribute. When dialogue act information is included, substantive feedback-type turns such as assessments are more predictive of meeting affect than information-giving acts or backchannels. This work highlights the importance of feedback turns and of modelling group-level activity in multiparty dialogue for understanding the social aspects of speech.
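    A hedged sketch of the kind of model such a study might fit: ordinary least squares regression of post-meeting satisfaction ratings on group and individual turn-taking features. The column names are placeholders, not the actual AMI-derived feature set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({
    "satisfaction": rng.uniform(1, 7, n),   # post-meeting questionnaire rating
    "turn_freedom": rng.uniform(0, 1, n),   # group-level turn-taking freedom
    "vsu_rate": rng.uniform(0, 5, n),       # very short utterances per minute
    "silence_ratio": rng.uniform(0, 0.5, n),
    "overlap_ratio": rng.uniform(0, 0.3, n),
})

# Linear model of satisfaction on the turn-taking features.
model = smf.ols(
    "satisfaction ~ turn_freedom + vsu_rate + silence_ratio + overlap_ratio",
    data=df,
).fit()
print(model.summary())
```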

    The Social World of Content Abusers in Community Question Answering

    Community-based question answering platforms can be rich sources of information on a variety of specialized topics, from finance to cooking. The usefulness of such platforms depends heavily on user contributions (questions and answers), but also on respecting the community rules. As a crowd-sourced service, such platforms rely on their users for monitoring and flagging content that violates community rules. Common wisdom is to eliminate the users who receive many flags. Our analysis of a year of traces from a mature Q&A site shows that the number of flags does not tell the full story: on one hand, users with many flags may still contribute positively to the community; on the other hand, users who never get flagged are found to violate community rules and get their accounts suspended. This analysis, however, also shows that abusive users are betrayed by their network properties: we find strong evidence of homophilous behavior and use this finding to detect abusive users who go under the community radar. Based on our empirical observations, we build a classifier that is able to detect abusive users with an accuracy as high as 83%.
    Comment: Published in the proceedings of the 24th International World Wide Web Conference (WWW 2015).
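    A sketch of the general approach rather than the paper's exact features or data: classifying users as abusive from simple activity counts plus a homophily-style signal such as the fraction of a user's neighbours that were flagged, and reporting cross-validated accuracy. Everything below is synthetic and hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([
    rng.poisson(3, n),          # flags received
    rng.poisson(20, n),         # answers posted
    rng.uniform(0, 1, n),       # fraction of flagged neighbours (homophily-style feature)
])
y = rng.integers(0, 2, n)       # 1 = account eventually suspended (toy labels)

acc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, scoring="accuracy").mean()
print(f"cross-validated accuracy: {acc:.2f}")
```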

    A Satisfaction-based Model for Affect Recognition from Conversational Features in Spoken Dialog Systems

    Detecting user affect automatically during real-time conversation is the main challenge towards our greater aim of infusing social intelligence into a natural-language, mixed-initiative High-Fidelity (Hi-Fi) audio control spoken dialog agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, which is subtler; however, subtler emotions are harder to perceive, as demonstrated in tasks such as labelling and machine prediction. This paper attempts to address part of this challenge by considering the role of user satisfaction ratings and of conversational/dialog features in discriminating contentment and frustration, two types of emotions that are known to be prevalent within spoken human-computer interaction. However, given the laboratory constraints, users might be positively biased when rating the system, indirectly making the reliability of the satisfaction data questionable. Machine learning experiments were conducted on two datasets, labelled by users and by annotators respectively, which were then compared in order to assess their reliability. Our results indicated that standard classifiers were significantly more successful in discriminating the abovementioned emotions and their intensities (reflected by user satisfaction ratings) from annotator data than from user data. These results corroborate that, first, satisfaction data can be used directly as an alternative target variable to model affect and can be predicted exclusively from dialog features; and, second, that this only holds when predicting these emotions from annotator data, suggesting that user bias does exist in laboratory-led evaluations.
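    A minimal sketch of the comparison described above, under invented data: the same classifier and dialogue features are evaluated once against labels derived from user satisfaction ratings and once against annotator labels, so the two label sources can be contrasted.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n = 300
dialogue_feats = rng.normal(size=(n, 5))   # e.g. turns, ASR confidence, barge-ins (hypothetical)
user_labels = rng.integers(0, 2, n)        # contentment vs frustration, from user ratings
annotator_labels = rng.integers(0, 2, n)   # the same distinction, from expert annotators

clf = SVC(kernel="rbf", C=1.0)
for name, y in [("user labels", user_labels), ("annotator labels", annotator_labels)]:
    acc = cross_val_score(clf, dialogue_feats, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: accuracy = {acc:.2f}")
```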

    Online backchannel synthesis evaluation with the switching Wizard of Oz

    In this paper, we evaluate a backchannel synthesis algorithm in an online conversation between a human speaker and a virtual listener. We adopt the Switching Wizard of Oz (SWOZ) approach to assess behavior synthesis algorithms online. A human speaker watches a virtual listener that is either controlled by a human listener or by an algorithm. The source switches at random intervals. Speakers indicate when they feel they are no longer talking to a human listener. Analysis of these responses reveals patterns of inappropriate behavior in terms of the quantity and timing of backchannels.
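    A toy simulation of the switching protocol described above: the listener's control source alternates between human and algorithm at random intervals, and we log which source was active when the speaker signals that the listener no longer feels human. The interval bounds and press times are made-up parameters, not those used in the study.

```python
import random

random.seed(5)

def simulate_swoz(duration_s=180.0, min_interval=10.0, max_interval=30.0):
    """Return a list of (start, end, source) segments covering the session."""
    segments, t, source = [], 0.0, "human"
    while t < duration_s:
        length = random.uniform(min_interval, max_interval)
        end = min(t + length, duration_s)
        segments.append((t, end, source))
        source = "algorithm" if source == "human" else "human"
        t = end
    return segments

def source_at(segments, t):
    # Look up which control source was active at time t.
    for start, end, source in segments:
        if start <= t < end:
            return source
    return None

segments = simulate_swoz()
button_presses = [42.0, 95.5, 160.2]   # times the speaker felt the listener was not human
for t in button_presses:
    print(f"press at {t:6.1f}s -> active source: {source_at(segments, t)}")
```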

    Backchannels: Quantity, Type and Timing Matters

    In a perception experiment, we systematically varied the quantity, type and timing of backchannels. Participants viewed stimuli of a real speaker side-by-side with an animated listener and rated how human-like they perceived the latter's backchannel behavior. In addition, we obtained measures of appropriateness and optionality for each backchannel from keystrokes. This approach allowed us to analyze the influence of each factor on entire fragments and on individual backchannels. The originally performed type and timing of a backchannel appeared more human-like than a switched type or random timing. In addition, we found that nods are more often appropriate than vocalizations. As for quantity, too few or too many backchannels per minute reduced the perceived quality of the behavior. These findings are important for the design of algorithms for the automatic generation of backchannel behavior for artificial listeners.
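    An illustrative way to analyse ratings from such a perception experiment, assuming one condition label per stimulus: compare mean human-likeness ratings across the manipulated backchannel conditions with a one-way ANOVA. The condition names and ratings here are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(6)
conditions = ["original", "switched_type", "random_timing", "too_few", "too_many"]
df = pd.DataFrame({
    "condition": np.repeat(conditions, 30),
    "rating": rng.uniform(1, 7, 30 * len(conditions)),  # human-likeness ratings per stimulus
})

# Mean rating per manipulated condition, then a one-way ANOVA across conditions.
print(df.groupby("condition")["rating"].mean().round(2))
groups = [g["rating"].to_numpy() for _, g in df.groupby("condition")]
stat, p = f_oneway(*groups)
print(f"one-way ANOVA: F = {stat:.2f}, p = {p:.3f}")
```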