8 research outputs found

    Psychological Metrics for Dialog System Evaluation

    Full text link
    We present metrics for evaluating dialog systems through a psychologically grounded "human" lens in which conversational agents express a diversity of both states (e.g., emotion) and traits (e.g., personality), just as people do. We present five interpretable metrics from established psychology that are fundamental to human communication and relationships: emotional entropy, linguistic style matching, emotion matching, agreeableness, and empathy. These metrics can be applied (1) across dialogs and (2) on turns within dialogs. The psychological metrics are compared against seven state-of-the-art traditional metrics (e.g., BARTScore and BLEURT) on seven standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate that our proposed metrics offer novel information; they are uncorrelated with traditional metrics, can be used to meaningfully compare dialog systems, and lead to increased accuracy (beyond existing traditional metrics) in predicting crowd-sourced dialog judgements. The interpretability and unique signal of our psychological metrics make them a valuable tool for evaluating and improving dialog systems.
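    A minimal sketch of one of these metrics, emotional entropy, under the assumption that it is the Shannon entropy of the emotion labels an agent expresses across its turns; the label set and the per-turn emotion classifier are placeholders, not the authors' implementation.

```python
import math
from collections import Counter

def emotional_entropy(turn_emotions):
    """Shannon entropy (in bits) of the emotion labels expressed across turns.

    turn_emotions: one emotion label per dialog turn, e.g. from any
    off-the-shelf emotion classifier (a stand-in for the paper's setup).
    """
    counts = Counter(turn_emotions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy illustration: an agent expressing varied emotions scores higher than
# one that stays uniformly neutral.
print(emotional_entropy(["joy", "neutral", "sadness", "joy", "anger"]))  # ~1.92
print(emotional_entropy(["neutral"] * 5))                                # 0.0
```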

    Perseverative Thinking is Associated with Features of Spoken Language

    No full text
    Perseverative thinking (PT) is a process that consists of difficulty disengaging from negative thinking; two common forms are worry and rumination. Existing measures of PT require individuals to report on their own thought processes, a method that may be subject to bias or errors. An unobtrusive, behavioral measure of PT would circumvent these biases, improving our ability to detect PT. One promising behavioral method is computational linguistic analysis, which has recently been used to investigate personality and mental health constructs (e.g., Guntuku et al., 2017; Park et al., 2015). Evidence from the co-rumination and expressed worry literatures (e.g., Parkinson & Simons, 2012; Spendelow et al., 2017), combined with the fact that PT is verbal-linguistic in nature (Ehring & Watkins, 2008), suggests that PT may be particularly well-suited for detection in natural language. In this project, we will examine linguistic correlates of PT and build and test a language-based model of PT.
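    A minimal sketch of a language-based PT model of the kind proposed, assuming TF-IDF word features and ridge regression onto a self-report PT score; the texts, scores, and model choices below are invented illustrations, not the project's actual design.

```python
# Sketch only: regress an (invented) self-report PT rating on word features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

texts = [
    "I keep going over the same worry again and again",
    "I had a calm and relaxing weekend outside",
    "Why did I say that? I can't stop replaying it",
    "Looking forward to trying a new recipe tonight",
]
pt_scores = [4.5, 1.0, 4.0, 1.5]  # toy self-report PT ratings, not real data

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(texts, pt_scores)
print(model.predict(["I cannot stop thinking about the same problem"]))
```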

    Multilingual Language Models are not Multicultural: A Case Study in Emotion

    Full text link
    Emotions are experienced and expressed differently across the world. In order to use Large Language Models (LMs) for multilingual tasks that require emotional sensitivity, LMs must reflect this cultural variation in emotion. In this study, we investigate whether widely used multilingual LMs in 2023 reflect differences in emotional expressions across cultures and languages. We find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs do not successfully learn the culturally appropriate nuances of emotion, and we highlight possible research directions towards correcting this. Comment: Accepted to WASSA at ACL 2023.
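    A rough probe in the spirit of this analysis (not the authors' exact method): embed emotion statements with XLM-RoBERTa and compare cross-language similarity to an English anchor; geometry that stays systematically closer to English phrasings would be consistent with the Anglocentric embeddings described. The model checkpoint, mean pooling, and example sentences are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
lm = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentence):
    # Mean-pool the last hidden states into a single sentence vector.
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

anchor = embed("I feel so much pride today.")                      # English anchor
same_lang = embed("I am bursting with pride.")                      # English paraphrase
other_lang = embed("Aaj mujhe bahut garv mahsoos ho raha hai.")     # Hindi paraphrase

cos = torch.nn.functional.cosine_similarity
print(cos(anchor, same_lang, dim=0), cos(anchor, other_lang, dim=0))
```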

    Moral Foundations Twitter Corpus: A collection of 35k tweets annotated for moral sentiment

    No full text
    Research has shown that accounting for moral sentiment in natural language can yield insight into a variety of on- and off-line phenomena, such as message diffusion, protest dynamics, and social distancing. However, measuring moral sentiment in natural language is challenging and the difficulty of this task is exacerbated by the limited availability of annotated data. To address this issue, we introduce the Moral Foundations Twitter Corpus, a collection of 35,108 tweets that have been curated from seven distinct domains of discourse and hand-annotated by at least three trained annotators for 10 categories of moral sentiment. To facilitate investigations of annotator response dynamics, we also provide psychological and demographic meta-data for each annotator. Finally, we report moral sentiment classification baselines for this corpus using a range of popular methodologies.
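    A baseline sketch of the kind reported, assuming majority-aggregated labels are available as a CSV with one column per moral sentiment category; the file and column names are hypothetical, and one-vs-rest logistic regression over TF-IDF features is just one of the popular methodologies mentioned.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# The 10 moral sentiment categories (virtue/vice pairs of the five foundations).
CATEGORIES = ["care", "harm", "fairness", "cheating", "loyalty",
              "betrayal", "authority", "subversion", "purity", "degradation"]

df = pd.read_csv("mftc_majority_labels.csv")  # hypothetical pre-aggregated file
X_train, X_test, y_train, y_test = train_test_split(
    df["tweet_text"], df[CATEGORIES].values, test_size=0.2, random_state=0)

clf = make_pipeline(TfidfVectorizer(min_df=3),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```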

    Introducing the Gab Hate Corpus: Defining and applying hate-based rhetoric to social media posts at scale

    No full text
    We present the Gab Hate Corpus (GHC), consisting of 27,665 posts from the social network service gab.com, each annotated for the presence of “hate-based rhetoric” by a minimum of three annotators. Posts were labeled according to a coding typology derived from a synthesis of hate speech definitions across legal precedent, previous hate speech coding typologies, and definitions from psychology and sociology, comprising hierarchical labels indicating dehumanizing and violent speech as well as indicators of targeted groups and rhetorical framing. We provide inter-annotator agreement statistics and perform a classification analysis in order to validate the corpus and establish performance baselines. The GHC complements existing hate speech datasets in its theoretical grounding and by providing a large, representative sample of richly annotated social media posts.
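    A minimal sketch of the kind of inter-annotator agreement statistics mentioned, computed here as mean pairwise Cohen's kappa on the top-level hate label; the one-column-per-annotator layout and the toy matrix are invented for illustration, not the GHC's actual agreement procedure.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Rows = posts, columns = three annotators, values = 0/1 hate label (toy data).
labels = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

kappas = [cohen_kappa_score(labels[:, i], labels[:, j])
          for i, j in combinations(range(labels.shape[1]), 2)]
print("mean pairwise kappa:", np.mean(kappas))
```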

    The Gab Hate Corpus

    No full text
    The growing prominence of online hate speech is a threat to a safe and just society. This endangering phenomenon requires collaboration across the sciences in order to generate evidence-based knowledge of, and policies for, the dissemination of hatred in online spaces. To foster such collaborations, here we present the Gab Hate Corpus (GHC), consisting of 27,665 posts from the social network service gab.ai, each annotated by a minimum of three trained annotators. Annotators were trained to label posts according to a coding typology derived from a synthesis of hate speech definitions across legal, computational, psychological, and sociological research. We detail the development of the corpus, describe the resulting distributions of hate-based rhetoric, target group, and rhetorical framing labels, and establish baseline classification performance for each using standard natural language processing methods. The GHC, which is the largest theoretically-justified, annotated corpus of hate speech to date, provides opportunities for training and evaluating hate speech classifiers and for scientific inquiries into the linguistic and network components of hate speech.
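    A sketch of a standard baseline of the sort described: majority-vote the three annotations per post into a single binary hate label, then train a linear TF-IDF classifier; the file layout and column names are assumptions about the GHC release format, not its documented schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical layout: one row per (post, annotator) with a 0/1 "hate" column.
anno = pd.read_csv("ghc_annotations.csv")
posts = (anno.groupby(["post_id", "text"])["hate"]
             .mean().ge(0.5).astype(int)        # majority vote across annotators
             .reset_index(name="label"))

X_train, X_test, y_train, y_test = train_test_split(
    posts["text"], posts["label"], test_size=0.2,
    random_state=0, stratify=posts["label"])

clf = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```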