
    The right kind of explanation: Validity in automated hate speech detection

    Get PDF
    To quickly identify hate speech online, communication research offers a useful tool in the form of automatic content analysis. However, the combined methods of standardized manual content analysis and supervised text classification demand different quality criteria. This chapter shows that a more substantial examination of validity is necessary, since models often learn from spurious correlations or biases and researchers run the risk of drawing wrong inferences. To investigate the overlap of theoretical concepts with technological operationalization, explainability methods are evaluated to explain what a model has learned. These methods proved to be of limited use in testing the validity of a model when the generated explanations aim at sense-making rather than faithfulness to the model. The chapter ends with recommendations for further interdisciplinary development of automatic content analysis.
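
    One common way to probe whether an explanation is faithful to the model rather than merely plausible is a token-deletion test. The sketch below is a minimal illustration of that idea, not the chapter's method; `predict_proba` (a classifier function taking a list of texts and returning class probabilities) and `explain_tokens` (any attribution method returning tokens ranked by importance) are assumed placeholders.

```python
import numpy as np

def deletion_faithfulness(text, predict_proba, explain_tokens, top_k=5):
    """Crude faithfulness probe: drop the k tokens an explainer marks as most
    important and measure how much the hate-speech probability changes.
    A faithful explanation should cause a larger drop than removing random tokens."""
    tokens = text.split()
    p_full = predict_proba([text])[0, 1]  # P(hate) on the intact text

    # explain_tokens is assumed to return tokens ranked by importance
    important = set(explain_tokens(text)[:top_k])
    reduced = " ".join(t for t in tokens if t not in important)
    p_reduced = predict_proba([reduced])[0, 1]

    # Baseline: remove the same number of randomly chosen tokens
    rng = np.random.default_rng(0)
    random_drop = set(rng.choice(tokens, size=min(top_k, len(tokens)), replace=False))
    p_random = predict_proba([" ".join(t for t in tokens if t not in random_drop)])[0, 1]

    return {"drop_explained": p_full - p_reduced, "drop_random": p_full - p_random}
```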

    Not All Comments are Equal: Insights into Comment Moderation from a Topic-Aware Model

    Get PDF
    Moderation of reader comments is a significant problem for online news platforms. Here, we experiment with models for automatic moderation, using a dataset of comments from a popular Croatian newspaper. Our analysis shows that while comments that violate the moderation rules mostly share common linguistic and thematic features, their content varies across the different sections of the newspaper. We therefore make our models topic-aware, incorporating semantic features from a topic model into the classification decision. Our results show that topic information improves the performance of the model, increases its confidence in correct outputs, and helps us understand the model's outputs.
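
    A minimal sketch of the general idea of a topic-aware moderation classifier, assuming scikit-learn and invented data variables; this is an illustration of combining topic-model features with lexical features, not the authors' implementation.

```python
from scipy.sparse import hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_topic_aware_moderator(comments, labels, n_topics=20):
    """Train a comment-moderation classifier whose features combine
    TF-IDF n-grams with document-topic proportions from an LDA topic model."""
    tfidf = TfidfVectorizer(min_df=5, ngram_range=(1, 2))
    X_tfidf = tfidf.fit_transform(comments)

    # Topic features: LDA expects raw counts rather than TF-IDF weights
    counts = CountVectorizer(min_df=5)
    X_counts = counts.fit_transform(comments)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    X_topics = lda.fit_transform(X_counts)  # dense matrix of shape (n_docs, n_topics)

    # Concatenate lexical and topic features into a single design matrix
    X = hstack([X_tfidf, X_topics])
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X, labels)
    return tfidf, counts, lda, clf
```

    At prediction time the same vectorizers and topic model transform a new comment before classification, so the topic signal is available to the classifier for every newspaper section.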

    Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models

    Full text link
    The advent of social media has given rise to numerous ethical challenges, with hate speech among the most significant concerns. Researchers are attempting to tackle this problem by leveraging hate-speech detection and employing language models to automatically moderate content and promote civil discourse. Unfortunately, recent studies have revealed that hate-speech detection systems can be misled by adversarial attacks, raising concerns about their resilience. While previous research has separately addressed the robustness of these models under adversarial attacks and their interpretability, there has been no comprehensive study exploring their intersection. The novelty of our work lies in combining these two critical aspects, leveraging interpretability to identify potential vulnerabilities and enabling the design of targeted adversarial attacks. We present a comprehensive and comparative analysis of adversarial robustness exhibited by various hate-speech detection models. Our study evaluates the resilience of these models against adversarial attacks using explainability techniques. To gain insights into the models' decision-making processes, we employ the Local Interpretable Model-agnostic Explanations (LIME) framework. Based on the explainability results obtained by LIME, we devise and execute targeted attacks on the text by leveraging the TextAttack tool. Our findings enhance the understanding of the vulnerabilities and strengths exhibited by state-of-the-art hate-speech detection models. This work underscores the importance of incorporating explainability in the development and evaluation of such models to enhance their resilience against adversarial attacks. Ultimately, this work paves the way for creating more robust and reliable hate-speech detection systems, fostering safer online environments and promoting ethical discourse on social media platforms.
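
    A hedged sketch of the pipeline the abstract describes: use LIME to find the tokens a detector relies on, then perturb only those tokens. The LIME calls are the library's real API; `predict_proba` is an assumed wrapper that takes a list of texts and returns class probabilities, and the character swap is a simplified stand-in for the TextAttack recipes used in the paper.

```python
from lime.lime_text import LimeTextExplainer

def lime_guided_attack(text, predict_proba, num_features=5):
    """Use LIME to find the tokens a hate-speech detector relies on most,
    then perturb only those tokens (here: a crude character swap) to probe
    the model's adversarial robustness."""
    explainer = LimeTextExplainer(class_names=["non-hate", "hate"])
    exp = explainer.explain_instance(text, predict_proba, num_features=num_features)
    important = {word for word, weight in exp.as_list() if weight > 0}

    perturbed_tokens = []
    for tok in text.split():
        if tok in important and len(tok) > 2:
            # Swap two inner characters to evade token-level cues
            tok = tok[0] + tok[2] + tok[1] + tok[3:]
        perturbed_tokens.append(tok)
    perturbed = " ".join(perturbed_tokens)

    # Compare the detector's hate probability before and after the attack
    before = predict_proba([text])[0, 1]
    after = predict_proba([perturbed])[0, 1]
    return perturbed, before, after
```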

    Recalibrating classifiers for interpretable abusive content detection

    Get PDF
    Dataset and code for the paper 'Recalibrating classifiers for interpretable abusive content detection' by Vidgen et al. (2020), to appear at the NLP + CSS workshop at EMNLP 2020. We provide: (1) 1,000 annotated tweets, sampled using the Davidson classifier across 20 score increments of 0.05 (50 from each) from a dataset of tweets directed against MPs in the UK 2017 General Election; (2) 1,000 annotated tweets, sampled using the Perspective classifier across 20 score increments of 0.05 (50 from each) from the same dataset; (3) code for recalibration in R and Stan; and (4) annotation guidelines for both datasets. Paper abstract: We investigate the use of machine learning classifiers for detecting online abuse in empirical research. We show that uncalibrated classifiers (i.e. where the 'raw' scores are used) align poorly with human evaluations. This limits their use for understanding the dynamics, patterns and prevalence of online abuse. We examine two widely used classifiers (created by Perspective and Davidson et al.) on a dataset of tweets directed against candidates in the UK's 2017 general election. A Bayesian approach is presented to recalibrate the raw scores from the classifiers, using probabilistic programming and newly annotated data. We argue that interpretability evaluation and recalibration are integral to the application of abusive content classifiers.
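
    The released code is in R and Stan; the sketch below is a simplified Python stand-in that maps raw classifier scores to probabilities agreeing with human annotations via logistic (Platt-style) calibration, rather than the full Bayesian model described in the paper. Variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recalibrate_scores(raw_scores, human_labels):
    """Fit a logistic recalibration map from a classifier's raw abuse scores
    (e.g. Perspective or Davidson outputs in [0, 1]) to probabilities that
    agree with human annotations. The paper fits the analogous model in a
    Bayesian way with Stan; this is a frequentist approximation."""
    X = np.asarray(raw_scores, dtype=float).reshape(-1, 1)
    y = np.asarray(human_labels, dtype=int)
    calibrator = LogisticRegression()
    calibrator.fit(X, y)
    return calibrator

# Usage: calibrated probability of abuse for new raw scores
# calibrator = recalibrate_scores(raw_scores, human_labels)
# p_abuse = calibrator.predict_proba(new_scores.reshape(-1, 1))[:, 1]
```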

    "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction

    Full text link
    Despite the proliferation of explainable AI (XAI) methods, little is understood about end-users' explainability needs. This gap is critical, because end-users may have needs that XAI methods should but don't yet support. To address this gap and contribute to understanding how explainability can support human-AI interaction, we conducted a study of a real-world AI application via interviews with 20 end-users of Merlin, a bird-identification app. We found that people express a need for practically useful information that can improve their collaboration with the AI system, and intend to use XAI explanations for calibrating trust, improving their task skills, changing their behavior to supply better inputs to the AI system, and giving constructive feedback to developers. We also assessed end-users' perceptions of existing XAI approaches, finding that they prefer part-based explanations. Finally, we discuss implications of our findings and provide recommendations for future designs of XAI, specifically XAI for human-AI collaboration.

    Challenges and perspectives of hate speech research

    Get PDF
    This book is the result of a conference that could not take place. It is a collection of 26 texts that address and discuss the latest developments in international hate speech research from a wide range of disciplinary perspectives. This includes case studies from Brazil, Lebanon, Poland, Nigeria, and India, theoretical introductions to the concepts of hate speech, dangerous speech, incivility, toxicity, extreme speech, and dark participation, as well as reflections on methodological challenges such as scraping, annotation, datafication, implicity, explainability, and machine learning. As such, it provides a much-needed forum for cross-national and cross-disciplinary conversations in what is currently a very vibrant field of research.

    The Design and Evaluation of Neural Attention Mechanisms for Explaining Text Classifiers

    Full text link
    The last several years have seen a surge of interest in interpretability in AI and machine learning--the idea of producing human-understandable explanations for AI model behavior. This interest has grown out of concerns about the robustness and accountability of AI-driven systems, particularly deep neural networks, in light of the increasing ubiquity of such systems in industry, science and government. The general hope of the field is that by producing explanations of model behavior for human consumption, one or more model-using stakeholder groups (e.g. model designers, model-advised decision-makers, recipients of model-driven decisions) will be able to derive some type of increased utility from those models (e.g. easier model debugging, better decision-making, higher user satisfaction).

    The early years of this field have seen a profusion of technique but a paucity of evaluation. A number of methods have been proposed for explaining the decisions of deep neural models, or for constraining neural models to behave in more interpretable ways. However, it has proven difficult for the community to reach a consensus about how to evaluate the quality of such methods. Automated evaluation protocols such as collecting gold-standard explanations do not necessarily correlate well with true practical utility, while fully application-oriented evaluations are expensive, difficult to generalize from, and, it increasingly appears, an extremely difficult HCI challenge.

    In this work I address gaps in both the design and evaluation of interpretability methods for text classifiers. I present two novel interpretability methods. The first is a feature-based explanation technique which uses an adversarial attention mechanism to identify all predictive signal in the body of an input text, allowing it to outperform strong baselines with respect to human gold-standard annotations. The second is an example-based technique that retrieves explanatory examples using only the features that were important to a given prediction, leading to examples which are much more relevant than those produced by strong baselines. I accompany each method with a formal user study evaluating whether that type of explanation improves human performance in model-assisted decision-making. In neither study am I able to demonstrate an improvement in human performance as an effect of explanation presence. This, along with other recent results in the interpretability literature, begins to reveal an intriguing expectation gap between the enthusiasm that the interpretability topic has engendered in the machine learning community and the actual utility of these techniques, in terms of human outcomes, that the community has been able to demonstrate.

    Both studies represent contributions to the design of evaluation studies for interpretable machine learning. The second study in particular is one of the first human evaluations of example-based explanations for neural text classifiers. Its outcome reveals several important, nonobvious design issues in example-based explanation systems which should helpfully inform future work on the topic.

    PhD, Information, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/153357/1/scarton_1.pd
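
    A minimal sketch of the core idea behind the second method as the abstract describes it: retrieve training examples that are nearest neighbours in a representation restricted to the features that mattered for the current prediction. The masking threshold and dense feature vectors are assumptions for illustration, not the dissertation's actual architecture.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def explanatory_examples(x_query, feature_importance, X_train, top_k=3,
                         importance_quantile=0.9):
    """Retrieve explanatory training examples using only the features that
    were important to the current prediction: mask everything below an
    importance threshold, then rank training documents by cosine similarity
    in the masked feature space."""
    importance = np.abs(np.asarray(feature_importance))
    threshold = np.quantile(importance, importance_quantile)
    mask = (importance >= threshold).astype(float)  # keep only salient features

    query_masked = np.asarray(x_query) * mask
    train_masked = np.asarray(X_train) * mask       # broadcast mask over rows

    sims = cosine_similarity(query_masked.reshape(1, -1), train_masked)[0]
    return np.argsort(sims)[::-1][:top_k]           # indices of nearest examples
```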