Multilingual Twitter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition
Existing research on fairness evaluation of document classification models
mainly uses synthetic monolingual data without ground truth for author
demographic attributes. In this work, we assemble and publish a multilingual
Twitter corpus for the task of hate speech detection with four inferred author
demographic factors: age, country, gender and race/ethnicity. The corpus covers
five languages: English, Italian, Polish, Portuguese and Spanish. We evaluate
the inferred demographic labels with a crowdsourcing platform, Figure Eight. To
examine factors that can cause biases, we conduct an empirical analysis of
demographic predictability on the English corpus. We measure the performance of
four popular document classifiers and evaluate the fairness and bias of the
baseline classifiers on the author-level demographic attributes. Comment: Accepted at LREC 2020.
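As a minimal illustration of what such author-level fairness evaluation can look like in practice, the sketch below computes per-group F1 scores and the largest gap between groups; the data layout and function names are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch: per-group performance of a hate speech classifier,
# split by one inferred author demographic attribute. Not the authors' code.
from collections import defaultdict
from sklearn.metrics import f1_score

def per_group_f1(y_true, y_pred, groups):
    """Return F1 per demographic group and the largest pairwise gap."""
    by_group = defaultdict(lambda: ([], []))
    for gold, pred, group in zip(y_true, y_pred, groups):
        by_group[group][0].append(gold)
        by_group[group][1].append(pred)
    scores = {g: f1_score(t, p) for g, (t, p) in by_group.items()}
    return scores, max(scores.values()) - min(scores.values())

# Toy example: binary hate speech predictions split by inferred author gender.
scores, gap = per_group_f1(
    y_true=[1, 0, 1, 0],
    y_pred=[1, 0, 0, 0],
    groups=["male", "female", "female", "male"],
)
print(scores, gap)
```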
Metadata Matters: Adaptation Methods For Robust Document Classification
Metadata, implicitly embedded in documents such as time, demographic factors and user interests, can cause language variations and impact the performance of document classifiers. For example, language shifts over periods of time, and males and females express sentiment differently. However, models for document classification, the automatic categorization of documents into categories, typically ignore document metadata. In this thesis, we focus on two types of document metadata, temporality and user factors. We propose to use domain adaptation by treating the values of each metadata attribute as domains (e.g., gender domains: male vs. female), aiming to integrate temporality and user factors into document classifiers and improve classification performance.
First, we propose temporality adaptation that explicitly incorporates time into the representation learning process via feature augmentation and diachronic word embeddings. The feature augmentation method aims to learn time-independent feature weights for document classifiers. We then develop an end-to-end time-adapted model with the diachronic word embeddings under a time-driven framework. Second, we propose user factor adaptation that models demographic attributes and user interests using multitask learning. To model demographic attributes, document classifiers jointly predict demographic factors and document categories. We further develop a multitask user embedding that jointly learns language, user behaviors and user interests. We examine and visualize the impacts of temporality and user factors at the word, topic, semantic and classifier levels.
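For intuition, feature augmentation for domain adaptation is commonly realized by giving each feature a shared copy plus one copy per domain, so a linear classifier can learn both domain-independent and domain-specific weights. The sketch below shows that general scheme under assumed dimensions and domain names; it is not the thesis's implementation.

```python
# General sketch of feature augmentation for domain adaptation (illustrative,
# not the thesis's code): each document vector is expanded into a shared block
# plus one block per domain, with only the document's own domain block active.
import numpy as np

def augment(x, domain, domains):
    """Map feature vector x to [shared | domain_1 | ... | domain_k] blocks."""
    blocks = [x]  # shared copy, active for every domain
    for d in domains:
        blocks.append(x if d == domain else np.zeros_like(x))
    return np.concatenate(blocks)

# Toy example: a 3-dimensional document vector from the "female" gender domain.
x = np.array([0.2, 1.0, 0.0])
print(augment(x, "female", ["male", "female"]))  # 9-dimensional augmented vector
```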
The benefits of adapting to demographic attributes motivate us to examine whether domain adaptation can reduce demographic biases. We release a multilingual hate speech corpus with author-level demographic labels. We examine demographic variations of user language and demographic biases of document classifiers. Following this, to reduce demographic bias, we apply a feature augmentation method to learn demographic-independent classifiers.
Benchmarking Arabic AI with Large Language Models
With large Foundation Models (FMs), language technologies (AI in general) are
entering a new paradigm: eliminating the need for developing large-scale
task-specific datasets and supporting a variety of tasks through set-ups
ranging from zero-shot to few-shot learning. However, understanding the
capabilities of FMs requires a systematic benchmarking effort that compares
their performance with that of state-of-the-art (SOTA) task-specific models. With that
goal, past work focused on the English language and included a few efforts with
multiple languages. Our study contributes to ongoing research by evaluating
FM performance on standard Arabic NLP and speech processing, including a range of
tasks from sequence tagging to content classification across diverse domains.
We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM,
addressing 33 unique tasks using 59 publicly available datasets resulting in 96
test setups. For a few tasks, FMs perform on par with or exceed the
performance of the SOTA models, but for the majority they underperform. Given
the importance of prompts for FM performance, we discuss our prompting
strategies in detail and
elaborate on our findings. Our future work on Arabic AI will explore few-shot
prompting, expand the range of tasks, and investigate additional open-source
models. Comment: Foundation Models, Large Language Models, Arabic NLP, Arabic
Speech, Arabic AI, ChatGPT Evaluation, USM Evaluation, Whisper Evaluation
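To make the zero-shot setup concrete, here is a minimal sketch of how one text-classification task might be queried with GPT-3.5-turbo through the OpenAI Python client (v1 interface); the prompt wording and label set are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative zero-shot classification call for one Arabic NLP task.
# Prompt wording and labels are placeholders, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
LABELS = ["positive", "negative", "neutral"]  # hypothetical sentiment labels

def classify_zero_shot(text: str) -> str:
    prompt = (
        "Classify the sentiment of the following Arabic tweet as one of "
        f"{', '.join(LABELS)}. Answer with the label only.\n\nTweet: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"  # crude fallback

# Task accuracy would then be measured by comparing these predictions
# against the gold labels of the corresponding public test set.
```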
Human-in-the-Loop Hate Speech Classification in a Multilingual Context
The shift of public debate to the digital sphere has been accompanied by a rise in online hate speech. While many promising approaches for hate speech classification have been proposed, studies often focus only on a single language, usually English, and do not address three key concerns: post-deployment performance, classifier maintenance and infrastructural limitations. In this paper, we introduce a new human-in-the-loop BERT-based hate speech classification pipeline and trace its development from initial data collection and annotation all the way to post-deployment. Our classifier, trained using data from our original corpus of over 422k examples, is specifically developed for the inherently multilingual setting of Switzerland and outperforms with its F1 score of 80.5 the currently best-performing BERT-based multilingual classifier by 5.8 F1 points in German and 3.6 F1 points in French. Our systematic evaluations over a 12-month period further highlight the vital importance of continuous, human-in-the-loop classifier maintenance to ensure robust hate speech classification post-deployment.
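As a rough sketch of one step in such a human-in-the-loop pipeline, the snippet below scores incoming posts with a multilingual BERT classifier (via Hugging Face transformers) and queues low-confidence predictions for human annotation; the checkpoint name and threshold are placeholders, not details taken from the paper.

```python
# Sketch of human-in-the-loop routing: low-confidence predictions go to human
# annotators and are later added to the training data for re-finetuning.
# The model name and threshold below are placeholders, not the paper's setup.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="bert-base-multilingual-cased",  # stand-in for a fine-tuned hate speech checkpoint
)

CONFIDENCE_THRESHOLD = 0.8  # assumed value

def route(post: str, annotation_queue: list) -> str:
    pred = clf(post)[0]  # e.g. {"label": "...", "score": 0.93}
    if pred["score"] < CONFIDENCE_THRESHOLD:
        annotation_queue.append(post)  # human annotators label these later
    return pred["label"]

queue = []
print(route("Ceci est un exemple de publication.", queue))
print(len(queue), "post(s) queued for annotation")
```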
Deep learning for religious and continent-based toxic content detection and classification
With time, numerous online communication platforms have emerged that allow people to express themselves, increasing the dissemination of toxic language, such as racism, sexual harassment, and other negative behaviors that are not accepted in polite society. As a result, toxic language identification in online communication has emerged as a critical application of natural language processing. Numerous academic and industrial researchers have recently studied toxic language identification using machine learning algorithms. However, non-toxic comments containing particular identity descriptors, such as Muslim, Jewish, White, and Black, were assigned unrealistically high toxicity ratings by several machine learning models. This research analyzes and compares modern deep learning algorithms for multilabel toxic comment classification. We explore two scenarios: the first is multilabel classification of religion-based toxic comments, and the second is multilabel classification of race- or ethnicity-based toxic comments, each evaluated with various pretrained word embeddings (GloVe, Word2vec, and FastText) and without pretrained embeddings, using an ordinary embedding layer. Experiments show that the CNN model produced the best results for classifying multilabel toxic comments in both scenarios. We compare the performance of these modern deep learning models in terms of multilabel evaluation metrics.
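For reference, a CNN for multilabel toxic comment classification typically uses independent sigmoid outputs, one per label, trained with binary cross-entropy; the Keras sketch below shows that general architecture with assumed dimensions and label counts, not the paper's exact configuration.

```python
# Minimal Keras sketch of a CNN for multilabel toxic comment classification.
# Vocabulary size, dimensions and label count are illustrative assumptions.
from tensorflow.keras import layers, models

NUM_LABELS = 3      # e.g. religion-related toxicity subcategories (assumed)
VOCAB_SIZE = 20000
EMBED_DIM = 100

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # could be initialized with GloVe/Word2vec/FastText weights
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_LABELS, activation="sigmoid"),  # one independent probability per label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```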
Dissecting Deep Language Models: The Explainability and Bias Perspective
The abstract is in the attachment.
Gen-Z: Generative Zero-Shot Text Classification with Contextualized Label Descriptions
Language model (LM) prompting--a popular paradigm for solving NLP tasks--has
been shown to be susceptible to miscalibration and brittleness to slight prompt
variations, caused by its discriminative prompting approach, i.e., predicting
the label given the input. To address these issues, we propose Gen-Z--a
generative prompting framework for zero-shot text classification. GEN-Z is
generative, as it measures the LM likelihood of input text, conditioned on
natural language descriptions of labels. The framework is multivariate, as
label descriptions allow us to seamlessly integrate additional contextual
information about the labels to improve task performance. On various standard
classification benchmarks, with six open-source LM families, we show that
zero-shot classification with simple contextualization of the data source of
the evaluation set consistently outperforms both zero-shot and few-shot
baselines while improving robustness to prompt variations. Further, our
approach enables personalizing classification in a zero-shot manner by
incorporating author, subject, or reader information in the label descriptions.
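A minimal sketch of the generative scoring idea described above, assuming a small open-source causal LM and hypothetical contextualized label descriptions (this is not the GEN-Z implementation): each label is scored by the likelihood the LM assigns to the input text conditioned on that label's description, and the highest-scoring label is predicted.

```python
# Sketch of generative zero-shot classification via label-conditioned
# likelihood. Model choice and descriptions are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def conditional_logprob(context: str, text: str) -> float:
    """Sum of token log-probabilities of `text` given `context` under the LM."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + text, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = lm(full_ids).logits.log_softmax(dim=-1)
    # The token at position i is predicted from position i - 1.
    return sum(
        logprobs[0, i - 1, full_ids[0, i]].item()
        for i in range(ctx_len, full_ids.shape[1])
    )

descriptions = {  # hypothetical contextualized label descriptions
    "positive": "The following movie review expresses a positive opinion: ",
    "negative": "The following movie review expresses a negative opinion: ",
}

def classify(text: str) -> str:
    return max(descriptions, key=lambda y: conditional_logprob(descriptions[y], text))

print(classify("A moving, beautifully shot film."))
```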
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020
Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).