Hate Speech and Offensive Language Detection in Bengali
Social media often serves as a breeding ground for hateful and offensive content. Identifying such content on social media is crucial because of its impact on people of different races, genders, and religions in an unprejudiced society.
However, while there is extensive research in hate speech detection in English,
there is a gap in hateful content detection in low-resource languages like
Bengali. Moreover, a current trend on social media is the use of Romanized Bengali for everyday interactions. To overcome the limitations of existing research, in this study we develop an annotated dataset of 10K Bengali posts, consisting of 5K actual and 5K Romanized Bengali tweets. We implement
several baseline models for the classification of such hateful posts. We
further explore the interlingual transfer mechanism to boost classification
performance. Finally, we perform an in-depth error analysis by looking into the
misclassified posts by the models. While training actual and Romanized datasets
separately, we observe that XLM-RoBERTa performs best. Further, we find that under joint training and few-shot training, MuRIL outperforms the other models by interpreting the semantic expressions better. We make our code and dataset public for others.
Comment: Accepted at AACL-IJCNLP 202
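As a concrete illustration of the fine-tuning approach such baselines rely on, here is a minimal sketch of training a multilingual encoder such as MuRIL for binary hate/offensive classification with Hugging Face Transformers. The checkpoint name, label scheme, and example post are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch: one fine-tuning step for MuRIL on a hate/offensive post.
# "xlm-roberta-base" would be the analogous XLM-RoBERTa checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "google/muril-base-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Hypothetical Romanized Bengali post; the real dataset pairs 5K actual
# and 5K Romanized tweets.
batch = tokenizer(["<romanized bengali post>"], return_tensors="pt",
                  truncation=True, padding=True)
labels = torch.tensor([1])  # assumed scheme: 1 = hateful/offensive, 0 = normal

outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()                  # an optimiser step would follow
print(f"loss = {outputs.loss.item():.4f}")
```

In this sketch, joint training would correspond to mixing actual and Romanized posts into one training set, and few-shot training to adding a handful of target-variant samples; the paper's exact recipes may differ.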
Probing LLMs for hate speech detection: strengths and vulnerabilities
Recently, efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. However, none of these works aims to use explanations, additional context, or victim community information in the detection process. We use different prompt variations and input information to evaluate large language models in a zero-shot setting (without adding any in-context examples). We select three large language models (GPT-3.5, text-davinci, and Flan-T5) and three datasets: HateXplain, Implicit Hate, and ToxicSpans. We find that, on average, including the
target information in the pipeline improves the model performance substantially
(~20-30%) over the baseline across the datasets. Adding rationales/explanations to the pipeline also has a considerable effect (~10-20%) over the baseline across the datasets. In addition, we provide a typology of
the error cases where these large language models fail to (i) classify and (ii)
explain the reasons for the decisions they take. Such vulnerable points automatically constitute 'jailbreak' prompts for these models, and industry-scale safeguard techniques need to be developed to make the models robust against such prompts.
Comment: 13 pages, 9 figures, 7 tables, accepted to Findings of EMNLP 202
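To make the pipeline concrete, here is a hedged sketch of how the target community and an annotated rationale might be folded into a zero-shot prompt. The prompt wording, the helper function, and the OpenAI client call are illustrative assumptions, not the paper's released pipeline.

```python
# Hedged sketch: zero-shot classification prompt that optionally includes
# the victim/target community and an annotated rationale span.
from typing import Optional
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

def build_prompt(post: str, target: Optional[str] = None,
                 rationale: Optional[str] = None) -> str:
    parts = ["Classify the following post as 'hateful' or 'normal'."]
    if target:  # extra context the paper finds worth ~20-30% over the baseline
        parts.append(f"The post targets the community: {target}.")
    if rationale:  # explanations add roughly 10-20% over the baseline
        parts.append(f'Annotators marked this span as evidence: "{rationale}".')
    parts.append(f"Post: {post}\nAnswer with one word.")
    return "\n".join(parts)

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed stand-in for the paper's GPT-3.5 setting
    messages=[{"role": "user", "content": build_prompt(
        "<example post>", target="<community>", rationale="<span>")}],
)
print(reply.choices[0].message.content)
```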
Rationale-Guided Few-Shot Classification to Detect Abusive Language
Abusive language is a concerning problem in online social media. Past
research on detecting abusive language covers different platforms, languages,
demographics, etc. However, models trained on these datasets do not perform well in cross-domain evaluation settings. To overcome this, a common strategy is to train models on a few samples from the target domain to improve performance in that domain (cross-domain few-shot training). However, this might cause the models to overfit to artefacts of those samples. A compelling
solution could be to guide the models toward rationales, i.e., spans of text
that justify the text's label. This method has been found to improve model
performance in the in-domain setting across various NLP tasks. In this paper,
we propose RGFS (Rationale-Guided Few-Shot Classification) for abusive language
detection. We first build a multitask learning setup to jointly learn
rationales, targets, and labels, and find a significant improvement of 6% macro F1 on the rationale detection task over training standalone rationale classifiers.
We introduce two rationale-integrated BERT-based architectures (the RGFS
models) and evaluate our systems over five different abusive language datasets,
finding that in the few-shot classification setting, RGFS-based models
outperform baseline models by about 7% in macro F1 and perform competitively with models fine-tuned on other source domains. Furthermore, RGFS-based models outperform LIME/SHAP-based approaches in terms of plausibility and are close in performance in terms of faithfulness.
Comment: 11 pages, 14 tables, 3 figures. The code repository is https://github.com/punyajoy/RGFS_ECA
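The multitask setup described above can be sketched as a shared encoder with two heads: a token-level head for rationale spans and a sequence-level head for the abuse label. The head sizes, loss weighting, and checkpoint below are assumptions; the linked repository has the authors' actual architecture.

```python
# Hedged sketch of the multitask idea behind RGFS: a shared BERT encoder
# jointly trained on rationale tagging and abuse classification.
import torch
import torch.nn as nn
from transformers import AutoModel

class RationaleGuidedClassifier(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased",
                 num_labels: int = 2, rationale_weight: float = 1.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.rationale_head = nn.Linear(hidden, 2)  # token is / is not rationale
        self.label_head = nn.Linear(hidden, num_labels)
        self.rationale_weight = rationale_weight    # assumed loss weighting

    def forward(self, input_ids, attention_mask,
                rationale_tags=None, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_logits = self.rationale_head(out.last_hidden_state)   # (B, T, 2)
        cls_logits = self.label_head(out.last_hidden_state[:, 0])   # (B, C)
        loss = None
        if rationale_tags is not None and labels is not None:
            ce = nn.CrossEntropyLoss()
            loss = ce(cls_logits, labels) + self.rationale_weight * ce(
                token_logits.view(-1, 2), rationale_tags.view(-1))
        return {"loss": loss, "token_logits": token_logits,
                "cls_logits": cls_logits}
```

Sharing the encoder is what lets the rationale signal guide the label head; at few-shot fine-tuning time the same spans steer the model away from sample artefacts.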
Thou shalt not hate: Countering Online Hate Speech
Hate content on social media is ever-increasing. While Facebook, Twitter, and Google have taken several steps to tackle hateful content, they have mostly been unsuccessful. Counterspeech is seen as an effective way of tackling online hate without harming freedom of speech. Thus, an
alternative strategy for these platforms could be to promote counterspeech as a
defense against hate content. However, in order to have a successful promotion
of such counterspeech, one has to have a deep understanding of its dynamics in
the online world. Lack of carefully curated data largely inhibits such
understanding. In this paper, we create and release the first ever dataset for
counterspeech using comments from YouTube. The data contains 13,924 manually
annotated comments where the labels indicate whether a comment is a
counterspeech or not. This data allows us to perform a rigorous measurement
study characterizing the linguistic structure of counterspeech for the first
time. This analysis yields several interesting insights: counterspeech comments receive far more likes than non-counterspeech comments; for certain communities, the majority of non-counterspeech comments tend to be hate speech; the different types of counterspeech are not all equally effective; and, as a detailed psycholinguistic analysis reveals, the language of users posting counterspeech differs markedly from that of users posting non-counterspeech. Finally, we build a set of
machine learning models that are able to automatically detect counterspeech in
YouTube videos with an F1-score of 0.71. We also build multilabel models that
can detect different types of counterspeech in a comment with an F1-score of
0.60.
Comment: Accepted at ICWSM 2019. 12 pages, 5 figures, and 7 tables. The dataset and models are available here: https://github.com/binny-mathew/Countering_Hate_Speech_ICWSM201
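As a rough illustration of such a detection model, here is a hedged sketch of a binary counterspeech classifier using TF-IDF features and logistic regression; the feature set, model family, and toy examples are assumptions rather than the paper's exact pipeline.

```python
# Hedged sketch: a minimal counterspeech-vs-not classifier in scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical placeholder comments; the released dataset has 13,924 labeled
# YouTube comments (assumed encoding: 1 = counterspeech, 0 = not).
comments = ["this generalisation is unfair, here is a counterexample",
            "totally agree with this hateful take",
            "attacking a whole community for one person's act is wrong",
            "they deserve the abuse"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.5, random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```

The multilabel variant in the paper would swap the binary target for one indicator per counterspeech type, e.g. via a one-vs-rest wrapper around the same pipeline.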
Rationale-Guided Few-Shot Classification to Detect Abusive Language
Dataset for our paper of the same name, accepted at ECAI 2023:
https://github.com/punyajoy/RGFS_ECA