An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India
The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Misogyny and sexism are growing problems in social media. Advances have been
made in online sexism detection but the systems are often uninterpretable.
SemEval-2023 Task 10 on Explainable Detection of Online Sexism aims at
increasing the explainability of sexism detection, and our team participated in
all the proposed subtasks. Our system is based on further domain-adaptive
pre-training (Gururangan et al., 2020). Building on Transformer-based
models with domain adaptation, we compare fine-tuning with multi-task
learning and show that each subtask requires a different system configuration.
In our experiments, multi-task learning performs on par with standard
fine-tuning for sexism detection and noticeably better for coarse-grained
sexism classification, while fine-tuning is preferable for fine-grained
classification.
An Empirical Study of Offensive Language in Online Interactions
In the past decade, usage of social media platforms has increased significantly. People use these platforms to connect with friends and family and to share information, news, and opinions. Platforms such as Facebook and Twitter are often used to propagate offensive and hateful content online. The open nature and anonymity of the internet fuel aggressive and inflamed conversations. Companies and federal institutions are striving to make social media cleaner, more welcoming, and unbiased. In this study, we first explore the underlying topics in popular offensive language datasets using statistical and neural topic modeling. The current state-of-the-art models for aggression detection only present a toxicity score based on the entire post. Content moderators often have to deal with lengthy texts without any word-level indicators. We propose a neural transformer approach for detecting the tokens that make a particular post aggressive. The pre-trained BERT model has achieved state-of-the-art results in various natural language processing tasks. However, the model is trained on general-purpose corpora and lacks aggressive social media linguistic features. We propose fBERT, a BERT model retrained on over a million offensive tweets from the SOLID dataset. We demonstrate the effectiveness and portability of fBERT over BERT in various shared offensive language detection tasks. We further propose a new multi-task aggression detection (MAD) framework for post-level and token-level aggression detection using neural transformers. The experiments confirm the effectiveness of the multi-task learning model over individual models, particularly when the amount of training data is limited.
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
We present the results and main findings of SemEval-2020 Task 12 on
Multilingual Offensive Language Identification in Social Media (OffensEval
2020). The task involves three subtasks corresponding to the hierarchical
taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The
task featured five languages: English, Arabic, Danish, Greek, and Turkish for
Subtask A. In addition, English also featured Subtasks B and C. OffensEval 2020
was one of the most popular tasks at SemEval-2020 attracting a large number of
participants across all subtasks and also across all languages. A total of 528
teams signed up to participate in the task, 145 teams submitted systems during
the evaluation period, and 70 submitted system description papers.
Comment: Proceedings of the International Workshop on Semantic Evaluation
(SemEval-2020)