QutNocturnal@HASOC'19: CNN for Hate Speech and Offensive Content Identification in Hindi Language
We describe our top-team solution to Task 1 for Hindi in the HASOC contest
organised by FIRE 2019. The task is to identify hate speech and offensive
language in Hindi. More specifically, it is a binary classification problem
where a system is required to classify tweets into two classes: (a) Hate
and Offensive (HOF) and (b) Not Hate or Offensive (NOT). In contrast to
the popular idea of pretraining word vectors (a.k.a. word embedding) with a
large corpus from a general domain such as Wikipedia, we used a relatively
small collection of relevant tweets (i.e. random and sarcasm tweets in Hindi
and Hinglish) for pretraining. We trained a Convolutional Neural Network (CNN)
on top of the pretrained word vectors. This approach allowed us to be ranked
first for this task out of all teams. Our approach could easily be adapted to
other applications where the goal is to predict the class of a text when the
provided context is limited.
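The core of the approach described above — a CNN applied on top of pretrained word vectors, with a binary HOF/NOT output — can be sketched in miniature. This is a toy illustration only: the paper's actual embeddings were pretrained on Hindi and Hinglish tweets, while here random vectors, a single window-2 filter bank, and made-up tokens stand in for all of that.

```python
import numpy as np

# Toy stand-in for pretrained word vectors (the paper used vectors pretrained
# on a small collection of relevant Hindi/Hinglish tweets).
rng = np.random.default_rng(0)
vocab = {f"word{i}": rng.normal(size=8) for i in range(20)}  # dim-8 embeddings

def cnn_score(tokens, filters, bias):
    """1D convolution (window size 2) over word vectors, max-over-time
    pooling, then a logistic output -- the skeleton of a text CNN."""
    E = np.stack([vocab[t] for t in tokens])              # (seq_len, dim)
    windows = np.stack([np.concatenate([E[i], E[i + 1]])
                        for i in range(len(tokens) - 1)])  # (seq_len-1, 2*dim)
    feats = np.maximum(windows @ filters, 0.0)            # ReLU feature maps
    pooled = feats.max(axis=0)                            # max-over-time pooling
    logit = pooled @ np.ones(filters.shape[1]) + bias     # linear output layer
    return 1.0 / (1.0 + np.exp(-logit))                   # P(class = HOF)

filters = rng.normal(size=(16, 4))                        # 4 convolutional filters
p = cnn_score(["word1", "word2", "word3"], filters, bias=-1.0)
print(p)
```

In a real system the filters and output weights would be learned by gradient descent, and the pretrained embedding matrix could be fine-tuned jointly or kept frozen.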
An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India
The pervasiveness of offensive content in social media has become an important cause for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
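The zero-shot transfer result above rests on the fact that a multilingual encoder maps all of its languages into one shared vector space, so a classifier fitted on one language's vectors can score another's with no target-language training data. The following is a purely illustrative simulation of that protocol (no transformer involved): both "languages" are modeled as point clouds around shared class centroids, and a nearest-centroid rule stands in for the classifier head.

```python
import numpy as np

# Simulated shared embedding space: both source and target language examples
# cluster around the same per-class centroids (as a multilingual encoder
# such as XLM-R is intended to arrange them).
rng = np.random.default_rng(1)
centroids = {0: np.array([-1.0, 0.0]), 1: np.array([1.0, 0.0])}

def sample(n, label):
    """Draw n points for one class; the 'language' only changes the draw."""
    return centroids[label] + 0.3 * rng.normal(size=(n, 2))

# Fit on the source language: nearest-centroid rule from source-language data.
Xs0, Xs1 = sample(50, 0), sample(50, 1)
mu0, mu1 = Xs0.mean(axis=0), Xs1.mean(axis=0)

def predict(X):
    d0 = ((X - mu0) ** 2).sum(axis=1)
    d1 = ((X - mu1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

# Zero-shot evaluation: score target-language points with no target training.
Xt = np.vstack([sample(50, 0), sample(50, 1)])
yt = np.array([0] * 50 + [1] * 50)
acc = (predict(Xt) == yt).mean()
print(acc)
```

Transfer succeeds in this toy exactly to the degree that the two languages share the space; a weaker multilingual encoder corresponds to target clouds drifting away from the source centroids.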
Deep learning for religious and continent-based toxic content detection and classification
With time, numerous online communication platforms have emerged that allow people to express themselves, increasing the dissemination of toxic language, such as racism, sexual harassment, and other negative behaviors that are not accepted in polite society. As a result, toxic language identification in online communication has emerged as a critical application of natural language processing. Numerous academic and industrial researchers have recently studied toxic language identification using machine learning algorithms. However, in several machine learning models, nontoxic comments containing particular identity descriptors, such as Muslim, Jewish, White, and Black, were assigned unrealistically high toxicity ratings. This research analyzes and compares modern deep learning algorithms for multilabel toxic comment classification. We explore two scenarios: the first is a multilabel classification of religion-based toxic comments, and the second is a multilabel classification of race- or ethnicity-based toxic comments, both with various pretrained word embeddings (GloVe, Word2vec, and FastText) and without pretrained embeddings, using an ordinary trainable embedding layer. Experiments show that the CNN model produced the best results for classifying multilabel toxic comments in both scenarios. We compare the outcomes of these modern deep learning models in terms of multilabel evaluation metrics.
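The multilabel setting in this last abstract differs from the binary HOF/NOT task above: each toxicity category gets its own independent sigmoid output, so one comment can receive several labels at once. A minimal sketch of that output layer, with made-up label names and random untrained weights (not the paper's model or data):

```python
import numpy as np

# Hypothetical label set for illustration; the real taxonomy is task-specific.
labels = ["toxic", "obscene", "insult", "threat"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(features, W, b, threshold=0.5):
    """Multilabel head: one independent sigmoid per label, thresholded,
    so any subset of labels (including none or all) can fire."""
    probs = sigmoid(features @ W + b)
    return [label for label, p in zip(labels, probs) if p >= threshold]

rng = np.random.default_rng(2)
W = rng.normal(size=(8, len(labels)))   # untrained weights, illustration only
b = np.zeros(len(labels))
x = rng.normal(size=8)                  # stands in for pooled CNN features
preds = predict_labels(x, W, b)
print(preds)
```

Training such a head uses a per-label binary cross-entropy loss rather than the categorical cross-entropy of single-label softmax classification, which is why multilabel metrics (e.g., micro/macro F1, Hamming loss) are reported instead of plain accuracy.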