Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021
As the influence of social media platforms grows, the effects of their
misuse become increasingly harmful. The importance of automatically detecting
threatening and abusive language cannot be overstated. However, most of
the existing studies and state-of-the-art methods focus on English as the
target language, with limited work on low- and medium-resource languages. In
this paper, we present two shared tasks on abusive and threatening language
detection for Urdu, which has more than 170 million speakers
worldwide. Both are posed as binary classification tasks in which participating
systems are required to classify tweets in Urdu into two classes, namely: (i)
Abusive and Non-Abusive for the first task, and (ii) Threatening and
Non-Threatening for the second. We present two manually annotated datasets
containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening
and Non-Threatening. The abusive dataset contains 2400 annotated tweets in the
training set and 1100 annotated tweets in the test set. The threatening dataset
contains 6000 annotated tweets in the training set and 3950 annotated tweets in
the test set. We also provide logistic regression and BERT-based baseline
classifiers for both tasks. In total, 21 teams from six countries (India,
Pakistan, China, Malaysia, United Arab Emirates, and Taiwan) registered for
participation. Of these, 10 teams submitted runs for Subtask A (Abusive
Language Detection), 9 teams submitted runs for Subtask B (Threatening
Language Detection), and seven teams submitted technical reports. The
best-performing system achieved an F1-score of 0.880 for Subtask A and 0.545
for Subtask B. For both subtasks, an m-BERT-based transformer model showed
the best performance.
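
Since systems in both subtasks are ranked by F1-score on a binary labelling, a minimal sketch of that metric may help. The abstract does not say which F1 variant (positive-class, macro, or micro) the organisers used, so the version below assumes the positive-class F1. The function name `f1_score_binary` and the toy labels are hypothetical and are not taken from the shared-task data.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Positive-class F1 for a binary task (assumed variant, see lead-in)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: gold and prediction are both the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive, gold is negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: gold is positive, predicted negative.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        # Precision or recall is zero (or undefined); report 0.0 by convention.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = Abusive, 0 = Non-Abusive (made-up labels).
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(f1_score_binary(gold, pred))
```

In this toy example there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1-score is 0.75.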