HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection
Due to the severity of offensive and hateful comments on social media in
Brazil, and the lack of research in Portuguese, this paper provides the first
large-scale expert annotated corpus of Brazilian Instagram comments for hate
speech and offensive language detection. The HateBR corpus was collected from
the comment section of Brazilian politicians' accounts on Instagram and
manually annotated by specialists, reaching a high inter-annotator agreement.
The corpus consists of 7,000 documents annotated according to three different
layers: a binary classification (offensive versus non-offensive comments),
offensiveness-level classification (highly, moderately, and slightly
offensive), and nine hate speech groups (xenophobia, racism, homophobia,
sexism, religious intolerance, partyism, apology for the dictatorship,
antisemitism, and fatphobia). We also implemented baseline experiments for
offensive language and hate speech detection and compared them with a
literature baseline. Results show that the baseline experiments on our corpus
outperform the current state of the art for the Portuguese language.

Comment: Published at LREC 2022 Proceedings
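
The three annotation layers described in the abstract could be represented roughly as follows. This is only an illustrative sketch: the names `Offensiveness`, `HATE_GROUPS`, and `AnnotatedComment` are hypothetical and do not reflect the released corpus format, and treating the offensiveness level as optional is an assumption rather than something the abstract states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Offensiveness(Enum):
    """Second layer: offensiveness level (hypothetical encoding)."""
    SLIGHTLY = 1
    MODERATELY = 2
    HIGHLY = 3


# Third layer: the nine hate speech groups named in the abstract.
HATE_GROUPS = frozenset({
    "xenophobia", "racism", "homophobia", "sexism",
    "religious intolerance", "partyism",
    "apology for the dictatorship", "antisemitism", "fatphobia",
})


@dataclass(frozen=True)
class AnnotatedComment:
    """One comment with its three annotation layers (illustrative only)."""
    text: str
    offensive: bool                          # first layer: binary label
    level: Optional[Offensiveness] = None    # assumption: may be absent
    hate_groups: FrozenSet[str] = frozenset()

    def __post_init__(self):
        # Reject labels outside the nine groups listed in the abstract.
        unknown = self.hate_groups - HATE_GROUPS
        if unknown:
            raise ValueError(f"unknown hate speech groups: {sorted(unknown)}")


example = AnnotatedComment(
    text="<comment text>",
    offensive=True,
    level=Offensiveness.MODERATELY,
    hate_groups=frozenset({"partyism"}),
)
```

Keeping the hate speech groups as a set allows a single comment to carry more than one group label; whether the corpus permits this is not stated in the abstract.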