Detecting Hate Speech in Social Media
In this paper we examine methods to detect hate speech in social media, while
distinguishing this from general profanity. We aim to establish lexical
baselines for this task by applying supervised classification methods using a
recently released dataset annotated for this purpose. As features, our system
uses character n-grams, word n-grams and word skip-grams. We obtain results of
78% accuracy in identifying posts across three classes. Results demonstrate
that the main challenge lies in discriminating profanity and hate speech from
each other. A number of directions for future work are discussed. Comment: Proceedings of Recent Advances in Natural Language Processing
(RANLP). pp. 467-472. Varna, Bulgaria
Twist Bytes: German dialect identification with data mining optimization
We describe our approaches to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify which of four dialects spoken in the German-speaking part of Switzerland a sentence belonged to. We adopted two different meta-classifier approaches and used data mining insights to improve the preprocessing and the meta-classifier parameters. In particular, we focused on different feature extraction methods and how to combine them, since they influenced the system's performance very differently. Our system achieved second place out of 8 teams, with a macro-averaged F1 score of 64.6%. We also participated in the surprise dialect task with a multi-label approach.
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
We present the results and findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING’2018. This year, the campaign comprised five shared tasks: two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) – and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report. Non peer reviewed