
    Mapping (Dis-)Information Flow about the MH17 Plane Crash

    Digital media enables fast sharing not only of information, but also of disinformation. One prominent case of an event leading to the circulation of disinformation on social media is the MH17 plane crash. Studies analysing the spread of information about this event on Twitter have focused on small, manually annotated datasets, or have used proxies for data annotation. In this work, we examine to what extent text classifiers can be used to label data for subsequent content analysis; in particular, we focus on predicting pro-Russian and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though we find that a neural classifier improves over a hashtag-based baseline, labeling pro-Russian and pro-Ukrainian content with high precision remains a challenging problem. We provide an error analysis underlining the difficulty of the task and identify factors that might help improve classification in future work. Finally, we show how the classifier can facilitate the annotation task for human annotators.
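    The hashtag-based baseline referred to above can be illustrated with a minimal Python sketch; the hashtag lists and labels below are hypothetical placeholders, not the authors' actual lexicon.

```python
# Minimal sketch of a hashtag-based baseline for labelling MH17 tweets.
# The hashtag sets are hypothetical examples, not the paper's curated lists.
PRO_RUSSIAN_TAGS = {"#mh17truth", "#kievshotdown"}                # hypothetical
PRO_UKRAINIAN_TAGS = {"#russiainvadedukraine", "#putinsmissile"}  # hypothetical

def hashtag_baseline(tweet_text: str) -> str:
    """Assign a label by counting matches against the two hashtag sets."""
    tokens = {token.lower() for token in tweet_text.split()}
    pro_ru = len(tokens & PRO_RUSSIAN_TAGS)
    pro_ua = len(tokens & PRO_UKRAINIAN_TAGS)
    if pro_ru > pro_ua:
        return "pro-Russian"
    if pro_ua > pro_ru:
        return "pro-Ukrainian"
    return "undecided"

print(hashtag_baseline("Finally the #MH17truth is coming out"))  # pro-Russian
```

    In the paper, a neural classifier trained on annotated tweets is compared against a baseline of this kind.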

    Enhancing Hate Speech Detection in Sinhala Language on Social Media using Machine Learning

    To counter the harmful dissemination of hate speech on social media, especially abusive outbursts of racism and sexism, automatic and accurate detection is crucial. However, a significant challenge lies in the vast sparsity of available data, which hinders accurate classification. This study presents a novel approach to Sinhala hate speech detection on social platforms that couples a global feature selection process with traditional machine learning to scrutinize the intricacies of hate speech. A class-based variable feature selection process evaluates significance via global and local scores, identifying optimal values for prevalent classifiers. Utilizing class-based and corpus-based evaluations, we pinpoint optimal feature values for classifiers such as SVM, MNB, and RF. Our results reveal notable enhancements in performance, specifically in the F1-score, underscoring how feature selection and parameter tuning work in tandem to boost model efficacy. Furthermore, the study explores nuanced variations in classifier performance across training and testing datasets, emphasizing the importance of model generalization.
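    The general recipe described above (score candidate features, keep the highest-scoring ones, and train a conventional classifier) can be sketched with scikit-learn as follows; the chi-square scorer, the value of k, and the toy data are assumptions standing in for the study's class-based global/local scoring scheme and its Sinhala corpus.

```python
# Sketch of feature selection feeding a conventional classifier (SVM).
# chi2 scoring and k are illustrative stand-ins for the study's scheme.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["example hateful post", "example neutral post",
         "another hateful post", "another neutral post"]   # placeholder data
labels = [1, 0, 1, 0]                                       # 1 = hate speech

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),           # sparse lexical features
    ("select", SelectKBest(chi2, k=4)),     # keep the k highest-scoring features
    ("svm", LinearSVC()),                   # conventional ML classifier
])
pipeline.fit(texts, labels)
print(pipeline.predict(["yet another hateful post"]))
```

    Swapping the final estimator for MultinomialNB or RandomForestClassifier gives the other classifiers mentioned in the abstract.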

    Achieving Hate Speech Detection in a Low Resource Setting

    Online social networks provide people with convenient platforms to communicate and share life moments. However, because of the anonymity these social media platforms afford, cases of online hate speech are increasing. Hate speech is defined by the Cambridge Dictionary as “public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation”. Online hate speech has caused serious negative effects for legitimate users, including mental or emotional stress, reputational damage, and fear for one’s safety. To protect legitimate online users, automatic hate speech detection techniques are deployed on various social media platforms. However, most existing hate speech detection models require a large amount of labeled data for training. In this thesis, we focus on achieving hate speech detection without using many labeled samples. In particular, we focus on three scenarios of hate speech detection and propose three corresponding approaches. (i) When we only have limited labeled data for one social media platform, we fine-tune a pre-trained language model to conduct hate speech detection on that platform. (ii) When we have data from several social media platforms, each of which has only a small amount of labeled data, we develop a multitask learning model to detect hate speech on the platforms in parallel. (iii) When we aim to conduct hate speech detection on a new social media platform for which we have no labeled data, we propose to use domain adaptation to transfer knowledge from related social media platforms. Empirical studies show that our proposed approaches achieve good performance on hate speech detection in a low-resource setting.
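    Scenario (i), fine-tuning a pre-trained language model on a small labelled set for a single platform, could look roughly like the sketch below using the Hugging Face transformers library; the model name, hyperparameters, and toy data are illustrative assumptions rather than the thesis's exact setup.

```python
# Sketch of fine-tuning a pre-trained encoder for binary hate speech detection
# on a small labelled sample (scenario (i)); all values are illustrative only.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"   # any pre-trained encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

texts = ["toy hateful example", "toy benign example"]   # small labelled sample
labels = [1, 0]                                         # 1 = hate speech

class SmallDataset(torch.utils.data.Dataset):
    """Wraps the few labelled examples in the format the Trainer expects."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hate-speech-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=SmallDataset(texts, labels),
)
trainer.train()   # updates the model weights on the small labelled set
```

    Scenarios (ii) and (iii) would extend this kind of backbone with multitask learning across platforms and with domain adaptation to an unlabeled target platform, as described in the abstract.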

    Topic Modeling based text classification regarding Islamophobia using Word Embedding and Transformers Techniques

    Islamophobia is a rising area of concern in the current era, in which Muslims face discrimination and negative perceptions of their religion, Islam. Islamophobia is a type of racism practiced by individuals, groups, and organizations worldwide. Moreover, the ease of access to social media platforms and their increased usage have contributed to spreading hate speech, false information, and negative opinions about Islam. In this research study, we focus on detecting Islamophobic textual content shared on various social media platforms. We explore state-of-the-art techniques in text data mining and Natural Language Processing (NLP). The topic modelling algorithm Latent Dirichlet Allocation (LDA) is used to find the top topics. Then, word embedding approaches such as Word2Vec and Global Vectors for Word Representation (GloVe) are used as feature extraction techniques. For text classification, we utilize transformer-based Deep Learning models, namely Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT). For comparison, we conduct an extensive empirical analysis of Machine Learning and Deep Learning algorithms using conventional textual features such as Term Frequency-Inverse Document Frequency (TF-IDF), N-grams, and Bag of Words (BoW). The empirical results, evaluated using standard performance measures, show that the proposed approach effectively detects textual content related to Islamophobia. Among the Machine Learning models on the study corpus, the Support Vector Machine (SVM) performed best with an F1 score of 91%. The transformer-based NLP models and the Deep Learning model Convolutional Neural Network (CNN) combined with GloVe performed best among all techniques except SVM with BoW: GPT, SVM with BoW, and BERT yielded the best F1 scores of 92%, 92%, and 91.9%, respectively, while CNN performed slightly worse with an F1 score of 91%.
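    Two pieces of the pipeline described above, LDA for surfacing top topics and the BoW + SVM classifier that scored best among the conventional Machine Learning models, are sketched below with scikit-learn; the toy corpus, labels, and topic count are illustrative assumptions.

```python
# Sketch of LDA topic discovery plus a BoW + SVM classifier; the corpus,
# labels, and number of topics are placeholders, not the study's data.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

posts = ["toy islamophobic post", "toy neutral post",
         "another islamophobic post", "another neutral post"]  # placeholder corpus
labels = [1, 0, 1, 0]                                          # 1 = Islamophobic

bow = CountVectorizer()
X = bow.fit_transform(posts)                 # Bag-of-Words features
vocab = bow.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)                                   # unsupervised topic discovery
for idx, topic in enumerate(lda.components_):
    print(f"topic {idx}:", [vocab[i] for i in topic.argsort()[-3:]])

svm = LinearSVC().fit(X, labels)             # supervised classification on BoW
print(svm.predict(bow.transform(["yet another neutral post"])))
```

    Replacing the CountVectorizer features with Word2Vec or GloVe embeddings, or the SVM with a fine-tuned BERT or GPT model, gives the other configurations compared in the study.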

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

    Indonesian and Malay are underrepresented in the development of natural language processing (NLP) technologies, and available resources are difficult to find. A clear picture of existing work can invigorate and inform how researchers conceptualise worthwhile projects. Using an education sector project to motivate the study, we conducted a wide-ranging overview of Indonesian and Malay human language technologies and corpus work. We charted 657 included studies according to Hirschberg and Manning's 2015 description of NLP, concluding that the field was dominated by exploratory corpus work, machine reading of text gathered from the Internet, and sentiment analysis. In this paper, we identify the most published authors and research hubs, and make a number of recommendations to encourage future collaboration and efficiency within NLP in Indonesian and Malay.