
    Mapping (Dis-)Information Flow about the MH17 Plane Crash

    Digital media enables fast sharing not only of information, but also of disinformation. One prominent case of an event leading to the circulation of disinformation on social media is the MH17 plane crash. Studies analysing the spread of information about this event on Twitter have focused on small, manually annotated datasets, or have used proxies for data annotation. In this work, we examine to what extent text classifiers can be used to label data for subsequent content analysis; in particular, we focus on predicting pro-Russian and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though we find that a neural classifier improves over a hashtag-based baseline, labeling pro-Russian and pro-Ukrainian content with high precision remains a challenging problem. We provide an error analysis underlining the difficulty of the task and identify factors that might help improve classification in future work. Finally, we show how the classifier can facilitate the annotation task for human annotators.
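    The hashtag-based baseline referred to above can be illustrated with a minimal Python sketch; the hashtag lists and labels below are hypothetical placeholders, not the authors' actual lexicon.

```python
# Minimal sketch of a hashtag-based baseline for labelling MH17 tweets.
# The hashtag sets are hypothetical examples, not the paper's curated lists.
PRO_RUSSIAN_TAGS = {"#mh17truth", "#kievshotdown"}                # hypothetical
PRO_UKRAINIAN_TAGS = {"#russiainvadedukraine", "#putinsmissile"}  # hypothetical

def hashtag_baseline(tweet_text: str) -> str:
    """Assign a label by counting matches against the two hashtag sets."""
    tokens = {token.lower() for token in tweet_text.split()}
    pro_ru = len(tokens & PRO_RUSSIAN_TAGS)
    pro_ua = len(tokens & PRO_UKRAINIAN_TAGS)
    if pro_ru > pro_ua:
        return "pro-Russian"
    if pro_ua > pro_ru:
        return "pro-Ukrainian"
    return "undecided"

print(hashtag_baseline("Finally the #MH17truth is coming out"))  # pro-Russian
```

    In the paper, a neural classifier trained on annotated tweets is compared against a baseline of this kind.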

    Enhancing Hate Speech Detection in Sinhala Language on Social Media using Machine Learning

    To counter the harmful dissemination of hate speech on social media, especially abusive outbursts of racism and sexism, automatic and accurate detection is crucial. However, a significant challenge lies in the vast sparsity of available data, which hinders accurate classification. This study presents a novel approach to Sinhala hate speech detection on social platforms that couples a global feature selection process with traditional machine learning to scrutinize the intricacies of hate speech. A class-based variable feature selection process evaluates significance via global and local scores, identifying optimal values for prevalent classifiers. Utilizing class-based and corpus-based evaluations, we pinpoint optimal feature values for classifiers such as SVM, MNB, and RF. Our results reveal notable enhancements in performance, specifically in the F1-score, underscoring how feature selection and parameter tuning work in tandem to boost model efficacy. Furthermore, the study explores nuanced variations in classifier performance across training and testing datasets, emphasizing the importance of model generalization.
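    The general recipe described above (score candidate features, keep the highest-scoring ones, and train a conventional classifier) can be sketched with scikit-learn as follows; the chi-square scorer, the value of k, and the toy data are assumptions standing in for the study's class-based global/local scoring scheme and its Sinhala corpus.

```python
# Sketch of feature selection feeding a conventional classifier (SVM).
# chi2 scoring and k are illustrative stand-ins for the study's scheme.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["example hateful post", "example neutral post",
         "another hateful post", "another neutral post"]   # placeholder data
labels = [1, 0, 1, 0]                                       # 1 = hate speech

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),           # sparse lexical features
    ("select", SelectKBest(chi2, k=4)),     # keep the k highest-scoring features
    ("svm", LinearSVC()),                   # conventional ML classifier
])
pipeline.fit(texts, labels)
print(pipeline.predict(["yet another hateful post"]))
```

    Swapping the final estimator for MultinomialNB or RandomForestClassifier gives the other classifiers mentioned in the abstract.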

    Achieving Hate Speech Detection in a Low Resource Setting

    Online social networks provide people with convenient platforms to communicate and share life moments. However, because of the anonymity these social media platforms afford, cases of online hate speech are increasing. Hate speech is defined by the Cambridge Dictionary as “public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation”. Online hate speech has caused serious negative effects for legitimate users, including mental or emotional stress, reputational damage, and fear for one’s safety. To protect legitimate online users, automatic hate speech detection techniques are deployed on various social media platforms. However, most existing hate speech detection models require a large amount of labeled data for training. In this thesis, we focus on achieving hate speech detection without using many labeled samples. In particular, we focus on three scenarios of hate speech detection and propose three corresponding approaches. (i) When we only have limited labeled data for one social media platform, we fine-tune a pre-trained language model to conduct hate speech detection on that platform. (ii) When we have data from several social media platforms, each of which has only a small amount of labeled data, we develop a multitask learning model to detect hate speech on the platforms in parallel. (iii) When we aim to conduct hate speech detection on a new social media platform for which we have no labeled data, we propose to use domain adaptation to transfer knowledge from related social media platforms. Empirical studies show that our proposed approaches achieve good performance on hate speech detection in a low-resource setting.
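    Scenario (i), fine-tuning a pre-trained language model on a small labelled set for a single platform, could look roughly like the sketch below using the Hugging Face transformers library; the model name, hyperparameters, and toy data are illustrative assumptions rather than the thesis's exact setup.

```python
# Sketch of fine-tuning a pre-trained encoder for binary hate speech detection
# on a small labelled sample (scenario (i)); all values are illustrative only.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"   # any pre-trained encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

texts = ["toy hateful example", "toy benign example"]   # small labelled sample
labels = [1, 0]                                         # 1 = hate speech

class SmallDataset(torch.utils.data.Dataset):
    """Wraps the few labelled examples in the format the Trainer expects."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hate-speech-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=SmallDataset(texts, labels),
)
trainer.train()   # updates the model weights on the small labelled set
```

    Scenarios (ii) and (iii) would extend this kind of backbone with multitask learning across platforms and with domain adaptation to an unlabeled target platform, as described in the abstract.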

    Topic Modeling based text classification regarding Islamophobia using Word Embedding and Transformers Techniques

    Islamophobia is a rising area of concern in the current era, in which Muslims face discrimination and negative perceptions of their religion, Islam. Islamophobia is a type of racism practiced by individuals, groups, and organizations worldwide. Moreover, the ease of access to social media platforms and their increased usage have contributed to spreading hate speech, false information, and negative opinions about Islam. In this research study, we focus on detecting Islamophobic textual content shared on various social media platforms. We explore state-of-the-art techniques in text data mining and Natural Language Processing (NLP). The topic modelling algorithm Latent Dirichlet Allocation (LDA) is used to find the top topics. Then, word embedding approaches such as Word2Vec and Global Vectors for Word Representation (GloVe) are used as feature extraction techniques. For text classification, we utilize transformer-based Deep Learning models, namely Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT). For comparison, we conduct an extensive empirical analysis of Machine Learning and Deep Learning algorithms using conventional textual features such as Term Frequency-Inverse Document Frequency (TF-IDF), N-grams, and Bag of Words (BoW). The empirical results, evaluated using standard performance measures, show that the proposed approach effectively detects textual content related to Islamophobia. Among the Machine Learning models on the study corpus, the Support Vector Machine (SVM) performed best with an F1 score of 91%. The transformer-based NLP models and the Deep Learning model Convolutional Neural Network (CNN) combined with GloVe performed best among all techniques except SVM with BoW: GPT, SVM with BoW, and BERT yielded the best F1 scores of 92%, 92%, and 91.9%, respectively, while CNN performed slightly worse with an F1 score of 91%.
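    Two pieces of the pipeline described above, LDA for surfacing top topics and the BoW + SVM classifier that scored best among the conventional Machine Learning models, are sketched below with scikit-learn; the toy corpus, labels, and topic count are illustrative assumptions.

```python
# Sketch of LDA topic discovery plus a BoW + SVM classifier; the corpus,
# labels, and number of topics are placeholders, not the study's data.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

posts = ["toy islamophobic post", "toy neutral post",
         "another islamophobic post", "another neutral post"]  # placeholder corpus
labels = [1, 0, 1, 0]                                          # 1 = Islamophobic

bow = CountVectorizer()
X = bow.fit_transform(posts)                 # Bag-of-Words features
vocab = bow.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)                                   # unsupervised topic discovery
for idx, topic in enumerate(lda.components_):
    print(f"topic {idx}:", [vocab[i] for i in topic.argsort()[-3:]])

svm = LinearSVC().fit(X, labels)             # supervised classification on BoW
print(svm.predict(bow.transform(["yet another neutral post"])))
```

    Replacing the CountVectorizer features with Word2Vec or GloVe embeddings, or the SVM with a fine-tuned BERT or GPT model, gives the other configurations compared in the study.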

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

    Indonesian and Malay are underrepresented in the development of natural language processing (NLP) technologies, and available resources are difficult to find. A clear picture of existing work can invigorate and inform how researchers conceptualise worthwhile projects. Using an education sector project to motivate the study, we conducted a wide-ranging overview of Indonesian and Malay human language technologies and corpus work. We charted 657 included studies according to Hirschberg and Manning's 2015 description of NLP, concluding that the field was dominated by exploratory corpus work, machine reading of text gathered from the Internet, and sentiment analysis. In this paper, we identify the most published authors and research hubs, and make a number of recommendations to encourage future collaboration and efficiency within NLP in Indonesian and Malay.