
    Detecting and Monitoring Hate Speech in Twitter

    Social media are sensors in the real world that can be used to measure the pulse of societies. However, the massive and unfiltered feed of messages posted in social media raises social alarm, especially when these messages contain hate speech targeted at a specific individual or group. In this context, governments and non-governmental organizations (NGOs) are concerned about the possible negative impact that these messages can have on individuals or on society. In this paper, we present HaterNet, an intelligent system currently being used by the Spanish National Office Against Hate Crimes of the Spanish State Secretariat for Security that identifies and monitors the evolution of hate speech in Twitter. The contributions of this research are manifold: (1) It introduces the first intelligent system that monitors and visualizes, using social network analysis techniques, hate speech in social media. (2) It introduces a novel public dataset on hate speech in Spanish consisting of 6,000 expert-labeled tweets. (3) It compares several classification approaches based on different document representation strategies and text classification models. (4) The best approach consists of a combination of an LSTM+MLP neural network that takes as input the tweet's word, emoji, and expression tokens' embeddings enriched by tf-idf, and obtains an area under the curve (AUC) of 0.828 on our dataset, outperforming previous methods presented in the literature. The work by Quijano-Sanchez was supported by the Spanish Ministry of Science and Innovation grant FJCI-2016-28855. The research of Liberatore was supported by the Government of Spain, grant MTM2015-65803-R, and by the European Union's Horizon 2020 Research and Innovation Programme, under the Marie Sklodowska-Curie grant agreement No. 691161 (GEOSAFE). All the financial support is gratefully acknowledged.
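    The best-performing configuration described above (token embeddings fed through an LSTM, combined with tf-idf features in an MLP head, trained for binary hate detection with AUC as the metric) can be sketched roughly as follows. This is a minimal Keras illustration under assumed vocabulary size, sequence length, and layer widths, not the authors' exact HaterNet architecture.

```python
# Minimal sketch of an LSTM+MLP classifier that combines token-embedding and
# tf-idf inputs, in the spirit of the architecture described above. Vocabulary
# size, sequence length, tf-idf dimensionality, and layer widths are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, SEQ_LEN, TFIDF_DIM = 20_000, 50, 5_000  # assumed dimensions

tokens = layers.Input(shape=(SEQ_LEN,), name="token_ids")        # word/emoji/expression tokens
tfidf = layers.Input(shape=(TFIDF_DIM,), name="tfidf_features")  # tf-idf vector of the tweet

x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(tokens)    # token embeddings
x = layers.LSTM(64)(x)                                           # sequence (LSTM) branch
y = layers.Dense(128, activation="relu")(tfidf)                  # tf-idf (MLP) branch

merged = layers.Concatenate()([x, y])
merged = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(merged)           # P(hate speech)

model = Model(inputs=[tokens, tfidf], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.summary()
```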

    "HOT" ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media

    Harmful content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to address this issue is to develop detection models that rely on human annotations. However, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. Generative AI models have the potential to understand and detect harmful content. To investigate this potential, we used ChatGPT and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful content: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model displays a more consistent classification for non-HOT comments than HOT comments compared to human annotations. Our findings also suggest that ChatGPT classifications align with provided HOT definitions, but ChatGPT classifies "hateful" and "offensive" as subsets of "toxic." Moreover, the choice of prompts used to interact with ChatGPT impacts its performance. Based on these insights, our study provides several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understanding and reasoning of the HOT concept, and the impact of prompts on its performance. Overall, our study provides guidance about the potential of using generative AI models to moderate large volumes of user-generated content on social media.
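    A prompt-based HOT classification loop of the kind described above might look like the following sketch using the OpenAI Python client; the prompt wording, model name, and answer parsing are illustrative assumptions rather than the study's exact protocol.

```python
# Illustrative sketch of prompting a chat model for Hateful/Offensive/Toxic (HOT)
# judgements; the prompt text, model name, and answer parsing are assumptions,
# not the exact prompts or experimental setup used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

HOT_PROMPT = (
    "For the social media comment below, answer with three yes/no judgements, "
    "one per line, in the form 'hateful: yes/no', 'offensive: yes/no', 'toxic: yes/no'.\n\n"
    "Comment: {comment}"
)

def classify_hot(comment: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model
        temperature=0,           # near-deterministic classification
        messages=[{"role": "user", "content": HOT_PROMPT.format(comment=comment)}],
    )
    answer = response.choices[0].message.content.lower()
    return {label: f"{label}: yes" in answer for label in ("hateful", "offensive", "toxic")}

print(classify_hot("You people are a disgrace and should disappear."))
```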

    Diagnosing and Improving False Positive Bias in Hate Speech Classification Models

    Master's thesis -- Seoul National University Graduate School: College of Humanities, Department of Linguistics, 2022.2. Hyopil Shin. As the damage caused by hate speech in anonymous online spaces has been growing significantly, research on the detection of hate speech is being actively conducted. Recently, deep learning-based hate speech classifiers have shown great performance, but they tend to fail to generalize on out-of-domain data. I focus on the problem of false positive detection and build adversarial test sets from three different domains to diagnose this issue. I illustrate that a BERT-based classification model trained on an existing Korean hate speech corpus exhibits false positives due to over-sensitivity to specific words that have high correlations with hate speech in the training datasets. Next, I present two different approaches to address the problem: a data-centric approach that adds data to correct the imbalance of the training datasets and a model-centric approach that regularizes the model using post-hoc explanations. Both methods show improvement in reducing false positives without compromising overall model quality. In addition, I show that strategically adding negative samples from a domain similar to a test set can be a cost-efficient way of greatly reducing false positives. Using Sampling and Occlusion (Jin et al., 2020) explanations, I qualitatively demonstrate that both approaches help the model better utilize contextual information.
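    The false-positive diagnosis on adversarial test sets can be summarized with a small helper: since every sentence in such a set is non-hateful by construction, any positive prediction is a false positive. The predictions below are hypothetical placeholders, not results from the thesis.

```python
# Sketch of the false-positive diagnosis described above. All sentences in the
# adversarial test set are non-hateful by construction, so every "hate"
# prediction counts as a false positive.
def false_positive_rate(predicted_labels, positive_label="hate"):
    flagged = sum(label == positive_label for label in predicted_labels)
    return flagged / len(predicted_labels)

# Hypothetical predictions of a BERT-based classifier on keyword-bearing,
# non-hateful sentences from one adversarial domain.
predictions = ["hate", "none", "hate", "none", "none"]
print(f"False positive rate: {false_positive_rate(predictions):.0%}")  # -> 40%
```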

    Toward Inclusive Online Environments: Counterfactual-Inspired XAI for Detecting and Interpreting Hateful and Offensive Tweets

    The prevalence of hate speech and offensive language on social media platforms such as Twitter has significant consequences, ranging from psychological harm to the polarization of societies. Consequently, social media companies have implemented content moderation measures to curb harmful or discriminatory language. However, a lack of consistency and transparency hinders their ability to achieve desired outcomes. This article evaluates various ML models, including an ensemble, Explainable Boosting Machine (EBM), and Linear Support Vector Classifier (SVC), on a public dataset of 24,792 tweets by T. Davidson, categorizing tweets into three classes: hate, offensive, and neither. The top-performing model achieves a weighted F1-Score of 0.90. Furthermore, this article interprets the output of the best-performing model using LIME and SHAP, elucidating how specific words and phrases within a tweet contextually impact its classification. This analysis helps to shed light on the linguistic aspects of hate and offense. Additionally, we employ LIME to present a suggestive counterfactual approach, proposing no-hate alternatives for a tweet to further explain the influence of word choices in context. Limitations of the study include the potential for biased results due to dataset imbalance, which future research may address by exploring more balanced datasets or leveraging additional features. Ultimately, through these explanations, this work aims to promote digital literacy and foster an inclusive online environment that encourages informed and responsible use of digital technologies.
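    The LIME step described above can be sketched as follows with a TF-IDF pipeline and a calibrated Linear SVC; the tiny training corpus and label scheme are placeholders for the 24,792-tweet Davidson dataset, not the article's actual setup.

```python
# Minimal sketch of the LIME explanation step: a TF-IDF + calibrated Linear SVC
# pipeline classifies tweets into hate / offensive / neither, and LIME shows
# which words push a tweet toward the "hate" class. The toy corpus is a
# placeholder, not the Davidson dataset used in the article.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from lime.lime_text import LimeTextExplainer

texts = ["i hate you people", "you people disgust me",
         "that joke was offensive", "what an offensive thing to say",
         "have a nice day", "what a lovely photo"]
labels = [0, 0, 1, 1, 2, 2]                     # 0=hate, 1=offensive, 2=neither
class_names = ["hate", "offensive", "neither"]

# Calibration wraps the Linear SVC so it exposes the predict_proba LIME needs.
model = make_pipeline(TfidfVectorizer(), CalibratedClassifierCV(LinearSVC(), cv=2))
model.fit(texts, labels)

explainer = LimeTextExplainer(class_names=class_names)
explanation = explainer.explain_instance("i hate that offensive joke",
                                         model.predict_proba,
                                         labels=[0], num_features=5)
print(explanation.as_list(label=0))  # word weights pushing toward / away from "hate"
```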

    Hate Speech Detection in a mix of English and Hindi-English (Code-Mixed) Tweets

    With the increasing usage of social networking platforms seen over recent years, there has been an extensive rise in hate speech among users. Hence, governments and social media platforms face considerable responsibility and challenges in controlling, detecting, and eliminating massively growing hateful content as early as possible to prevent future criminal acts such as cyber violence and real-life hate crimes. Since Twitter is used globally by people from various backgrounds and nationalities, the platform contains tweets posted in different languages, including code-mixed language, namely Hindi-English. Due to the informal format of tweets, with variations in spelling and grammar, hate speech detection is challenging, especially in code-mixed text containing a mixture of different languages. In this paper, we tackle the critical issue of hate speech on social media, with a focus on a mix of English and Hindi-English (code-mixed) text messages (tweets) on Twitter. We perform hate speech classification using the benefits of character-level embedding representations of tweets and Deep Neural Networks (DNN). We built two architectures, namely a Convolutional Neural Network (CNN) and a combination of CNN and Long Short-Term Memory (LSTM) networks with character-level embeddings, as an improvement over Elouali et al. (2020)'s work. Both models were trained using an imbalanced (original) as well as an oversampled (balanced) version of the training dataset and were evaluated on the test set. Extensive experimental analysis was performed by tuning the hyperparameters of our models and evaluating their performance in terms of accuracy, efficiency (runtime), and scalability in detecting whether a tweet is hate speech or non-hate. The performance of our proposed models is compared with Elouali et al. (2020)'s model, and it is observed that our method has improved accuracy, a significantly improved runtime, and is scalable. Among our best performing models, CNN-LSTM performed slightly better than CNN, with an accuracy of 88.97%.
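    A character-level CNN-LSTM classifier of the kind described above can be sketched as below; the alphabet size, maximum tweet length, and layer sizes are illustrative assumptions, not the paper's tuned hyperparameters.

```python
# Sketch of a character-level CNN-LSTM binary classifier in the spirit of the
# approach above; alphabet size, tweet length, and layer sizes are assumptions.
from tensorflow.keras import layers, models

NUM_CHARS, MAX_LEN = 256, 280   # characters mapped to integer ids, tweets padded to MAX_LEN

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(NUM_CHARS, 32),                      # character-level embeddings
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # local character n-gram features
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                      # longer-range dependencies
    layers.Dense(1, activation="sigmoid"),                # hate vs. non-hate
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```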

    Cyber bullying identification and tackling using natural language processing techniques

    As offensive content has a detrimental influence on the internet and especially on social media, there has been much research on identifying cyberbullying posts in social media datasets. Previous works on this topic have overlooked the problems of cyberbullying category detection, the impact of feature choice, negation handling, and dataset construction. Indeed, many natural language processing (NLP) tasks, including cyberbullying detection in texts, lack comprehensive manually labeled datasets, limiting the application of powerful supervised machine learning algorithms, including neural networks. Equally, it is challenging to collect large-scale data for a particular NLP project due to the inherent subjectivity of the labeling task and the manual effort involved. For this purpose, this thesis attempts to contribute to these challenges as follows. We first collected and annotated a multi-category cyberbullying (10K) dataset from the social network platform ask.fm. Besides, we used another publicly available labeled cyberbullying dataset, 'Formspring', for comparison and ground truth establishment. We devised a machine learning-based methodology that uses five distinct feature engineering approaches and six different classifiers. The results showed that a CNN classifier with word-embedding features yielded the best performance among all state-of-the-art classifiers, with a detection accuracy of 93% for the AskFm dataset and 92% for the FormSpring dataset. We performed cyberbullying category detection, where the CNN architecture still provided the best performance, with 81% accuracy and 78% F1-score on average. Our second purpose was to handle the lack of relevant cyberbullying instances in the training dataset through data augmentation. To this end, we developed an approach that makes use of word-sense disambiguation with WordNet-aided semantic expansion. The disambiguation and semantic expansion were intended to overcome several limitations of social media (SM) posts/comments, such as unstructured and limited semantic content, while capturing equivalent instances induced by the word-sense disambiguation-based approach. We ran several experiments with disambiguation/semantic expansion to estimate the impact on classification performance using both the original and the augmented datasets. Finally, we compared the accuracy scores for cyberbullying detection with some widely used classifiers before and after the dataset augmentation. The outcome supports the advantage of the data-augmentation strategy, which yielded 99% classifier accuracy, a 5% improvement from the base score of 93%. Our third goal, related to negation handling, was motivated by the intuitive impact of negation on cyberbullying statements and their detection. Our proposed approach advocates a classification-like technique using NegEx and POS tagging that relies on a particular data design procedure for negation detection. Performance with and without the negation-handling approach is compared and discussed. The result showed 95% accuracy for the negation-handled dataset, which corresponds to an overall accuracy improvement of 2% from the base score of 93%. Our final goal was to develop a software tool using our machine learning models that helps to test our experiments and provides a real-life use case for both end-users and the research community. To achieve this objective, a Python-based web application was developed and successfully tested.
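    The WordNet-aided semantic expansion used for data augmentation can be illustrated with a simplified sketch that swaps words for synonyms of their first WordNet sense; the full approach applies word-sense disambiguation to pick the sense, which is omitted here for brevity.

```python
# Simplified sketch of WordNet-aided semantic expansion for data augmentation:
# generate variants of a post by swapping words for synonyms. The first synset
# is used as a naive stand-in for the word-sense disambiguation step.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def expand(sentence: str, max_variants: int = 3):
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        synsets = wn.synsets(word)
        if not synsets:
            continue
        for lemma in synsets[0].lemma_names():            # naive sense choice, not true WSD
            if lemma.lower() != word.lower():
                variant = words[:i] + [lemma.replace("_", " ")] + words[i + 1:]
                variants.append(" ".join(variant))
                break
        if len(variants) >= max_variants:
            break
    return variants

print(expand("you are a horrible person"))
```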