798 research outputs found

    Hate Speech Detection in a mix of English and Hindi-English (Code-Mixed) Tweets

    Get PDF
    With the increasing usage of social networking platforms seen over recent years, there has been an extensive rise in hate speech usage between the users. Hence, Government and social media platforms face lots of responsibility and challenges to control, detect and eliminate massively growing hateful content as early as possible to prevent future criminal acts such as cyber violence and real-life hate crimes. Since Twitter is used globally by people from various backgrounds and nationalities, the platform contains tweets posted in different languages, including code-mixed language, namely Hindi-English. Due to the informal format of tweets with variations in spelling and grammar, hate speech detection is challenging, especially in code-mixed text containing a mixture of different languages. In this paper, we tackle the critical issue of hate speech on social media, with a focus on a mix of English and Hindi-English (code-mixed) text messages (tweets) on Twitter. We perform hate speech classification using the benefits of character-level embedding representations of tweets and Deep Neural Networks (DNN). We built two architectures, namely Convolutional Neural Network (CNN) and a combination of CNN and Long Short-Term Memory (LSTM) algorithms with character-level embedding as an improvement over Elouali et al. (2020)’s work. Both the models were trained using an imbalanced (original) as well as oversampled (balanced) version of the training dataset and were evaluated on the test set. Extensive experimental analysis was performed by tuning the hyperparameters of our models and evaluating their performance in terms of accuracy, efficiency (runtime) and scalability in detecting whether a tweet is hate speech or non-hate. The performance of our proposed models is compared with Elouali et al. (2020)’s model, and it is observed that our method has an improved accuracy and a significantly improved runtime and is scalable. Among our best performing models, CNN-LSTM performed slightly better than CNN with an accuracy of 88.97%

    Online multilingual hate speech detection: Experimenting with hindi and english social media

    Get PDF
    The last two decades have seen an exponential increase in the use of the Internet and social media, which has changed basic human interaction. This has led to many positive outcomes. At the same time, it has brought risks and harms. The volume of harmful content online, such as hate speech, is not manageable by humans. The interest in the academic community to investigate automated means for hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset. Having classified them into three classes, abusive, hateful or neither, we create a baseline model and improve model performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effective metric in near-real-time and uses the same feedback to re-train our model. We prove the competitive performance of our multilingual model in two languages, English and Hindi. This leads to comparable or superior performance to most monolingual models

    Uncovering Political Hate Speech During Indian Election Campaign: A New Low-Resource Dataset and Baselines

    Full text link
    The detection of hate speech in political discourse is a critical issue, and this becomes even more challenging in low-resource languages. To address this issue, we introduce a new dataset named IEHate, which contains 11,457 manually annotated Hindi tweets related to the Indian Assembly Election Campaign from November 1, 2021, to March 9, 2022. We performed a detailed analysis of the dataset, focusing on the prevalence of hate speech in political communication and the different forms of hateful language used. Additionally, we benchmark the dataset using a range of machine learning, deep learning, and transformer-based algorithms. Our experiments reveal that the performance of these models can be further improved, highlighting the need for more advanced techniques for hate speech detection in low-resource languages. In particular, the relatively higher score of human evaluation over algorithms emphasizes the importance of utilizing both human and automated approaches for effective hate speech moderation. Our IEHate dataset can serve as a valuable resource for researchers and practitioners working on developing and evaluating hate speech detection techniques in low-resource languages. Overall, our work underscores the importance of addressing the challenges of identifying and mitigating hate speech in political discourse, particularly in the context of low-resource languages. The dataset and resources for this work are made available at https://github.com/Farhan-jafri/Indian-Election.Comment: Accepted to ICWSM Workshop (MEDIATE
    • …
    corecore