4 research outputs found

    Online multilingual hate speech detection: Experimenting with Hindi and English social media

    The last two decades have seen an exponential increase in the use of the Internet and social media, which has changed basic human interaction. This has led to many positive outcomes. At the same time, it has brought risks and harms. The volume of harmful content online, such as hate speech, is too large for humans to manage. Interest within the academic community in automated means of hate speech detection has therefore increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset. Having classified them into three classes (abusive, hateful or neither), we create a baseline model and improve model performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effective metric in near-real-time and uses the same feedback to re-train our model. We demonstrate the competitive performance of our multilingual model in two languages, English and Hindi. This leads to comparable or superior performance to most monolingual models.

    Hate Speech Detection in a mix of English and Hindi-English (Code-Mixed) Tweets

    With the increasing usage of social networking platforms seen over recent years, there has been an extensive rise in hate speech among users. Hence, governments and social media platforms face considerable responsibility and challenges in controlling, detecting and eliminating rapidly growing hateful content as early as possible to prevent future criminal acts such as cyber violence and real-life hate crimes. Since Twitter is used globally by people from various backgrounds and nationalities, the platform contains tweets posted in different languages, including code-mixed language, namely Hindi-English. Due to the informal format of tweets with variations in spelling and grammar, hate speech detection is challenging, especially in code-mixed text containing a mixture of different languages. In this paper, we tackle the critical issue of hate speech on social media, with a focus on a mix of English and Hindi-English (code-mixed) text messages (tweets) on Twitter. We perform hate speech classification using character-level embedding representations of tweets and Deep Neural Networks (DNN). We built two architectures, namely a Convolutional Neural Network (CNN) and a combination of CNN and Long Short-Term Memory (LSTM) algorithms, with character-level embedding as an improvement over the work of Elouali et al. (2020). Both models were trained using an imbalanced (original) as well as an oversampled (balanced) version of the training dataset and were evaluated on the test set. Extensive experimental analysis was performed by tuning the hyperparameters of our models and evaluating their performance in terms of accuracy, efficiency (runtime) and scalability in detecting whether a tweet is hate speech or non-hate. The performance of our proposed models is compared with the model of Elouali et al. (2020), and it is observed that our method has improved accuracy and a significantly improved runtime, and is scalable.
    Among our best performing models, CNN-LSTM performed slightly better than CNN with an accuracy of 88.97%.

    Tackling Sexist Hate Speech: Cross-Lingual Detection and Multilingual Insights from Social Media

    With the widespread use of social media, the proliferation of online communication presents both opportunities and challenges for fostering a respectful and inclusive digital environment. Due to the anonymity and weak regulations of social media platforms, the rise of hate speech has become a significant concern, particularly against specific individuals or groups based on race, religion, ethnicity, or gender, posing a severe threat to human rights. Sexist hate speech is a prevalent form of online hate that often manifests itself through gender-based violence and discrimination, challenging societal norms and legal systems. Despite the advances in natural language processing techniques for detecting offensive and sexist content, most research still focuses on monolingual (primarily English) contexts, neglecting the multilingual nature of online platforms. This gap highlights the need for effective and scalable strategies to address the linguistic diversity and cultural variations in hate speech. Cross-language transfer learning and state-of-the-art multilingual pre-trained language models provide potential solutions to improve the detection efficiency of low-resource languages by leveraging data from high-resource languages. Additional knowledge is crucial to facilitate the models’ performance in detecting culturally varying expressions of sexist hate speech in different languages. In this thesis, we delve into the complex area of identifying sexist hate speech in social media across diverse languages pertaining to different language families, with a focus on sexism and a broad exploration of datasets, methodologies, and barriers inherent in mitigating online hate speech in cross-lingual and multilingual scenarios. We primarily apply cross-lingual transfer learning techniques to detect sexist hate speech, aiming to leverage knowledge acquired from related linguistic data in order to improve performance in a target language. 
    We also investigate the integration of external knowledge to deepen the understanding of sexism in multilingual social media contexts, addressing both the challenges of linguistic diversity and the need for comprehensive, culturally sensitive hate speech detection models. Specifically, the thesis embarks on a comprehensive survey of tackling cross-lingual hate speech online, summarising existing datasets and cross-lingual approaches, as well as highlighting challenges and frontiers in this field. It then presents a first contribution to the field, the creation of the Sina Weibo Sexism Review (SWSR) dataset in Chinese, a pioneering resource that not only fills a crucial gap in limited resources but also lays the foundation for relevant cross-lingual investigations. Additionally, it examines how cross-lingual techniques can be utilised to generate domain-aware word embeddings, and explores the application of these embeddings in a cross-lingual hate speech framework, thereby enhancing the capacity to capture the subtleties of sexist hate speech across diverse languages. Recognising the significance of linguistic nuances in multilingual and cross-lingual settings, a further contribution is the proposal and evaluation of a series of multilingual and cross-lingual models tailored for detecting sexist hate speech. By leveraging the capacity of shared knowledge and features across languages, these models significantly advance the state-of-the-art in identifying online sexist hate speech. As societies continue to deal with the complexities of social media, the findings and methodologies presented in this thesis could effectively help foster more inclusive and respectful online content across languages.