
    Comparative Studies of Detecting Abusive Language on Twitter

    The context-dependent nature of online aggression makes annotating large collections of data extremely difficult. Previously studied datasets in abusive language detection have been insufficient in size to efficiently train deep learning models. Recently, Hate and Abusive Speech on Twitter, a dataset much greater in size and reliability, has been released. However, this dataset has not yet been comprehensively studied. In this paper, we conduct the first comparative study of various learning models on Hate and Abusive Speech on Twitter, and discuss the possibility of using additional features and context data for improvements. Experimental results show that a bidirectional GRU network trained on word-level features, with Latent Topic Clustering modules, is the most accurate model, scoring 0.805 F1.
    Comment: ALW2: 2nd Workshop on Abusive Language Online, to be held at EMNLP 2018 (Brussels, Belgium), October 31st, 2018
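The core building block of the best-performing model above is the GRU. As a minimal illustration (not a reproduction of the paper's bidirectional word-level model with Latent Topic Clustering), the forward step of a single GRU cell can be sketched in NumPy; all dimensions and weights below are arbitrary placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU forward step: update gate z, reset gate r, candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)                # update gate
    r = sigmoid(Wr @ x + Ur @ h)                # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))    # candidate hidden state
    return (1.0 - z) * h + z * h_tilde          # interpolate old and new state

rng = np.random.default_rng(0)
d_in, d_h = 8, 4                                # toy dimensions
params = [rng.normal(size=(d_h, d_in)) if i % 2 == 0
          else rng.normal(size=(d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for _ in range(5):                              # run over a 5-token sequence
    x = rng.normal(size=d_in)                   # stand-in word embedding
    h = gru_step(x, h, params)
print(h.shape)
```

A bidirectional variant runs a second cell over the reversed sequence and concatenates the two final states.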

    A comparison of classification models to detect cyberbullying in the Peruvian Spanish language on twitter

    Cyberbullying is a social problem in which bullies’ actions are more harmful than in traditional forms of bullying, as bullies have the power to repeatedly humiliate the victim in front of an entire community through social media. Nowadays, multiple works aim at detecting acts of cyberbullying via the analysis of texts in social media publications written in one or more languages; however, few investigations target cyberbullying detection in Spanish. In this work, we compare the performance of four traditional supervised machine learning methods in detecting cyberbullying via the identification of four cyberbullying-related categories in Twitter posts written in Peruvian Spanish. Specifically, we trained and tested Naive Bayes, Multinomial Logistic Regression, Support Vector Machine, and Random Forest classifiers on a dataset manually annotated with the help of human participants. The results indicate that the best-performing classifier for the cyberbullying detection task was the Support Vector Machine.
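The four-classifier comparison described above can be sketched with scikit-learn. The toy English sentences below are placeholders for the annotated Peruvian Spanish tweets, which are not reproduced here, and the tiny corpus is for illustration only:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder corpus; 1 = cyberbullying-related, 0 = neutral.
texts = ["you are worthless and everyone hates you",
         "nobody likes you just disappear already",
         "had a great time at the concert last night",
         "looking forward to the weekend with friends",
         "you are so stupid it hurts to read this",
         "congrats on the new job, well deserved"] * 5
labels = [1, 1, 0, 0, 1, 0] * 5

models = {"NB": MultinomialNB(),
          "LR": LogisticRegression(max_iter=1000),
          "SVM": LinearSVC(),
          "RF": RandomForestClassifier(n_estimators=100, random_state=0)}

scores = {}
for name, clf in models.items():
    # TF-IDF features feed each classifier; cross-validated F1 compares them.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, texts, labels, cv=3,
                                   scoring="f1").mean()
    print(f"{name}: F1 = {scores[name]:.3f}")
```

On a real annotated dataset, the per-category comparison would repeat this with the four cyberbullying-related labels.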

    Improving hate speech detection using machine and deep learning techniques: A preliminary study

    The increasing use of social media and information sharing has given major benefits to humanity. However, it has also given rise to a variety of challenges, including the spread of hate speech messages. To address this emerging issue, recent studies have employed a variety of feature engineering techniques together with machine learning or deep learning algorithms to automatically detect hate speech messages in different datasets. However, most of these studies classify hate-speech-related messages using existing feature engineering approaches and suffer from low classification performance, because those approaches are affected by the word-order and word-context problems. In this research, we identify hateful content in recent tweets from Twitter and classify it into several categories: Ethnicity, Nationality, Religion, Gender, Sexual Orientation, Disability, and Other. These categories are further classified to identify the targets of hate speech; for example, Black, White, and Asian fall under Ethnicity, while Muslims, Jews, and Christians fall under Religion. An evaluation is performed on the hateful content identified, comparing the deep learning model LSTM against traditional machine learning models (Linear SVC, Logistic Regression, Random Forest, and Multinomial Naïve Bayes) in terms of accuracy and precision on live tweets extracted from Twitter, which serve as our test dataset.
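The "word order problem" mentioned above can be demonstrated directly: bag-of-words features assign identical vectors to reorderings of the same words, so any classifier built on them cannot distinguish sentences that differ only in order, whereas a sequence model such as an LSTM consumes the order directly. A minimal sketch with invented example sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with different meanings but identical word multisets.
a = "the mods should ban this user"
b = "this user should ban the mods"

vec = CountVectorizer()
X = vec.fit_transform([a, b]).toarray()

# The bag-of-words rows are identical, so a classifier over these
# features necessarily treats the two sentences as the same input.
print(np.array_equal(X[0], X[1]))  # True
```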

    Operation Heron – Latent topic changes in an abusive letter series

    The paper presents a two-part forensic linguistic analysis of a historic collection of abuse letters, sent to individuals in the public eye and to individuals’ private homes between 2007 and 2009. We employ the technique of structural topic modelling (STM) to identify distinctions in the core topics of the letters, gauging the value of this relatively underused methodology in forensic linguistics. Four key topics were identified in the letters: Politics A, Politics B, Healthcare, and Immigration; their coherence, correlation, and shifts over time were evaluated. Following the STM, a qualitative corpus linguistic analysis was undertaken, coding concordance lines according to topic, with inter-coder reliability tested. This coding demonstrated that various connected statements within the same topic tend to gain or lose prevalence over time, and ultimately confirmed the consistency of content within the four topics identified through STM throughout the letter series. The discussion and conclusions reflect on the findings and consider the utility of these methodologies for linguistics, and forensic linguistics in particular. The study demonstrates real value in revisiting a forensic linguistic dataset such as this to test and develop methodologies for the field.

    Linguistic variation across Twitter and Twitter trolling

    Trolling is used to label a variety of behaviours, from the spread of misinformation and hyperbole to targeted abuse and malicious attacks. Despite this, little is known about how trolling varies linguistically and what its major linguistic repertoires and communicative functions are in comparison to general social media posts. Consequently, this dissertation collects two corpora of tweets – a general English Twitter corpus and a Twitter trolling corpus built from other Twitter users’ accusations – and introduces and applies a new short-text version of Multi-Dimensional Analysis to each corpus, designed to identify aggregated dimensions of linguistic variation across them. The analysis finds that trolling tweets and general tweets differ only on the final dimension of linguistic variation, and share the following linguistic repertoires: “Informational versus Interactive”, “Personal versus Other Description”, and “Promotional versus Oppositional”. Moreover, the analysis compares trolling tweets against the general corpus’s dimensions and finds that the two are remarkably more similar than different in their distribution along all dimensions. These findings counter various theories of trolling and problematise the notion that trolling can be detected automatically using grammatical variation. Overall, this dissertation provides empirical evidence on how trolling and general tweets vary linguistically.
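The idea of extracting aggregated dimensions from co-varying grammatical features can be sketched as follows. Multi-Dimensional Analysis proper uses exploratory factor analysis over a large inventory of tagged features; PCA over a tiny invented feature set (hypothetical per-tweet rates of pronouns, nouns, questions, imperatives) is used here only as a simpler stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated per-tweet feature rates for two registers; all values invented.
rng = np.random.default_rng(1)
interactive = rng.normal([8, 2, 3, 2], 1.0, size=(50, 4))      # chatty tweets
informational = rng.normal([2, 9, 0.5, 0.5], 1.0, size=(50, 4))  # newsy tweets
X = np.vstack([interactive, informational])

# Standardise features, then extract two aggregated dimensions.
pca = PCA(n_components=2)
scores = pca.fit_transform((X - X.mean(0)) / X.std(0))

# The two registers separate along the first dimension, analogous to an
# "Informational versus Interactive" dimension.
print(scores[:50, 0].mean(), scores[50:, 0].mean())
```

In the dissertation's setting, comparing trolling and general tweets amounts to comparing their score distributions along each extracted dimension.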