348 research outputs found

    Relationship Between Personality Patterns and Harmfulness: Analysis and Prediction Based on Sentence Embedding

    Get PDF
    This paper hypothesizes that harmful utterances must be judged in the context of whole sentences, and the authors extract features of harmful expressions using a general-purpose language model. Based on the extracted features, they propose a method to predict the presence or absence of harmful categories. They further argue that users who incite others can be analyzed by combining this method with research on inferring a speaker's personality from statements on social networking sites. The results confirm that the proposed method judges the possibility of harmful comments more accurately than simple dictionary-based models or models using distributed word representations. An analysis based on the harmfulness judgment model also confirmed the relationship between personality patterns and harmful expressions.
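    The sentence-level pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors are random placeholders standing in for sentence embeddings from a general-purpose language model, and the dimensions and labels are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder sentence embeddings (in practice: whole-sentence vectors
# from a general-purpose language model), one row per sentence.
rng = np.random.default_rng(0)
X_harmful = rng.normal(loc=1.0, size=(50, 16))   # sentences labeled harmful
X_benign = rng.normal(loc=-1.0, size=(50, 16))   # sentences labeled benign
X = np.vstack([X_harmful, X_benign])
y = np.array([1] * 50 + [0] * 50)  # 1 = harmful category present

# Binary classifier predicting presence/absence of a harmful category
# from whole-sentence features, rather than from a word dictionary.
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```

    Because the classifier sees the whole sentence embedding, it can pick up contextual cues that a word-level dictionary lookup would miss, which is the contrast the abstract draws.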

    Dynamics of online hate and misinformation

    Get PDF
    Online debates are often characterised by extreme polarisation and heated discussions among users. The presence of hate speech online is becoming increasingly problematic, making the development of appropriate countermeasures necessary. In this work, we perform hate speech detection on a corpus of more than one million comments on YouTube videos through a machine learning model, trained and fine-tuned on a large set of hand-annotated data. Our analysis shows no evidence of “pure haters”, meant as active users posting exclusively hateful comments. Moreover, coherently with the echo chamber hypothesis, we find that users skewed towards one of the two categories of video channels (questionable, reliable) are more prone to use inappropriate, violent, or hateful language within their opponents’ community. Interestingly, users loyal to reliable sources use, on average, more toxic language than their counterparts. Finally, we find that the overall toxicity of a discussion increases with its length, measured both in number of comments and in time. Our results show that, coherently with Godwin’s law, online debates tend to degenerate towards increasingly toxic exchanges of views.
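    The “pure hater” check amounts to asking whether any sufficiently active user posts exclusively hateful comments. A toy version of that computation, with invented comment data and an invented activity threshold standing in for the paper's model-labeled YouTube corpus:

```python
from collections import defaultdict

# Toy data: (user, is_hateful) pairs standing in for model-labeled comments.
comments = [
    ("alice", False), ("alice", True), ("alice", False),
    ("bob", False), ("bob", False),
    ("carol", True), ("carol", False), ("carol", True), ("carol", False),
]

totals = defaultdict(int)
hateful = defaultdict(int)
for user, is_hate in comments:
    totals[user] += 1
    hateful[user] += is_hate

# A "pure hater" would be an active user whose comments are all hateful.
MIN_ACTIVITY = 3  # only judge users with enough comments (assumed threshold)
pure_haters = [u for u in totals
               if totals[u] >= MIN_ACTIVITY and hateful[u] == totals[u]]
print(pure_haters)
```

    In this toy sample, as in the paper's corpus, hateful comments are spread across users who mostly post non-hateful content, so the list comes out empty.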

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Get PDF

    Sentiment polarity shifters: creating lexical resources through manual annotation and bootstrapped machine learning

    Get PDF
    Alleviating pain is good and abandoning hope is bad. We instinctively understand how words like "alleviate" and "abandon" affect the polarity of a phrase, inverting or weakening it. When these words are content words, such as verbs, nouns and adjectives, we refer to them as polarity shifters. Shifters are a frequent occurrence in human language and an important part of successfully modeling negation in sentiment analysis; yet research on negation modeling has focussed almost exclusively on a small handful of closed-class negation words, such as "not", "no" and "without". A major reason for this is that shifters are far more lexically diverse than negation words, yet no resources exist to help identify them. We seek to remedy this lack of shifter resources. Our most central step towards this is the creation of a large lexicon of polarity shifters that covers verbs, nouns and adjectives. To reduce the prohibitive cost of such a large annotation task, we develop a bootstrapping approach that combines automatic classification with human verification. This ensures the high quality of our lexicon while reducing annotation cost by over 70%. In designing the bootstrap classifier we develop a variety of features which use both existing semantic resources and linguistically informed text patterns. In addition, we investigate how knowledge about polarity shifters might be shared across different parts of speech, highlighting both the potential and limitations of such an approach. The applicability of our bootstrapping approach extends beyond the creation of a single resource. We show how it can further be used to introduce polarity shifter resources for other languages. Through the example case of German we show that all our features are transferable to other languages. Keeping in mind the requirements of under-resourced languages, we also explore how well a classifier would do when relying only on data-driven, but not resource-driven, features.
    We also introduce ways to use cross-lingual information, leveraging the shifter resources we previously created for other languages. Apart from the general question of which words can be polarity shifters, we also explore a number of other factors. One of these is the matter of shifting direction, which indicates whether a shifter affects positive polarities, negative polarities or can shift in either direction. Using a supervised classifier we add shifting direction information to our bootstrapped lexicon. For other aspects of polarity shifting, manual annotation is preferable to automatic classification. Not every word that can cause polarity shifting does so in all of its word senses. As word sense disambiguation technology is not robust enough to allow the automatic handling of such nuances, we manually create a complete sense-level annotation of verbal polarity shifters. To verify the usefulness of the lexica we create, we provide an extrinsic evaluation in which we apply them to a sentiment analysis task. In this task the different lexica are not only compared amongst each other, but also against a state-of-the-art compositional polarity neural network classifier that has been shown to implicitly learn the negating effect of negation words from a training corpus. However, we find that the same is not true for the far more lexically diverse polarity shifters. Instead, the use of the explicit knowledge provided by our shifter lexica brings clear gains in performance. Deutsche Forschungsgesellschaft
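    The bootstrapping idea, automatic classification proposing candidates and a human verifying only the top-ranked ones, can be sketched with a toy loop. Everything here is invented for illustration: the scoring function stands in for the thesis's feature-based classifier, and a fixed gold set stands in for the human annotator.

```python
# Mock "human annotator": a gold set of known polarity shifters.
GOLD_SHIFTERS = {"alleviate", "abandon", "prevent", "lack"}

def classifier_score(word):
    # Stand-in for a bootstrap classifier built on semantic resources and
    # text patterns; here, true shifters simply happen to score higher.
    return 0.9 if word in GOLD_SHIFTERS else 0.2

candidates = ["alleviate", "table", "abandon", "run", "prevent", "lack", "blue"]

# Rank all candidates automatically, then hand only small top-ranked
# batches to the human verifier, instead of annotating the full vocabulary.
lexicon = set()
batch_size = 3
remaining = sorted(candidates, key=classifier_score, reverse=True)
while remaining:
    batch, remaining = remaining[:batch_size], remaining[batch_size:]
    lexicon.update(w for w in batch if w in GOLD_SHIFTERS)  # human check

print(sorted(lexicon))
```

    The cost saving comes from the ranking: the human mostly sees likely shifters, so far fewer annotation decisions are wasted on obvious non-shifters, which is how the thesis reports cutting annotation cost by over 70%.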

    Offensive Language Detection in Tweets Using Machine Learning Methods

    Get PDF
    Undoubtedly, offensive language has become ubiquitous in social media over the last years due to the increasing popularity of social media platforms. The growing number of users who tend to post offensive content targeting individuals or groups has led to significant repercussions not only for the well-being of the targets, but also for society itself. This has raised concern among governments, social media companies, and academic and social communities, who have made concerted efforts to curb the dissemination of offensive language online and create a safer digital space. Nevertheless, despite their endeavors, the need to rapidly process huge amounts of content in order to detect and report offensive language has made the development of machine learning systems more than imperative. Consequently, in the present thesis, three different machine learning models, which perform binary text classification, are introduced to detect offensive language in English texts from Twitter. The proposed models, which comprise two simple classifiers and a Bidirectional Stacked LSTM, utilize contextual embeddings pooled from BERT-Large-Uncased by fine-tuning its various layers on four training datasets combined into one. The data preparation process involved data cleaning and preprocessing as well as down-sampling to handle class imbalance. The effectiveness of the proposed methods is evaluated on two available test sets, OLID 2019 and OLID 2020, based on six metrics as well as the learning curves of loss and accuracy.
    Comparative analysis between these methods demonstrates that the concatenation of the last four hidden layers of BERT fed into a classifier outperforms the other models, achieving 77.8% and 86.8% Macro-F1 scores on the two test sets respectively. Comparison with previous related methods indicates that, although the results are satisfactory, there is room for further experimentation and improvement in the future.
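    The winning feature construction, concatenating BERT's last four hidden layers before classification, can be sketched with numpy stand-ins. The arrays below are random placeholders for the per-layer hidden states a fine-tuned BERT-Large would return; the sequence length and mean pooling are illustrative assumptions, while the hidden size (1024) and layer count (24) are BERT-Large's actual dimensions.

```python
import numpy as np

HIDDEN = 1024   # BERT-Large hidden size
LAYERS = 24     # number of transformer layers in BERT-Large
SEQ_LEN = 32    # tokens in one (padded) tweet — illustrative

# Stand-in for the per-layer hidden states for one tweet:
# a list of (seq_len, hidden) arrays, one per layer.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(SEQ_LEN, HIDDEN)) for _ in range(LAYERS)]

# Concatenate the last four hidden layers along the feature axis,
# then pool over tokens to get one fixed-size vector per tweet.
last_four = np.concatenate(hidden_states[-4:], axis=-1)  # (SEQ_LEN, 4*HIDDEN)
features = last_four.mean(axis=0)                        # (4*HIDDEN,)
print(features.shape)  # (4096,)
```

    The resulting 4096-dimensional vector is what gets fed into the downstream classifier; using several top layers rather than only the final one exposes the classifier to multiple levels of contextual representation.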

    Fake News in the era of online intentional misinformation; a review of existing approaches

    Get PDF
    Thesis, University of Macedonia, Thessaloniki, 2019. Fake news is probably one of the most discussed issues of recent years. The term gained greater legitimacy after Collins Dictionary named it word of the year, citing its “ubiquitous presence” over 2017. However, the fake news issue has not yet been deeply researched. This thesis therefore gathers definitions of the term “fake news” from the literature and extracts specific characteristics and criteria from them in order to identify the exact elements of false news and intentional misinformation in general. The study aims to establish, as thoroughly as possible, what fake news is and what it is not. To that end, qualitative research is used to exhibit and analyze the features of the term, concluding in a classification of the characteristics that most fake news incidents present. The proposed feature identification is then examined through specific fake news case studies. Finally, after a deeper understanding and verification of the characteristics of fake news, detection and mitigation actions are proposed, demonstrating the need both for technological development on the issue and for improving the public's digital skills, thereby providing an inclusive review of a little-studied term. Conclusions lead to the main remark of the thesis: the need for further quantitative and statistical research, as well as deeper theoretical study, to better decipher and ultimately resolve the issue of fake news.

    Corpus and sentiment analysis

    Get PDF
    EThOS - Electronic Theses Online Service, GB, United Kingdom