348 research outputs found

    Relationship Between Personality Patterns and Harmfulness: Analysis and Prediction Based on Sentence Embedding

    Get PDF
    This paper hypothesizes that harmful utterances must be judged in the context of whole sentences, and the authors extract features of harmful expressions using a general-purpose language model. Based on the extracted features, they propose a method to predict the presence or absence of harmful categories. They further argue that users who incite others can be analyzed by combining this method with research on inferring a speaker's personality from statements on social networking sites. The results confirm that the proposed method judges the possibility of harmful comments more accurately than simple dictionary-based models or models using distributed word representations. An analysis based on the harmfulness judgment model also confirmed the relationship between personality patterns and harmful expressions.
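    The sentence-level pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors are random placeholders standing in for sentence embeddings from a general-purpose language model, and the dimensions and labels are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder sentence embeddings (in practice: whole-sentence vectors
# from a general-purpose language model), one row per sentence.
rng = np.random.default_rng(0)
X_harmful = rng.normal(loc=1.0, size=(50, 16))   # sentences labeled harmful
X_benign = rng.normal(loc=-1.0, size=(50, 16))   # sentences labeled benign
X = np.vstack([X_harmful, X_benign])
y = np.array([1] * 50 + [0] * 50)  # 1 = harmful category present

# Binary classifier predicting presence/absence of a harmful category
# from whole-sentence features, rather than from a word dictionary.
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```

    Because the classifier sees the whole sentence embedding, it can pick up contextual cues that a word-level dictionary lookup would miss, which is the contrast the abstract draws.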

    Dynamics of online hate and misinformation

    Get PDF
    Online debates are often characterised by extreme polarisation and heated discussions among users. The presence of hate speech online is becoming increasingly problematic, making the development of appropriate countermeasures necessary. In this work, we perform hate speech detection on a corpus of more than one million comments on YouTube videos through a machine learning model, trained and fine-tuned on a large set of hand-annotated data. Our analysis shows no evidence of “pure haters”, meant as active users posting exclusively hateful comments. Moreover, coherently with the echo chamber hypothesis, we find that users skewed towards one of the two categories of video channels (questionable, reliable) are more prone to use inappropriate, violent, or hateful language within their opponents’ community. Interestingly, users loyal to reliable sources use, on average, more toxic language than their counterparts. Finally, we find that the overall toxicity of a discussion increases with its length, measured both in number of comments and in time. Our results show that, coherently with Godwin’s law, online debates tend to degenerate towards increasingly toxic exchanges of views.
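    The “pure hater” check amounts to asking whether any sufficiently active user posts exclusively hateful comments. A toy version of that computation, with invented comment data and an invented activity threshold standing in for the paper's model-labeled YouTube corpus:

```python
from collections import defaultdict

# Toy data: (user, is_hateful) pairs standing in for model-labeled comments.
comments = [
    ("alice", False), ("alice", True), ("alice", False),
    ("bob", False), ("bob", False),
    ("carol", True), ("carol", False), ("carol", True), ("carol", False),
]

totals = defaultdict(int)
hateful = defaultdict(int)
for user, is_hate in comments:
    totals[user] += 1
    hateful[user] += is_hate

# A "pure hater" would be an active user whose comments are all hateful.
MIN_ACTIVITY = 3  # only judge users with enough comments (assumed threshold)
pure_haters = [u for u in totals
               if totals[u] >= MIN_ACTIVITY and hateful[u] == totals[u]]
print(pure_haters)
```

    In this toy sample, as in the paper's corpus, hateful comments are spread across users who mostly post non-hateful content, so the list comes out empty.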

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Get PDF

    Sentiment polarity shifters: creating lexical resources through manual annotation and bootstrapped machine learning

    Get PDF
    Alleviating pain is good and abandoning hope is bad. We instinctively understand how words like "alleviate" and "abandon" affect the polarity of a phrase, inverting or weakening it. When these words are content words, such as verbs, nouns and adjectives, we refer to them as polarity shifters. Shifters are a frequent occurrence in human language and an important part of successfully modeling negation in sentiment analysis; yet research on negation modeling has focussed almost exclusively on a small handful of closed-class negation words, such as "not", "no" and "without". A major reason for this is that shifters are far more lexically diverse than negation words, yet no resources exist to help identify them. We seek to remedy this lack of shifter resources. Our most central step towards this is the creation of a large lexicon of polarity shifters that covers verbs, nouns and adjectives. To reduce the prohibitive cost of such a large annotation task, we develop a bootstrapping approach that combines automatic classification with human verification. This ensures the high quality of our lexicon while reducing annotation cost by over 70%. In designing the bootstrap classifier we develop a variety of features which use both existing semantic resources and linguistically informed text patterns. In addition, we investigate how knowledge about polarity shifters might be shared across different parts of speech, highlighting both the potential and limitations of such an approach. The applicability of our bootstrapping approach extends beyond the creation of a single resource. We show how it can further be used to introduce polarity shifter resources for other languages. Through the example case of German we show that all our features are transferable to other languages. Keeping in mind the requirements of under-resourced languages, we also explore how well a classifier would do when relying only on data-driven, but not resource-driven, features.
    We also introduce ways to use cross-lingual information, leveraging the shifter resources we previously created for other languages. Apart from the general question of which words can be polarity shifters, we also explore a number of other factors. One of these is the matter of shifting direction, which indicates whether a shifter affects positive polarities, negative polarities or can shift in either direction. Using a supervised classifier we add shifting direction information to our bootstrapped lexicon. For other aspects of polarity shifting, manual annotation is preferable to automatic classification. Not every word that can cause polarity shifting does so in all of its word senses. As word sense disambiguation technology is not robust enough to allow the automatic handling of such nuances, we manually create a complete sense-level annotation of verbal polarity shifters. To verify the usefulness of the lexica we create, we provide an extrinsic evaluation in which we apply them to a sentiment analysis task. In this task the different lexica are not only compared amongst each other, but also against a state-of-the-art compositional polarity neural network classifier that has been shown to implicitly learn the negating effect of negation words from a training corpus. However, we find that the same is not true for the far more lexically diverse polarity shifters. Instead, the use of the explicit knowledge provided by our shifter lexica brings clear gains in performance. Deutsche Forschungsgesellschaft
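    The bootstrapping idea, automatic classification proposing candidates and a human verifying only the top-ranked ones, can be sketched with a toy loop. Everything here is invented for illustration: the scoring function stands in for the thesis's feature-based classifier, and a fixed gold set stands in for the human annotator.

```python
# Mock "human annotator": a gold set of known polarity shifters.
GOLD_SHIFTERS = {"alleviate", "abandon", "prevent", "lack"}

def classifier_score(word):
    # Stand-in for a bootstrap classifier built on semantic resources and
    # text patterns; here, true shifters simply happen to score higher.
    return 0.9 if word in GOLD_SHIFTERS else 0.2

candidates = ["alleviate", "table", "abandon", "run", "prevent", "lack", "blue"]

# Rank all candidates automatically, then hand only small top-ranked
# batches to the human verifier, instead of annotating the full vocabulary.
lexicon = set()
batch_size = 3
remaining = sorted(candidates, key=classifier_score, reverse=True)
while remaining:
    batch, remaining = remaining[:batch_size], remaining[batch_size:]
    lexicon.update(w for w in batch if w in GOLD_SHIFTERS)  # human check

print(sorted(lexicon))
```

    The cost saving comes from the ranking: the human mostly sees likely shifters, so far fewer annotation decisions are wasted on obvious non-shifters, which is how the thesis reports cutting annotation cost by over 70%.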

    Offensive Language Detection in Tweets Using Machine Learning Methods

    Get PDF
    Undoubtedly, offensive language has become ubiquitous in social media over the last years due to the increasing popularity of social media platforms. The growing number of users who tend to post offensive content targeting individuals or groups has led to significant repercussions not only for the well-being of the targets, but also for society itself. This has raised concern among governments, social media companies, and academic and social communities, who have made concerted efforts to curb the dissemination of offensive language online and create a safer digital space. Nevertheless, despite their endeavors, the need to rapidly process huge amounts of content in order to detect and report offensive language has made the development of machine learning systems more than imperative. Consequently, in the present thesis, three different machine learning models, which perform binary text classification, are introduced to detect offensive language in English texts from Twitter. The proposed models, which comprise two simple classifiers and a Bidirectional Stacked LSTM, utilize contextual embeddings pooled from BERT-Large-Uncased by fine-tuning its various layers on four training datasets combined into one. The data preparation process involved data cleaning and preprocessing as well as down-sampling to handle class imbalance. The effectiveness of the proposed methods is evaluated on two available test sets, OLID 2019 and OLID 2020, based on six metrics as well as the learning curves of loss and accuracy.
    Comparative analysis between these methods demonstrates that the concatenation of the last four hidden layers of BERT fed into a classifier outperforms the other models, achieving 77.8% and 86.8% Macro-F1 scores on the two test sets respectively. Comparison with previous related methods indicates that, although the results are satisfactory, there is room for further experimentation and improvement in the future.
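    The winning feature construction, concatenating BERT's last four hidden layers before classification, can be sketched with numpy stand-ins. The arrays below are random placeholders for the per-layer hidden states a fine-tuned BERT-Large would return; the sequence length and mean pooling are illustrative assumptions, while the hidden size (1024) and layer count (24) are BERT-Large's actual dimensions.

```python
import numpy as np

HIDDEN = 1024   # BERT-Large hidden size
LAYERS = 24     # number of transformer layers in BERT-Large
SEQ_LEN = 32    # tokens in one (padded) tweet — illustrative

# Stand-in for the per-layer hidden states for one tweet:
# a list of (seq_len, hidden) arrays, one per layer.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(SEQ_LEN, HIDDEN)) for _ in range(LAYERS)]

# Concatenate the last four hidden layers along the feature axis,
# then pool over tokens to get one fixed-size vector per tweet.
last_four = np.concatenate(hidden_states[-4:], axis=-1)  # (SEQ_LEN, 4*HIDDEN)
features = last_four.mean(axis=0)                        # (4*HIDDEN,)
print(features.shape)  # (4096,)
```

    The resulting 4096-dimensional vector is what gets fed into the downstream classifier; using several top layers rather than only the final one exposes the classifier to multiple levels of contextual representation.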

    Fake News in the era of online intentional misinformation; a review of existing approaches

    Get PDF
    Thesis, University of Macedonia, Thessaloniki, 2019. Fake news is probably one of the most discussed issues of recent years. The term gained greater legitimacy after Collins Dictionary named it word of the year, citing its “ubiquitous presence” over 2017. However, the fake news issue has not yet been deeply researched. This thesis therefore gathers definitions of the term “fake news” from the literature and extracts specific characteristics and criteria from them in order to identify the exact elements of false news and intentional misinformation in general. The study aims to establish, as thoroughly as possible, what fake news is and what it is not. To that end, qualitative research is used to exhibit and analyze the features of the term, concluding in a classification of the characteristics that most fake news incidents present. The proposed feature identification is then examined through specific fake news case studies. Finally, after a deeper understanding and verification of the characteristics of fake news, detection and mitigation actions are proposed, demonstrating the need both for technological development on the issue and for improving the public's digital skills, thereby providing an inclusive review of a little-studied term. Conclusions lead to the main remark of the thesis: the need for further quantitative and statistical research, as well as deeper theoretical study, to better decipher and ultimately resolve the issue of fake news.

    Corpus and sentiment analysis

    Get PDF
    EThOS - Electronic Theses Online Service, GB, United Kingdom