29 research outputs found

    Profiling Hate Speech Spreaders on Twitter

    Hate speech is defined as any public communication that disparages a person or a group by expressing hate or encouraging violence. By identifying the profiles of hate propagators, it is possible to limit the spread of hate speech and keep social networks healthier. In this study, I focused on Twitter. Simply analyzing the words in tweets is a good starting point for identifying hate speech and the people who spread it; however, we believe there is value in considering other expressions that commonly appear in tweets. The purpose of this study was to explore a variety of expressions and unveil a set of common patterns that could lead to identifying user profiles that promote hate speech on social media (Twitter).

    Phonetic Detection for Hate Speech Spreaders on Twitter

    Nowadays, hate messages have become an object of study on social media. Efficient and effective detection of hateful profiles draws on several scientific disciplines, such as computational linguistics and sociology. Here, we illustrate how we used lexical and phonetic features to determine whether an author spreads hate speech. This article presents a novel strategy for characterizing a Twitter profile based on the generation of lexical and phonetic user features that serve as input to a set of classifiers. The results are part of our participation in the Profiling Hate Speech Spreaders on Twitter task at PAN 2021, held at CLEF.
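    The abstract does not specify the exact feature set, so the following is only a minimal sketch of how lexical and phonetic features could feed a classifier, assuming scikit-learn and the jellyfish library (its metaphone encoder) as stand-ins for the authors' actual components:

```python
# Hypothetical reconstruction, not the authors' exact system: combine
# lexical TF-IDF features with TF-IDF over phonetic transcriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
import jellyfish

def phonetic_transcription(profile_text: str) -> str:
    # Map every word to its Metaphone code, keeping word order, so that
    # spelling variants that sound alike collapse to the same token.
    return " ".join(jellyfish.metaphone(w) for w in profile_text.split())

features = FeatureUnion([
    ("lexical", TfidfVectorizer(ngram_range=(1, 2))),
    ("phonetic", TfidfVectorizer(preprocessor=phonetic_transcription)),
])

clf = Pipeline([("features", features),
                ("clf", LogisticRegression(max_iter=1000))])

# profiles: one concatenated string of tweets per author (assumed input
# format); labels: 1 = hate speech spreader, 0 = not.
# clf.fit(profiles, labels)
```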

    Detection of Hate Speech Spreaders using convolutional neural networks

    In this paper we describe a deep learning model based on a Convolutional Neural Network (CNN). The model was developed for the Profiling Hate Speech Spreaders (HSSs) task proposed by the PAN 2021 organizers and hosted at the 2021 CLEF Conference. Our approach to classifying an author as an HSS or not (nHSS) takes advantage of a CNN based on a single convolutional layer. In this binary classification task, in tests performed using 5-fold cross-validation, the proposed model reaches a maximum accuracy of 0.80 on the multilingual (i.e., English and Spanish) training set and a minimum loss value of 0.51 on the same set. As announced by the task organizers, the trained model reaches an overall accuracy of 0.79 on the full test set, obtained by averaging the accuracy achieved by the model on the two languages. In particular, the organizers announced that our model achieves an accuracy of 0.85 on the Spanish test set and 0.73 on the English test set. Thanks to the model presented in this paper, our team won the 2021 PAN competition on profiling HSSs.
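    As a rough illustration of the architecture described (a CNN with a single convolutional layer for binary HSS/nHSS classification), here is a minimal Keras sketch; the vocabulary size, filter count, and dense layer width are assumptions, not the paper's configuration:

```python
# Minimal single-convolutional-layer CNN text classifier sketch.
import tensorflow as tf

VOCAB_SIZE, EMB_DIM = 20000, 100  # assumed hyperparameters

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),
    tf.keras.layers.Conv1D(128, 5, activation="relu"),  # the single conv layer
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # HSS vs. nHSS
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```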

    Deep Modeling of Latent Representations for Twitter Profiles on Hate Speech Spreaders Identification Task

    Full text link
    In this paper, we describe the system proposed by the UO-UPV team for addressing the Profiling Hate Speech Spreaders on Twitter task shared at PAN 2021. The system relies on a modular architecture that combines Deep Learning models with an introduced variant of the Impostor Method (IM). It receives a single profile composed of a fixed number of tweets. These posts are encoded as dense feature vectors using a fine-tuned transformer model and later combined to represent the whole profile. To classify a new profile as a hate speech spreader or not, the Impostor Method compares it, via a similarity function, against randomly sampled prototypical profiles. In the final evaluation phase, our model achieved 74% and 82% accuracy for the English and Spanish languages respectively, ranking our team in 2nd position and providing a starting point for further improvements.

    The work of the third author was carried out in the framework of the research project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31), funded by the Spanish Ministry of Science and Innovation, and DeepPattern (PROMETEO/2019/121), funded by the Generalitat Valenciana.

    Labadie Tamayo, R.; Castro Castro, D.; Ortega-Bueno, R. (2021). Deep Modeling of Latent Representations for Twitter Profiles on Hate Speech Spreaders Identification Task. CEUR. 2035-2046. http://hdl.handle.net/10251/190669
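    A minimal sketch of the profile encoding and the Impostor-Method-style decision described above, assuming sentence-transformers as the tweet encoder; the model name, averaging scheme, sampling count, and voting rule are illustrative assumptions rather than the UO-UPV system:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def encode_profile(tweets: list[str]) -> np.ndarray:
    # Encode the fixed set of tweets and average them into one
    # dense vector representing the whole profile.
    return encoder.encode(tweets).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_spreader(profile, hss_prototypes, nhss_prototypes, rng, k=10):
    # Compare the unknown profile against k randomly sampled prototypical
    # profiles from each class and vote on which side it is closer to.
    votes = 0
    for _ in range(k):
        h = hss_prototypes[rng.integers(len(hss_prototypes))]
        n = nhss_prototypes[rng.integers(len(nhss_prototypes))]
        votes += cosine(profile, h) > cosine(profile, n)
    return votes > k / 2

# Usage: rng = np.random.default_rng(0); prototypes are lists of vectors
# built with encode_profile from labeled training profiles.
```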

    Interpreting Attention-Based Models for Natural Language Processing

    Large pre-trained language models (PLMs) such as BERT and XLNet have revolutionized the field of natural language processing (NLP). Interestingly, they are pre-trained through unsupervised tasks, so there is a natural curiosity as to what linguistic knowledge these models have learned from unlabeled data alone. Fortunately, these models' architectures are based on self-attention mechanisms, which are naturally interpretable. As such, there is a growing body of work that uses attention to gain insight into what linguistic knowledge is possessed by these models. Most attention-focused studies use BERT as their subject, and consequently the field is sometimes referred to as BERTology. However, despite surpassing BERT on a large number of NLP tasks, XLNet has yet to receive the same level of attention (pun intended). Additionally, there is interest in the field in how these pre-trained models change when fine-tuned for supervised tasks. This paper details many different attention-based interpretability analyses and performs each on BERT, XLNet, and a version of XLNet fine-tuned for a Twitter hate-speech-spreader detection task. The purpose of doing so is (1) to provide a comprehensive summary of the current state of BERTology, (2) to be the first to perform many of these in-depth analyses on XLNet, and (3) to study how PLMs' attention patterns change over fine-tuning. I find that most identified linguistic phenomena present in the attention patterns of BERT are also present in those of XLNet to similar extents. Further, it is shown that much about the internal organization and function of PLMs, and how they change over fine-tuning, can be understood through attention.
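    For readers unfamiliar with how such analyses start, here is a minimal example of pulling per-head attention maps out of BERT with Hugging Face transformers; the aggregation shown (average attention mass each token receives) is one simple probe from the BERTology literature, not this paper's full methodology:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_attentions=True)

inputs = tokenizer("Attention is naturally interpretable.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer, each of shape
# (batch, heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]        # (heads, seq, seq)
received = last_layer.mean(dim=0).sum(dim=0)  # attention mass per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, w in zip(tokens, received):
    print(f"{tok:12s} {w.item():.3f}")
```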

    Presidential preferences in Colombia through Sentiment Analysis

    This work carries out sentiment analysis on the social network Twitter regarding the presidential debate of May 23, where a hashtag was left open so viewers could give their points of view on three candidates: Gustavo Petro, Federico Gutierrez, and Rodolfo Hernández. Once the tweets contained in the hashtag were extracted, they were manually classified and then pre-processed to remove special characters, links, URLs, images, and videos. Next, the TextVectorization layer from the TensorFlow library was used to convert the tweets to vectors, which were finally fed to the two models under comparison. BERT obtained the best results, with an accuracy of 76% and an F1 score of 85%.
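    The vectorization step referenced above maps each cleaned tweet to a fixed-length sequence of token ids; here is a short sketch of that step using TensorFlow's TextVectorization layer, with an assumed vocabulary size and sequence length:

```python
import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000,            # assumed vocabulary size
    output_mode="int",           # one integer id per token
    output_sequence_length=50,   # pad/truncate each tweet to 50 tokens
)

tweets = tf.constant([
    "ejemplo de tuit ya limpio",        # tweets after preprocessing
    "otro tuit sin enlaces ni emojis",
])
vectorizer.adapt(tweets)    # build the vocabulary from the corpus
print(vectorizer(tweets))   # (2, 50) tensor of token ids
```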

    PART: Pre-trained Authorship Representation Transformer

    Full text link
    Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Finding these details is very relevant for profiling authors, relating back to their gender, occupation, age, and so on. Most importantly, repeated writing patterns can help attribute authorship to a text. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. A better approach to this task is to learn stylometric representations, but this is itself an open research challenge. In this paper, we propose PART: a contrastively trained model fit to learn authorship embeddings instead of semantics. By comparing pairs of documents written by the same author, we are able to determine the authorship of a text by evaluating the cosine similarity of the documents under evaluation, a zero-shot generalization to authorship identification. To this end, a pre-trained Transformer with an LSTM head is trained with the contrastive training method. We train our model on a diverse set of authors from literature, anonymous blog posters, and corporate emails; a heterogeneous set with distinct and identifiable writing styles. The model is evaluated on these datasets, achieving a zero-shot accuracy of 72.39% and a top-5 accuracy of 86.73% on the joint evaluation dataset when determining authorship from a set of 250 different authors. We qualitatively assess the representations with different data visualizations on the available datasets, profiling features such as book type, gender, age, or occupation of the author.
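    The zero-shot attribution step reduces to ranking candidate authors by cosine similarity between embeddings; a minimal sketch follows, where `embed` stands in for the trained Transformer-plus-LSTM encoder, which is not reproduced here:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attribute(query_doc, reference_docs_by_author, embed, top_k=5):
    # Score the query against one known document per candidate author and
    # rank authors by stylistic similarity of the embeddings.
    q = embed(query_doc)
    scores = {author: cosine(q, embed(doc))
              for author, doc in reference_docs_by_author.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Top-1 of the returned list gives the zero-shot attribution; top_k=5
# corresponds to the top-5 accuracy reported above.
```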

    The text classification pipeline: Starting shallow, going deeper

    An increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective, is Text Classification (TC). In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction, and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Although languages such as Arabic, Chinese, and Hindi are employed in several works, from a computer science perspective the language most used and referred to in the TC literature is English, and it is also the language mainly referenced in the rest of this PhD thesis. Even though numerous machine learning techniques have shown outstanding results, a classifier's effectiveness depends on its capability to comprehend intricate relations and non-linear correlations in texts. To achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to the other stages of the TC pipeline. Within an NLP framework, a range of text representation techniques and model designs have emerged, including large language models, which are capable of turning massive amounts of text into useful vector representations that effectively capture semantically significant information.

    The fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval, is of crucial interest. These communities frequently overlap somewhat, but are mostly separate and do their research on their own. Bringing researchers from these groups together to improve the multidisciplinary comprehension of this field is one of the objectives of this dissertation. Additionally, this dissertation examines text mining from both a traditional and a modern perspective.

    This thesis covers the whole TC pipeline in detail; however, its main contribution is to investigate the impact of every element in the TC pipeline on the final performance of a TC model. The TC pipeline is discussed, including both traditional and the most recent deep learning-based models. This pipeline consists of the State-Of-The-Art (SOTA) datasets used in the literature as benchmarks, text preprocessing, text representation, machine learning models for TC, evaluation metrics, and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings obtained while performing experiments and introducing novel models. The advantages and disadvantages of various options are also listed, along with a thorough comparison of the various approaches. At the end of each chapter are my contributions, with experimental evaluations and discussions of the results that I obtained during my three-year PhD course. The experiments and the analysis related to each chapter (i.e., each element of the TC pipeline) are the main contributions that I provide, extending the basic knowledge of a regular survey on TC.
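    To make the pipeline stages concrete (dataset, preprocessing, representation, model, evaluation), here is a minimal end-to-end instance using a shallow scikit-learn baseline; it is a generic illustration of the pipeline's shape, not one of the thesis's models or benchmarks:

```python
# Dataset -> preprocessing -> representation -> model -> evaluation.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset="train",
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test",
                          remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    # Preprocessing + representation: lowercasing, stop-word removal, TF-IDF.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    # Model: a shallow linear classifier as the baseline stage.
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train.data, train.target)

# Evaluation metric: accuracy on the held-out test split.
print("accuracy:", accuracy_score(test.target, pipeline.predict(test.data)))
```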