1,337 research outputs found

    Using Word Embeddings in Twitter Election Classification

    Get PDF
    Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to train and generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, the context window size and the dimensionality of word embeddings on the classification performance. By comparing the classification results of two word embedding models, which are trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data type should align with the Twitter classification dataset to achieve a better performance. Moreover, by evaluating the results of word embeddings models trained using various context window sizes and dimensionalities, we found that large context window and dimension sizes are preferable to improve the performance. Our experimental results also show that using word embeddings and CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings

    Using word embeddings in Twitter election classification

    Get PDF
    Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models that have been trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset both in data type and time period to achieve significantly better performance compared to baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that large context window and dimension sizes are preferable to improve the performance. However, the number of negative samples parameter does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple OOV strategy to randomly initialise the OOV words without any prior knowledge is sufficient to attain a good classification performance among the current OOV strategies (e.g. a random initialisation using statistics of the pre-trained word embedding models)

    On the Reproducibility and Generalisation of the Linear Transformation of Word Embeddings

    Get PDF
    Linear transformation is a way to learn a linear relationship between two word embeddings, such that words in the two different embedding spaces can be semantically related. In this paper, we examine the reproducibility and generalisation of the linear transformation of word embeddings. Linear transformation is particularly useful when translating word embedding models in different languages, since it can capture the semantic relationships between two models. We first reproduce two linear transformation approaches, a recent one using orthogonal transformation and the original one using simple matrix transformation. Previous findings on a machine translation task are re-examined, validating that linear transformation is indeed an effective way to transform word embedding models in different languages. In particular, we show that the orthogonal transformation can better relate the different embedding models. Following the verification of previous findings, we then study the generalisation of linear transformation in a multi-language Twitter election classification task. We observe that the orthogonal transformation outperforms the matrix transformation. In particular, it significantly outperforms the random classifier by at least 10% under the F1 metric across English and Spanish datasets. In addition, we also provide best practices when using linear transformation for multi-language Twitter election classification

    Transfer Learning for Multi-language Twitter Election Classification

    Get PDF
    Both politicians and citizens are increasingly embracing social media as a means to disseminate information and comment on various topics, particularly during significant political events, such as elections. Such commentary during elections is also of interest to social scientists and pollsters. To facilitate the study of social media during elections, there is a need to automatically identify posts that are topically related to those elections. However, current studies have focused on elections within English-speaking regions, and hence the resultant election content classifiers are only applicable for elections in countries where the predominant language is English. On the other hand, as social media is becoming more prevalent worldwide, there is an increasing need for election classifiers that can be generalised across different languages, without building a training dataset for each election. In this paper, based upon transfer learning, we study the development of effective and reusable election classifiers for use on social media across multiple languages. We combine transfer learning with different classifiers such as Support Vector Machines (SVM) and state-of-the-art Convolutional Neural Networks (CNN), which make use of word embedding representations for each social media post. We generalise the learned classifier models for cross-language classification by using a linear translation approach to map the word embedding vectors from one language into another. Experiments conducted over two election datasets in different languages show that without using any training data from the target language, linear translations outperform a classical transfer learning approach, namely Transfer Component Analysis (TCA), by 80% in recall and 25% in F1 measure

    How Polarized Have We Become? A Multimodal Classification of Trump Followers and Clinton Followers

    Full text link
    Polarization in American politics has been extensively documented and analyzed for decades, and the phenomenon became all the more apparent during the 2016 presidential election, where Trump and Clinton depicted two radically different pictures of America. Inspired by this gaping polarization and the extensive utilization of Twitter during the 2016 presidential campaign, in this paper we take the first step in measuring polarization in social media and we attempt to predict individuals' Twitter following behavior through analyzing ones' everyday tweets, profile images and posted pictures. As such, we treat polarization as a classification problem and study to what extent Trump followers and Clinton followers on Twitter can be distinguished, which in turn serves as a metric of polarization in general. We apply LSTM to processing tweet features and we extract visual features using the VGG neural network. Integrating these two sets of features boosts the overall performance. We are able to achieve an accuracy of 69%, suggesting that the high degree of polarization recorded in the literature has started to manifest itself in social media as well.Comment: 16 pages, SocInfo 2017, 9th International Conference on Social Informatic

    Measuring relative opinion from location-based social media: A case study of the 2016 U.S. presidential election

    Get PDF
    Social media has become an emerging alternative to opinion polls for public opinion collection, while it is still posing many challenges as a passive data source, such as structurelessness, quantifiability, and representativeness. Social media data with geotags provide new opportunities to unveil the geographic locations of users expressing their opinions. This paper aims to answer two questions: 1) whether quantifiable measurement of public opinion can be obtained from social media and 2) whether it can produce better or complementary measures compared to opinion polls. This research proposes a novel approach to measure the relative opinion of Twitter users towards public issues in order to accommodate more complex opinion structures and take advantage of the geography pertaining to the public issues. To ensure that this new measure is technically feasible, a modeling framework is developed including building a training dataset by adopting a state-of-the-art approach and devising a new deep learning method called Opinion-Oriented Word Embedding. With a case study of the tweets selected for the 2016 U.S. presidential election, we demonstrate the predictive superiority of our relative opinion approach and we show how it can aid visual analytics and support opinion predictions. Although the relative opinion measure is proved to be more robust compared to polling, our study also suggests that the former can advantageously complement the later in opinion prediction

    Detecting and Monitoring Hate Speech in Twitter

    Get PDF
    Social Media are sensors in the real world that can be used to measure the pulse of societies. However, the massive and unfiltered feed of messages posted in social media is a phenomenon that nowadays raises social alarms, especially when these messages contain hate speech targeted to a specific individual or group. In this context, governments and non-governmental organizations (NGOs) are concerned about the possible negative impact that these messages can have on individuals or on the society. In this paper, we present HaterNet, an intelligent system currently being used by the Spanish National Office Against Hate Crimes of the Spanish State Secretariat for Security that identifies and monitors the evolution of hate speech in Twitter. The contributions of this research are many-fold: (1) It introduces the first intelligent system that monitors and visualizes, using social network analysis techniques, hate speech in Social Media. (2) It introduces a novel public dataset on hate speech in Spanish consisting of 6000 expert-labeled tweets. (3) It compares several classification approaches based on different document representation strategies and text classification models. (4) The best approach consists of a combination of a LTSM+MLP neural network that takes as input the tweet’s word, emoji, and expression tokens’ embeddings enriched by the tf-idf, and obtains an area under the curve (AUC) of 0.828 on our dataset, outperforming previous methods presented in the literatureThe work by Quijano-Sanchez was supported by the Spanish Ministry of Science and Innovation grant FJCI-2016-28855. The research of Liberatore was supported by the Government of Spain, grant MTM2015-65803-R, and by the European Union’s Horizon 2020 Research and Innovation Programme, under the Marie Sklodowska-Curie grant agreement No. 691161 (GEOSAFE). All the financial support is gratefully acknowledge

    When Politicians Talk About Politics: Identifying Political Tweets of Brazilian Congressmen

    Full text link
    Since June 2013, when Brazil faced the largest and most significant mass protests in a generation, a political crisis is in course. In midst of this crisis, Brazilian politicians use social media to communicate with the electorate in order to retain or to grow their political capital. The problem is that many controversial topics are in course and deputies may prefer to avoid such themes in their messages. To characterize this behavior, we propose a method to accurately identify political and non-political tweets independently of the deputy who posted it and of the time it was posted. Moreover, we collected tweets of all congressmen who were active on Twitter and worked in the Brazilian parliament from October 2013 to October 2017. To evaluate our method, we used word clouds and a topic model to identify the main political and non-political latent topics in parliamentarian tweets. Both results indicate that our proposal is able to accurately distinguish political from non-political tweets. Moreover, our analyses revealed a striking fact: more than half of the messages posted by Brazilian deputies are non-political.Comment: 4 pages, 7 figures, 2 table

    MoralStrength: Exploiting a Moral Lexicon and Embedding Similarity for Moral Foundations Prediction

    Get PDF
    Moral rhetoric plays a fundamental role in how we perceive and interpret the information we receive, greatly influencing our decision-making process. Especially when it comes to controversial social and political issues, our opinions and attitudes are hardly ever based on evidence alone. The Moral Foundations Dictionary (MFD) was developed to operationalize moral values in the text. In this study, we present MoralStrength, a lexicon of approximately 1,000 lemmas, obtained as an extension of the Moral Foundations Dictionary, based on WordNet synsets. Moreover, for each lemma it provides with a crowdsourced numeric assessment of Moral Valence, indicating the strength with which a lemma is expressing the specific value. We evaluated the predictive potentials of this moral lexicon, defining three utilization approaches of increased complexity, ranging from lemmas' statistical properties to a deep learning approach of word embeddings based on semantic similarity. Logistic regression models trained on the features extracted from MoralStrength, significantly outperformed the current state-of-the-art, reaching an F1-score of 87.6% over the previous 62.4% (p-value<0.01), and an average F1-Score of 86.25% over six different datasets. Such findings pave the way for further research, allowing for an in-depth understanding of moral narratives in text for a wide range of social issues
    • …
    corecore