    Active learning in annotating micro-blogs dealing with e-reputation

    Elections unleash strong political views on Twitter, but what do people really think about politics? Opinion and trend mining on micro-blogs dealing with politics has recently attracted researchers in several fields, including Information Retrieval and Machine Learning (ML). Since the performance of ML and Natural Language Processing (NLP) approaches is limited by the amount and quality of data available, one promising alternative for some tasks is the automatic propagation of expert annotations. This paper develops a so-called active learning process for automatically annotating French-language tweets that deal with the image (i.e., representation, web reputation) of politicians. Our main focus is on the methodology followed to build an original annotated dataset expressing opinions about two French politicians over time. We therefore review state-of-the-art NLP-based ML algorithms to automatically annotate tweets using a manual initiation step as bootstrap. This paper focuses on key issues of active learning while building a large annotated dataset in the presence of noise, which is introduced by human annotators, the abundance of data, and the label distribution across data and entities. In turn, we show that Twitter characteristics such as the author's name or hashtags can serve as an anchor point not only to improve automatic systems for Opinion Mining (OM) and Topic Classification but also to reduce noise in human annotations. However, a later thorough analysis shows that reducing noise might induce the loss of crucial information. Comment: Journal of Interdisciplinary Methodologies and Issues in Science - Vol 3 - Contextualisation digitale - 201

    A Multi-label Classification System to Distinguish among Fake, Satirical, Objective and Legitimate News in Brazilian Portuguese

    The diffusion of fake news has increased significantly worldwide, especially around the political class, where misinformation can be propagated and surfaces in election debates around the world. However, news with a recreational purpose, such as satirical news, is often confused with objective fake news. In this work, we address the differences between the objectivity and the legitimacy of news documents, where each article is treated as belonging to two conceptual classes: objective/satirical and legitimate/fake. We therefore propose a DSS (Decision Support System) based on a Text Mining (TM) pipeline with a set of novel textual features, using multi-label methods to classify news articles on these two domains. To this end, a set of multi-label methods was evaluated with a combination of different base classifiers and then compared with a multi-class approach. A set of real-life news data was also collected from several Brazilian news portals for these experiments. The results show our DSS to be adequate (0.80 f1-score) when addressing the challenging scenario of misleading news from the multi-label perspective, where the proposed method outperforms the multi-class methods (0.01 f1-score). Moreover, we analyzed how each group of stylometric features used in the experiments influences the result, aiming to discover whether a particular group is more relevant than others. As a result, it was noted that the complexity group of features could be more relevant than the others.
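The difference between the two label framings mentioned in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the function names and the integer encoding are assumptions made here for illustration. It only shows how one article can be described either by two binary labels (multi-label) or by a single class drawn from the four combinations (multi-class).

```python
# Illustrative label encoding only; names and integer codes are assumptions,
# not taken from the paper.
OBJECTIVITY = ("objective", "satirical")
LEGITIMACY = ("legitimate", "fake")


def to_multilabel(objectivity, legitimacy):
    """Encode an article as two independent binary labels."""
    return (OBJECTIVITY.index(objectivity), LEGITIMACY.index(legitimacy))


def to_multiclass(objectivity, legitimacy):
    """Encode the same article as one of four combined classes (0..3)."""
    obj_bit, leg_bit = to_multilabel(objectivity, legitimacy)
    return obj_bit * len(LEGITIMACY) + leg_bit


def from_multiclass(class_id):
    """Recover the two conceptual labels from a combined class id."""
    obj_bit, leg_bit = divmod(class_id, len(LEGITIMACY))
    return OBJECTIVITY[obj_bit], LEGITIMACY[leg_bit]


if __name__ == "__main__":
    print(to_multilabel("satirical", "fake"))
    print(to_multiclass("satirical", "legitimate"))
    print(from_multiclass(3))
```

In the multi-label framing each of the two labels can be predicted by its own base classifier, whereas the multi-class framing asks a single classifier to separate all four combinations at once.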

    Discussion of "A review of data science in business and industry and a future view" by Grazia Vicario and Shirley Coleman

    Ministerio de Economía y Competitividad, DPI2017-82896-C2-1-R. Ferrer, A. (2020). Discussion of "A review of data science in business and industry and a future view" by Grazia Vicario and Shirley Coleman. Applied Stochastic Models in Business and Industry. 36(1):23-29. https://doi.org/10.1002/asmb.2516

    Text Analytics Methods for Sentence-level Sentiment Analysis

    Opinions have important effects on the process of decision making. With the explosion of text information on networks, sentiment analysis, which aims at predicting the opinions of people about specific entities, has become a popular tool for making sense of vast amounts of text. There are multiple approaches to sentence-level sentiment analysis, including machine-learning methods and lexicon-based methods. In this MSc thesis we studied two typical sentiment analysis techniques, AFINN and RNTN, which represent lexicon-based and machine-learning methods, respectively. The assumption of a lexicon-based method is that the sum of the sentiment orientations of each word or phrase predicts the contextual sentiment polarity. AFINN is a word list with sentiment strength ranging from -5 to +5, constructed to include Internet slang and obscene words. With AFINN, we extract sentiment words from sentences and assign sentiment scores to these words. The sentiment of a sentence is aggregated as the sum of the scores of all its words. The Stanford Sentiment Treebank is a corpus with labeled parse trees, which gives the community the possibility to train compositional models based on supervised machine learning techniques. The labels of the Stanford Sentiment Treebank involve 5 categories: negative, somewhat negative, neutral, somewhat positive and positive. Compared to the standard recursive neural network (RNN) and the Matrix-Vector RNN, the Recursive Neural Tensor Network (RNTN) is a more powerful composition model for computing compositional vector representations of input sentences. Relying on the Stanford Sentiment Treebank, RNTN can predict the sentiment of input sentences from its computed vector representations. Using benchmark datasets that cover diverse data sources, we carry out a thorough comparison between AFINN and RNTN.
    Our results highlight that although RNTN is much more complicated than AFINN, the performance of RNTN is not better than that of AFINN. To some extent, AFINN is simpler, more generic and takes fewer computational resources than RNTN in sentiment analysis.
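The lexicon-based scoring described in this abstract (the sentence score is the sum of the per-word scores) can be sketched as follows. This is a minimal illustration, not the thesis code: the mini-lexicon, its scores, the tokenizer and the function names are all assumptions made here, and the entries are not the real AFINN word list.

```python
import re

# Hypothetical mini-lexicon for illustration only. These are NOT the real
# AFINN entries; the real list assigns integer strengths from -5 to +5.
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "awful": -3, "boring": -2}


def sentence_score(sentence, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all words; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def polarity(score):
    """Map an aggregated score to a coarse polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ("A great film, not boring at all", "Awful and bad", "The film"):
        score = sentence_score(text)
        print(text, score, polarity(score))
```

The first sentence shows a known limit of plain summation: "not boring" still contributes the negative score of "boring", because word order and negation are ignored.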

    Learning Representations of Social Media Users

    User representations are routinely used in recommendation systems by platform developers, in targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly: how does one extract a stable user representation, effective for many downstream tasks, from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, or make more accurate recommendations in a recommendation system?), but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them on three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into, and improve, existing classifiers in the multitask learning framework. We treat user representations (ground truth gender and mental health features) as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message. Comment: PhD thesi

    Sentiment Analysis with Machine Learning (Tunneanalyysi koneoppimisen avulla)

    In recent years, social media and TV production have become strongly linked. The most popular social media platform in the TV industry is Twitter, where over a million tweets are shared in one day. Tweet content is feedback straight from the viewers, and might include more valuable information than individual surveys. Going through millions of tweets manually is hard or impossible. This thesis studies how to teach a machine, in a supervised manner, to analyze tweets. The machine analyzes sentiments based on the features that the tweets include. The main goal of this thesis is to clarify how the content can be received, prepared, extracted and classified. The study indicates that sentiments can be captured from Twitter data using mathematical patterns. The thesis is divided into 5 chapters. Chapter 1 is the introduction to sentiment analysis with machine learning. Chapter 2 is the literature study, where elements and techniques are explored. Chapter 3 is the implementation part, where the selected classification methods and techniques for text data are specified. Chapter 4 covers the results and chapter 5 finishes the work with conclusions.