    Emotion intensities in Tweets

    This paper examines the task of detecting intensity of emotion from text. We create the first datasets of tweets annotated for anger, fear, joy, and sadness intensities. We use a technique called best–worst scaling (BWS) that improves annotation consistency and obtains reliable fine-grained scores. We show that emotion-word hashtags often impact emotion intensity, usually conveying a more intense emotion. Finally, we create a benchmark regression system and conduct experiments to determine: which features are useful for detecting emotion intensity; and, the extent to which two emotions are similar in terms of how they manifest in language
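The abstract does not say how best–worst scaling judgments are turned into scores. A minimal sketch, assuming the common counting procedure in which an item's score is the share of times it was picked as best minus the share of times it was picked as worst. The function name `bws_scores` and the tuple layout are hypothetical, chosen only for illustration.

```python
from collections import Counter


def bws_scores(annotations):
    """Turn best-worst judgments into scores in [-1, 1].

    annotations: iterable of (items, best, worst) tuples, where `items`
    is the group of items shown to the annotator and `best` / `worst`
    are the two items the annotator picked from that group.
    (This layout is an assumption, not taken from the paper.)
    """
    best = Counter()
    worst = Counter()
    seen = Counter()
    for items, chosen_best, chosen_worst in annotations:
        seen.update(items)          # how often each item was shown
        best[chosen_best] += 1      # how often it was judged most intense
        worst[chosen_worst] += 1    # how often it was judged least intense
    # Counter returns 0 for items never chosen as best or worst.
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


judgments = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t5"), "t1", "t3"),
]
scores = bws_scores(judgments)
```

Under this scheme an item that is always chosen as best scores 1, one that is always chosen as worst scores -1, and items never singled out score 0.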

    Acquiring and Exploiting Lexical Knowledge for Twitter Sentiment Analysis

    The most popular sentiment analysis task in Twitter is the automatic classification of tweets into sentiment categories such as positive, negative, and neutral. State-of-the-art solutions to this problem are based on supervised machine learning models trained from manually annotated examples. These models are affected by label sparsity, because the manual annotation of tweets is labour-intensive and time-consuming. This thesis addresses the label sparsity problem for Twitter polarity classification by automatically building two types of resources that can be exploited when labelled data is scarce: opinion lexicons, which are lists of words labelled by sentiment, and synthetically labelled tweets. In the first part of the thesis, we induce Twitter-specific opinion lexicons by training word-level classifiers using representations that exploit different sources of information: (a) the morphological information conveyed by part-of-speech (POS) tags, (b) associations between words and the sentiment expressed in the tweets that contain them, and (c) distributional representations calculated from unlabelled tweets. Experimental results show that the induced lexicons produce significant improvements over existing manually annotated lexicons for tweet-level polarity classification. In the second part of the thesis, we develop distant supervision methods for generating synthetic training data for Twitter polarity classification by exploiting unlabelled tweets and prior lexical knowledge. Positive and negative training instances are generated by averaging unlabelled tweets annotated according to a given polarity lexicon. We study different mechanisms for selecting the candidate tweets to be averaged. Our experimental results show that the training data generated by the proposed models produce classifiers that perform significantly better than classifiers trained from tweets annotated with emoticons, a popular distant supervision approach for Twitter sentiment analysis
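The idea of averaging lexicon-annotated tweets into synthetic training instances can be illustrated roughly as follows. This is only a sketch under strong assumptions: tweets are pre-tokenized, the lexicon maps words to +1 or -1, a tweet's label is the sign of its summed word polarities, and candidates are picked by uniform random sampling. None of these choices, nor the names `lexicon_label` and `averaged_instance`, come from the thesis, which studies several selection mechanisms.

```python
import random
from collections import Counter


def lexicon_label(tokens, lexicon):
    """Label a tokenized tweet by the sign of its summed word polarities."""
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None  # no lexicon evidence either way


def averaged_instance(tweets, k, rng):
    """Average the bag-of-words vectors of k randomly sampled tweets.

    tweets: list of token lists that all share the same lexicon label.
    Returns a sparse vector as a dict {token: mean count}.
    """
    sample = rng.sample(tweets, k)
    totals = Counter()
    for tokens in sample:
        totals.update(tokens)
    return {token: count / k for token, count in totals.items()}


toy_lexicon = {"good": 1, "bad": -1}          # hypothetical lexicon
unlabelled = [["good", "day"], ["good", "good"], ["bad", "day"]]
positive = [t for t in unlabelled if lexicon_label(t, toy_lexicon) == "positive"]
instance = averaged_instance(positive, k=2, rng=random.Random(0))
```

Each averaged vector would then be used as one training instance carrying the shared label of the tweets it was built from.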

    Regional Convergence, Spatial Scale, and Spatial Dependence: Evidence from Homicides and Personal Injuries in Colombia 2010-2018

    This paper studies regional convergence and spatial dependence of homicides and personal injuries in Colombia. In particular, through the lens of both classical and distributional convergence frameworks, two spatial scales are contrasted: municipalities and states. For both homicides and personal injuries, sigma convergence is only found at the state level. In contrast, beta convergence is found at both state and municipal level. The distributional convergence framework highlights further contrasting patterns. For homicides at the state level, four convergence clusters are found, while two clusters are present at the municipal level. For personal injuries, at both spatial scales, two clusters are found. Moreover, significant and robust spatial autocorrelation is found only at the municipal level. Overall, these results re-emphasize the role of spatial disaggregation as well as spatial dependence when evaluating regional convergence and designing regional development policies. Lastly, a discussion of the previous results and their relation to current and future policies is also included
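For readers unfamiliar with the two classical tests named above, here is a minimal sketch using their textbook definitions: sigma convergence means the cross-sectional dispersion of the (log) variable shrinks over time, and beta convergence means regions with higher initial levels grow more slowly, i.e. the slope of growth on the initial log level is negative. The function names and the toy numbers are illustrative assumptions, not the paper's data or exact specification, and all values are assumed to be strictly positive so that logs are defined.

```python
import math
from statistics import mean, pstdev


def sigma_convergence(initial, final):
    """True if the dispersion of log values is lower in the final period."""
    log_initial = [math.log(v) for v in initial]
    log_final = [math.log(v) for v in final]
    return pstdev(log_final) < pstdev(log_initial)


def beta_slope(initial, final, years):
    """OLS slope of average annual log growth on the initial log level.

    A negative slope is the usual sign of (unconditional) beta convergence.
    """
    x = [math.log(v) for v in initial]
    y = [(math.log(f) - math.log(i)) / years for i, f in zip(initial, final)]
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Toy rates for four hypothetical regions, eight years apart.
rates_start = [10.0, 20.0, 40.0, 80.0]
rates_end = [20.0, 30.0, 45.0, 80.0]
has_sigma = sigma_convergence(rates_start, rates_end)
slope = beta_slope(rates_start, rates_end, years=8)
```

Beta convergence is necessary but not sufficient for sigma convergence, so a negative slope can coexist with non-declining dispersion, as the abstract reports at the municipal level.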

    ALBETO and DistilBETO: Lightweight Spanish Language Models

    In recent years there have been considerable advances in pre-trained language models, where non-English language versions have also been made available. Due to their increasing use, many lightweight versions of these models (with reduced parameters) have also been released to speed up training and inference times. However, versions of these lighter models (e.g., ALBERT, DistilBERT) for languages other than English are still scarce. In this paper we present ALBETO and DistilBETO, which are versions of ALBERT and DistilBERT pre-trained exclusively on Spanish corpora. We train several versions of ALBETO ranging from 5M to 223M parameters and one of DistilBETO with 67M parameters. We evaluate our models on the GLUES benchmark that includes various natural language understanding tasks in Spanish. The results show that our lightweight models achieve results competitive with those of BETO (Spanish-BERT) despite having fewer parameters. More specifically, our larger ALBETO model outperforms all other models on the MLDoc, PAWS-X, XNLI, MLQA, SQAC and XQuAD datasets. However, BETO remains unbeaten for POS and NER. As a further contribution, all models are publicly available to the community for future research. Comment: Accepted paper at LREC202

    Identifying Customer Preferences about Tourism Products Using an Aspect-based Opinion Mining Approach

    In this study we extend Bing Liu's aspect-based opinion mining technique to apply it to the tourism domain. Using this extension, we also offer an approach for considering a new alternative to discover consumer preferences about tourism products, particularly hotels and restaurants, using opinions available on the Web as reviews. An experiment is also conducted, using hotel and restaurant reviews obtained from TripAdvisor, to evaluate our proposals. Results showed that tourism product reviews available on web sites contain valuable information about customer preferences that can be extracted using an aspect-based opinion mining approach. The proposed approach proved to be very effective in determining the sentiment orientation of opinions, achieving a precision and recall of 90%. However, on average, the algorithms were only capable of extracting 35% of the explicit aspect expressions

    WASSA-2017 shared task on emotion intensity

    We present the first shared task on detecting the intensity of emotion felt by the speaker of a tweet. We create the first datasets of tweets annotated for anger, fear, joy, and sadness intensities using a technique called best–worst scaling (BWS). We show that the annotations lead to reliable fine-grained intensity scores (rankings of tweets by intensity). The data was partitioned into training, development, and test sets for the competition. Twenty-two teams participated in the shared task, with the best system obtaining a Pearson correlation of 0.747 with the gold intensity scores. We summarize the machine learning setups, resources, and tools used by the participating teams, with a focus on the techniques and resources that are particularly useful for the task. The emotion intensity dataset and the shared task are helping improve our understanding of how we convey more or less intense emotions through language
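The evaluation measure mentioned above, Pearson correlation between predicted and gold intensity scores, can be computed as in the sketch below. The function name `pearson_r` and the toy scores are illustrative assumptions and are not taken from the shared task's official scorer.

```python
import math


def pearson_r(predicted, gold):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(gold) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_g)


# Toy intensity scores in [0, 1] for five hypothetical tweets.
gold_scores = [0.10, 0.35, 0.50, 0.70, 0.90]
system_scores = [0.20, 0.30, 0.55, 0.60, 0.95]
r = pearson_r(system_scores, gold_scores)
```

Because the coefficient is invariant to shifting and rescaling the predictions, a system is rewarded for ranking and spacing tweets well rather than for matching the absolute scale of the gold scores.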

    Regional Convergence and Spatial Dependence across Subnational Regions of ASEAN: Evidence from Satellite Nighttime Light Data

    Satellite nighttime light data are increasingly used for evaluating the performance of economies in which official statistics are non-existent, limited, or non-comparable. In this paper, we use a novel luminosity-based measure of GDP per capita to study regional convergence and spatial dependence across 274 subnational regions of the Association of South East Asian Nations (ASEAN) over the 1998-2012 period. Specifically, we first evaluate the usefulness of this new luminosity indicator in the context of ASEAN regions. Results show that almost 60 percent of the differences in (official) GDP per capita can be predicted by this luminosity-based measure of GDP. Next, given its potential usefulness for predicting regional GDP, we evaluate the spatio-temporal dynamics of regional inequality across ASEAN. Results indicate that although there is an overall (average) process of regional convergence, regional inequality within most countries has not significantly decreased. When evaluating the patterns of spatial dependence, we find increasing spatial dependence over time and stable spatial clusters (hotspots and coldspots) that are located across multiple national boundaries. Taken together, these results provide a new and more disaggregated perspective of the integration process of the ASEAN community
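The abstract does not name the statistic used to measure spatial dependence. A common choice, assumed here purely for illustration, is global Moran's I, which is positive when neighbouring regions have similar values and negative when they differ. The function name `morans_i`, the dictionary-based weights, and the four-region chain are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of regional values.

    weights: dict mapping (i, j) index pairs to a spatial weight w_ij,
    e.g. 1.0 when regions i and j share a border (an assumed scheme).
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    total_weight = sum(weights.values())
    # Weighted sum of cross-products of deviations between neighbours.
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Four hypothetical regions arranged in a line: 0 - 1 - 2 - 3.
chain = {(0, 1): 1.0, (1, 0): 1.0,
         (1, 2): 1.0, (2, 1): 1.0,
         (2, 3): 1.0, (3, 2): 1.0}
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)         # similar neighbours
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # dissimilar neighbours
```

Tracking such a statistic year by year is one way to express "increasing spatial dependence over time"; locating hotspots and coldspots would additionally require a local indicator, which this sketch does not cover.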

    Hybrid Hashtags: #YouKnowYoureAKiwiWhen Your Tweet Contains Māori and English

    Twitter constitutes a rich resource for investigating language contact phenomena. In this paper, we report findings from the analysis of a large-scale diachronic corpus of over one million tweets, containing loanwords from te reo Māori, the indigenous language spoken in New Zealand, into (primarily, New Zealand) English. Our analysis focuses on hashtags comprising mixed-language resources (which we term hybrid hashtags), bringing together descriptive linguistic tools (investigating length, word class, and semantic domains of the hashtags) and quantitative methods (Random Forests and regression analysis). Our work has implications for language change and the study of loanwords (we argue that hybrid hashtags can be linked to loanword entrenchment), and for the study of language on social media (we challenge proposals of hashtags as “words,” and show that hashtags have a dual discourse role: a micro-function within the immediate linguistic context in which they occur and a macro-function within the tweet as a whole)