New methodologies to evaluate the consistency of emoji sentiment lexica and alternatives to generate them in a fully automatic unsupervised way
Sentiment analysis aims at detecting sentiment polarities in unstructured Internet information. Emojis, whose use on Twitter has grown considerably in recent years, are a relevant part of this information and deserve attention. However, every time a new version of Unicode is released, finding out the sentiment users wish to express with a new emoji is challenging. In [KNSSM15], an Emoji Sentiment Ranking lexicon built from manual annotations of messages in different languages was presented. The quality of these annotations directly affects the quality of any emoji sentiment lexica generated from them (high quality corresponds to high self-agreement and inter-agreement). In many cases, the creators of the datasets do not provide any quality metrics, so another strategy is needed to detect this issue. Therefore, we propose an automatic approach to identify and manage inconsistent manual sentiment annotations. Then, relying on a new approach to generate emoji sentiment lexica of good quality, we compare two such lexica with lexica created from manually annotated datasets of poor and high quality.
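As a rough illustration of the kind of consistency check described above, the sketch below computes a chance-corrected inter-agreement score and flags items where two hypothetical annotators disagree. The abstract does not specify the quality criteria actually used, so the metric choice (Cohen's kappa) and the data are assumptions.

```python
# Minimal sketch (not the paper's actual procedure): flag inconsistent manual
# sentiment annotations using pairwise inter-annotator agreement.
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations: two annotators label the same messages as
# negative (-1), neutral (0) or positive (+1).
annotator_a = [1, 0, -1, 1, 0, 1, -1, 0]
annotator_b = [1, 0, -1, 0, 0, 1, 1, 0]

# Inter-agreement over the whole dataset (chance-corrected).
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Simple per-item consistency check: keep only messages both annotators agree on.
consistent = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a == b]
print(f"{len(consistent)} of {len(annotator_a)} annotations are consistent")
```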
Differentiating users by language and location estimation in sentiment analysis of informal text during major public events
In recent years there has been intense work on the analysis of social media to support marketing campaigns. A proper methodology for sentiment analysis is a crucial asset in this regard. However, when monitoring major public events, the behaviour of social media users may be strongly biased by specific actions of the participating characters and by the sense of group belonging, which is typically linked to specific geographical areas. In this paper, we present a solution combining a location prediction methodology with an unsupervised technique for sentiment analysis to automatically assess the engagement of social network users in different countries during an event with worldwide impact. As far as the authors know, this is the first time such techniques have been jointly considered. We demonstrate that the technique is coherent with the intrinsic disposition of individual users towards typical actions of the characters participating in the events, as well as with the sense of group belonging. Ministerio de Economía, Industria y Competitividad | Ref. TEC2016-76465-C2-2-R; Xunta de Galicia | Ref. GRC2014/046; Xunta de Galicia | Ref. ED341D R2016/01
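A minimal sketch of how the two components described above could be combined, assuming each tweet has already been assigned a predicted country and an unsupervised sentiment score; the data layout and the aggregation by mean are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: summarise per-country engagement from (predicted_country, sentiment)
# pairs produced by hypothetical upstream location and sentiment modules.
from collections import defaultdict
from statistics import mean

tweets = [
    ("ES", 0.6), ("ES", 0.1), ("DE", -0.3),
    ("DE", -0.5), ("BR", 0.8), ("BR", 0.4),
]

by_country = defaultdict(list)
for country, score in tweets:
    by_country[country].append(score)

for country, scores in sorted(by_country.items()):
    print(f"{country}: mean sentiment {mean(scores):+.2f} over {len(scores)} tweets")
```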
Lexicon for natural language generation in Spanish adapted to alternative and augmentative communication
In this paper we present Elsa, the first lexicon for Spanish with morphological, syntactic and semantic information automatically generated from a well-known pictogram resource and especially tailored for Augmentative and Alternative Communication (AAC). This lexicon, which focuses on an icon set widely used within AAC applications, is motivated by the need to improve Natural Language Generation (NLG) systems that aid people diagnosed with communication disorders. In addition, we design an automatic lexicon extension procedure based on a training process to complete the linguistic data. For this we used a dataset composed of novels and tales in Spanish with pictogram representations, since the lexicon is meant for AAC applications for children with disabilities. Moreover, we provide the algorithms used to build our lexicon and a use case of Elsa within an NLG system to illustrate the usability of our proposal. Agencia Estatal de Investigación | Ref. TEC2016-76465-C2-2-R; Xunta de Galicia | Ref. GRC2014/04
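The abstract does not detail Elsa's schema, so the sketch below only illustrates the kind of entry a lexicon combining pictograms with morphological, syntactic and semantic information could hold; every field name and value is an assumption.

```python
# Hypothetical entry structure for a pictogram-linked lexicon (illustrative only).
from dataclasses import dataclass, field

@dataclass
class LexiconEntry:
    lemma: str                                        # base form linked to a pictogram
    pictogram_id: int                                 # identifier in the pictogram resource (made up here)
    pos: str                                          # part of speech, e.g. "NOUN", "VERB"
    morphology: dict = field(default_factory=dict)    # gender, number, tense, ...
    semantic_tags: list = field(default_factory=list) # coarse semantic categories

entry = LexiconEntry(
    lemma="perro", pictogram_id=2317, pos="NOUN",
    morphology={"gender": "masc", "number": "sing"},
    semantic_tags=["animal"],
)
print(entry)
```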
GTI en TASS 2016: Una aproximación supervisada para el análisis de sentimiento basado en aspectos en Twitter
This paper describes the participation of the GTI research group of AtlantTIC, University of Vigo, in TASS 2016, a workshop framed within the XXXII edition of the Annual Congress of the Spanish Society for Natural Language Processing. In this work we propose a supervised, classifier-based approach for the aspect-based sentiment analysis task. Using this technique we improved on our performance of previous years, obtaining a solution in line with the current state of the art. Ministerio de Economía y Competitividad | Ref. TEC2013-47016-C2-1-R; Xunta de Galicia | Ref. GRC2014/04
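As an illustrative sketch of a supervised, classifier-based approach to aspect-level polarity (not the actual TASS 2016 system or its features), the following trains a TF-IDF plus linear SVM pipeline on a few made-up Spanish examples.

```python
# Sketch: aspect-based polarity as text classification, with the aspect word
# concatenated to the tweet text. Training data and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "bateria el movil dura muy poco",
    "camara las fotos salen geniales",
    "precio es demasiado caro",
    "pantalla se ve perfectamente",
]
train_labels = ["N", "P", "N", "P"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["camara la camara es muy mala"]))
```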
Evaluation of online emoji description resources for sentiment analysis purposes
Emoji sentiment analysis is a relevant research topic nowadays, for which emoji sentiment lexica are key assets. The quality of manual annotation directly affects these lexica (high quality usually corresponds to high self-agreement and inter-agreement). In this work we present an unsupervised methodology to evaluate emoji sentiment lexica generated from online resources, based on a correlation analysis between a gold standard and the scores resulting from the sentiment analysis of the emoji descriptions in those resources. We consider four such online resources of emoji descriptions: Emojipedia, Emojis.wiki, CLDR emoji character annotations and iEmoji. These resources provide knowledge about real (intended) emoji meanings from different author approaches and perspectives. We also present the automatic creation of a joint lexicon in which the sentiment of a given emoji is obtained by averaging its scores from the unsupervised analysis of all the resources involved. The results for the joint lexicon are highly promising, suggesting that valuable subjective information can be inferred from authors' descriptions in online resources. Agencia Estatal de Investigación | Ref. TEC2016-76465-C2-2-R; Xunta de Galicia | Ref. GRC2018/05
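A minimal sketch of the evaluation and aggregation ideas: correlating each resource-derived lexicon against a gold standard and averaging per-resource scores into a joint lexicon. The emoji scores and the choice of Pearson correlation are assumptions for illustration.

```python
# Sketch: correlation of resource-derived emoji scores with a gold standard,
# plus a joint lexicon built by averaging. All numbers below are made up.
from statistics import mean
from scipy.stats import pearsonr

gold = {"😀": 0.75, "😢": -0.60, "😡": -0.55, "❤": 0.80}
resources = {
    "resource_a": {"😀": 0.70, "😢": -0.50, "😡": -0.40, "❤": 0.85},
    "resource_b": {"😀": 0.60, "😢": -0.65, "😡": -0.60, "❤": 0.70},
}

emojis = sorted(gold)
for name, lex in resources.items():
    r, _ = pearsonr([gold[e] for e in emojis], [lex[e] for e in emojis])
    print(f"{name}: Pearson r = {r:.2f}")

# Joint lexicon: average each emoji's scores across all resources.
joint = {e: mean(lex[e] for lex in resources.values()) for e in emojis}
print(joint)
```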
A library for automatic natural language generation of Spanish texts
In this article we present a novel system for natural language generation (NLG) of Spanish sentences from a minimum set of meaningful words (such as nouns, verbs and adjectives) which, unlike other state-of-the-art solutions, performs the NLG task in a fully automatic way, exploiting both knowledge-based and statistical approaches. Relying on its linguistic knowledge of vocabulary and grammar, the system is able to generate complete, coherent and correctly spelled sentences from the main word sets provided by the user. The system, which was designed to be integrable, portable and efficient, can easily be adapted to other languages by design and can feasibly be integrated into a wide range of digital devices. During its development we also created a supplementary lexicon for Spanish, aLexiS, with wide coverage and high precision, as well as syntactic trees from a freely available definite-clause grammar. The resulting NLG library has been evaluated both automatically and manually (annotation). The system can potentially be used in different application domains such as augmentative communication and automatic generation of administrative reports or news. Xunta de Galicia | Ref. ED341D R2016/012; Xunta de Galicia | Ref. GRC 2014/046; Ministerio de Economía, Industria y Competitividad | Ref. TEC2016-76465-C2-2-R
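As a toy illustration of the core idea of realising a complete sentence from a minimum set of content words, the snippet below completes a subject-verb-object triple in Spanish with a hard-coded inflection table; the actual library relies on a full lexicon and a definite-clause grammar rather than anything this simple.

```python
# Toy realiser: expand a minimal word set into a complete Spanish sentence.
# The inflection table and determiner choice are hard-coded assumptions.
def realise(subject: str, verb_lemma: str, obj: str) -> str:
    third_person = {"comer": "come", "leer": "lee", "querer": "quiere"}
    verb = third_person.get(verb_lemma, verb_lemma)
    # Very naive determiner insertion, just to show the completion step.
    sentence = f"el {subject} {verb} una {obj}."
    return sentence[0].upper() + sentence[1:]

print(realise("niño", "comer", "manzana"))   # -> "El niño come una manzana."
```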
Creating emoji lexica from unsupervised sentiment analysis of their descriptions
Online media, such as blogs and social networking sites, generate massive volumes of unstructured data of great interest for analysing the opinions and sentiments of individuals and organizations. Novel approaches beyond Natural Language Processing are necessary to quantify these opinions with polarity metrics. So far, the sentiment expressed by emojis has received little attention, yet the use of these symbols has boomed in the past four years: about twenty billion are typed on Twitter nowadays, and new emojis keep appearing with each new Unicode version, making them increasingly relevant to sentiment analysis tasks. This has motivated us to propose a novel approach to predict the sentiments expressed by emojis in online textual messages, such as tweets, that does not require human effort to manually annotate data and saves valuable time for other analysis tasks. For this purpose, we automatically constructed a novel emoji sentiment lexicon using an unsupervised sentiment analysis system based on the definitions given by emoji creators in Emojipedia. Additionally, we automatically created lexicon variants by also considering the sentiment distribution of the informal texts accompanying emojis. All these lexica are evaluated and compared regarding the improvement obtained by including them in the sentiment analysis of the annotated datasets provided by Kralj Novak, Smailovic, Sluban and Mozetic (2015). The results confirm the competitiveness of our approach. Agencia Estatal de Investigación | Ref. TEC2016-76465-C2-2-R; Xunta de Galicia | Ref. GRC2014/046; Xunta de Galicia | Ref. ED341D R2016/01
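The sketch below conveys the general idea of deriving an emoji's polarity from the sentiment of its textual description. The paper uses its own unsupervised analyser over Emojipedia definitions; here VADER is used only as a readily available stand-in, and the descriptions are abbreviated paraphrases, not the Emojipedia texts.

```python
# Sketch: build an emoji sentiment lexicon by scoring each emoji's description
# with an off-the-shelf sentiment analyser (VADER as a stand-in).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

descriptions = {
    "😀": "a happy smiling face showing joy and warmth",
    "😢": "a sad crying face expressing sorrow",
}
lexicon = {emoji: sia.polarity_scores(text)["compound"]
           for emoji, text in descriptions.items()}
print(lexicon)
```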
A system for automatic English text expansion
We present an automatic text expansion system to generate English sentences, which performs automatic Natural Language Generation (NLG) by combining linguistic rules with statistical approaches. Here, "automatic" means that the system can generate coherent and correct sentences from a minimum set of words. From its inception, the design is modular and adaptable to other languages; this adaptability is one of its greatest advantages. For English, we created the highly precise, wide-coverage aLexiE lexicon, which is a contribution in its own right. We evaluated the resulting NLG library in an Augmentative and Alternative Communication (AAC) proof of concept, both directly (by regenerating corpus sentences) and manually (from annotations), using a popular corpus in the NLG field. We performed a second analysis by comparing the quality of text expansion in English against Spanish, using an ad hoc Spanish-English parallel corpus. The system could also be applied to other domains such as report and news generation. Ministerio de Economía, Industria y Competitividad | Ref. TEC2016-76465-C2-2-R; Xunta de Galicia | Ref. GRC-2018/53; Xunta de Galicia | Ref. ED341D R2016/012; University of Aberdeen
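A minimal sketch of the "direct" evaluation idea, regenerating corpus sentences and comparing them with the originals; the token-overlap ratio used here is an illustrative stand-in, not the metric reported in the paper.

```python
# Sketch: compare a regenerated sentence against its corpus reference with a
# simple token-overlap (Jaccard-style) ratio, for illustration only.
def token_overlap(reference: str, generated: str) -> float:
    ref, gen = set(reference.lower().split()), set(generated.lower().split())
    return len(ref & gen) / len(ref | gen) if ref | gen else 1.0

reference = "the dog chased the ball across the garden"
generated = "the dog chased a ball across the garden"
print(f"overlap: {token_overlap(reference, generated):.2f}")
```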
A System for Automatic English Text Expansion
This work was supported in part by the Mineco, Spain, under Grant TEC2016-76465-C2-2-R, in part by the Xunta de Galicia, Spain, under Grant GRC-2018/53 and Grant ED341D R2016/012, and in part by a University of Vigo travel grant to visit the CLAN Research Group, University of Aberdeen, U.K.
Identifying banking transaction descriptions via support vector machine short-text classification based on a specialized labelled corpus
Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, the information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that had not previously been considered in the literature. We trained and tested the system on a labelled dataset of real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short-text similarity detector, based on the Jaccard distance, to reduce training set size. Experimental results with a two-stage classifier combining this detector with an SVM indicate high accuracy in comparison with alternative approaches, taking complexity and computing time into account. Finally, we present a use case with a personal finance application, CoinScrap, which is available on Google Play and the App Store. Ministerio de Economía, Industria y Competitividad | Ref. TEC2016-76465-C2-2-R; Xunta de Galicia | Ref. GRC2018/053; Xunta de Galicia | Ref. ED341D-R2016/01
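The sketch below illustrates the two ideas described above with made-up transaction strings: a Jaccard-distance filter that drops near-duplicate training descriptions, followed by a TF-IDF plus SVM classifier. The threshold, labels and features are assumptions, not the system's actual configuration.

```python
# Sketch: reduce the training set with a Jaccard-distance near-duplicate filter,
# then train a TF-IDF + linear SVM classifier on the remaining descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def jaccard_distance(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb)

raw = [
    ("card purchase coffee shop 0231", "leisure"),
    ("card purchase coffee shop 0232", "leisure"),   # near-duplicate; removed by the filter
    ("monthly payroll deposit acme", "income"),
    ("direct debit electricity bill", "utilities"),
]

# Keep a description only if it is sufficiently distant from those already kept.
kept = []
for text, label in raw:
    if all(jaccard_distance(text, t) > 0.4 for t, _ in kept):
        kept.append((text, label))

texts, labels = zip(*kept)
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["card purchase coffee bar 0city"]))
```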