Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models
This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system is an improvement on the tool we submitted to the Tweet-Norm 2013 shared task, and its results on the task's test corpus are above average. Additionally, we study the impact of each of the system's components on tweet normalization: rule-based, edit-distance-based, and statistical.
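The pipeline described above, generating in-vocabulary candidates within an edit-distance threshold and then letting a language model pick the candidate that best fits the context, can be sketched as follows. The lexicon, bigram counts, and threshold here are toy values for illustration, not the resources used by the actual system.

```python
# Toy lexicon and bigram counts; a real system would derive these from large corpora.
LEXICON = {"que", "porque", "bueno", "gracias", "mucho"}
BIGRAM_COUNTS = {("muy", "bueno"): 9, ("muy", "porque"): 1}

def edit_distance(a, b):
    # Classic Levenshtein dynamic program over a single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def candidates(oov, max_dist=2):
    # Edit-distance model: any in-vocabulary word within the threshold is a candidate.
    return [w for w in LEXICON if edit_distance(oov, w) <= max_dist]

def best_correction(prev_word, oov):
    # Language-model step: rank candidates by bigram frequency with the left context.
    cands = candidates(oov)
    if not cands:
        return oov  # leave unknown words untouched
    return max(cands, key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

print(best_correction("muy", "wueno"))  # -> bueno
```

A real system would add preprocessing rules before candidate generation (e.g. for laughter or character repetitions) and score with a smoothed n-gram model rather than raw counts.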
Dialectometric analysis of language variation in Twitter
In the last few years, microblogging platforms such as Twitter have given rise to a deluge of textual data that can be used for the analysis of informal communication between millions of individuals. In this work, we propose an information-theoretic approach to geographic language variation using a corpus based on Twitter. We test our models with tens of concepts and their associated keywords detected in Spanish tweets geolocated in Spain. We employ dialectometric measures (cosine similarity and Jensen-Shannon divergence) to quantify the linguistic distance on the lexical level between cells created in a uniform grid over the map. This can be done for a single concept or, in the general case, taking into account an average of the considered variants. The latter permits an analysis of the dialects that naturally emerge from the data. Interestingly, our results reveal the existence of two dialect macrovarieties. The first group includes a region-specific speech spoken in small towns and rural areas, whereas the second cluster encompasses cities that tend to use a more uniform variety. Since the results obtained with the two different metrics qualitatively agree, our work suggests that social media corpora can be efficiently used for dialectometric analyses.
Comment: 10 pages, 7 figures, 1 table. Accepted to VarDial 201
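The two dialectometric measures named in the abstract can be computed directly from per-cell word-frequency distributions. A minimal sketch, with invented lexical variants standing in for the keyword counts of two grid cells:

```python
import math
from collections import Counter

def cosine_similarity(p, q):
    # Cosine of the angle between two sparse frequency vectors.
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def js_divergence(p, q):
    # Jensen-Shannon divergence (base 2, so values lie in [0, 1]).
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def cell_distribution(tokens):
    # Normalize raw counts in one grid cell into a probability distribution.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

cell_a = cell_distribution(["coche", "coche", "autobús"])
cell_b = cell_distribution(["carro", "coche", "autobús"])
print(cosine_similarity(cell_a, cell_b), js_divergence(cell_a, cell_b))
```

High cosine similarity and low JS divergence both indicate that two cells use the lexical variants of a concept in similar proportions; the paper averages such per-concept distances to cluster cells into dialect areas.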
A modular approach for lexical normalization applied to Spanish tweets
Twitter is a social media platform with widespread success where millions of people continuously express ideas and opinions about a myriad of topics. It is a huge and interesting source of data, but most of these texts are usually written hastily and very abbreviated, rendering them unsuitable for traditional Natural Language Processing (NLP). The two main contributions of this work are: the characterization of the textual error phenomena in Twitter and the proposal of a modular normalization system that improves the textual quality of tweets. Instead of focusing on a single technique, we propose an extensible normalization system that relies on the combination of several independent "expert modules", each one addressing a very specific error phenomenon in its own way, thus increasing module accuracy and lowering the module building costs. Broadly speaking, the system resembles an "expert board": modules independently propose correction candidates for each Out of Vocabulary (OOV) word, the candidates are ranked, and the best one is selected. In order to evaluate our proposal, we perform several experiments using texts from Twitter written in Spanish about a specific topic. The flexibility of defining resources at different language levels (core language, domain, genre) combined with the modular architecture leads to lower costs and a good performance: requiring a minimal effort for building the resources and achieving more than 82% accuracy compared to the 31% yielded by the baseline.
Ministerio de Economía y Competitividad TIN2012-38536-C03-02; Junta de Andalucía P11-TIC-7684 M
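The "expert board" idea, where independent modules each propose scored correction candidates for an OOV word and the highest-scoring proposal wins, can be sketched like this. The module names, lookup tables, and scores are hypothetical stand-ins for the paper's actual modules and resources.

```python
# Each "expert module" targets one error phenomenon and returns
# (candidate, score) proposals; the board keeps the best proposal.
REPETITIONS = {"holaaa": "hola"}          # toy character-repetition lookup
SMS_ABBREVIATIONS = {"q": "que", "tb": "también"}  # toy abbreviation lookup

def repetition_module(oov):
    return [(REPETITIONS[oov], 0.9)] if oov in REPETITIONS else []

def abbreviation_module(oov):
    return [(SMS_ABBREVIATIONS[oov], 0.8)] if oov in SMS_ABBREVIATIONS else []

MODULES = [repetition_module, abbreviation_module]

def normalize(oov):
    # Collect proposals from every module, then rank and pick the best.
    proposals = [p for module in MODULES for p in module(oov)]
    if not proposals:
        return oov  # no expert fires: keep the original token
    return max(proposals, key=lambda cand_score: cand_score[1])[0]

print(normalize("q"), normalize("holaaa"), normalize("twitter"))
# -> que hola twitter
```

The appeal of this design is that a new error phenomenon (phonetic spelling, emoticons, missing accents) is handled by adding one small module, without retraining or touching the others.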
Multilingual sentiment analysis in social media.
252 p.
This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing only analysis for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish. The thesis addresses the following challenges to build such a system:
- Analysing methods for creating sentiment lexicons, suitable for less-resourced languages.
- Analysis of social media (specifically Twitter): tweets pose several challenges in order to understand and extract opinions from such messages. Language identification and microtext normalization are addressed.
- Research the state of the art in polarity classification, and develop a supervised classifier that is tested against well-known social media benchmarks.
- Develop a social media monitor capable of analysing sentiment with respect to specific events, products or organizations.
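The simplest baseline that a sentiment lexicon enables, and the starting point such systems typically build on, is a scorer that sums word polarities and flips the sign after a negator. The lexicon entries and negator list below are invented for illustration; they are not the thesis's resources.

```python
# Minimal lexicon-based polarity scorer (toy lexicon, English for readability).
POLARITY_LEXICON = {"good": 1, "happy": 1, "bad": -1, "terrible": -2}
NEGATORS = {"not", "no", "never"}

def polarity(tokens):
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True  # flip the polarity of the next word only
            continue
        weight = POLARITY_LEXICON.get(tok, 0)
        score += -weight if negate else weight
        negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("not good today".split()))  # -> negative
```

A supervised classifier of the kind the thesis benchmarks would replace these hand-set weights with features learned from labelled data, but the lexicon remains a key feature source for less-resourced languages.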
TweetNorm: a benchmark for lexical normalization of Spanish tweets
The language used in social media is often characterized by the abundance of informal and non-standard writing. The normalization of this non-standard language can be crucial to facilitate subsequent textual processing and, consequently, to help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation has brought to light the types of words that submitted systems did best with, and posits the main shortcomings to be addressed in future work.
Postprint (published version)
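The core of such an evaluation framework is straightforward: each annotated OOV token has a gold normalization, and a system is scored by the fraction of those tokens it normalizes correctly. A minimal sketch with invented tweet IDs and tokens (not the TweetNorm_es data or its official scorer):

```python
# Gold annotations and system output, keyed by (tweet_id, oov_token).
gold = {("t1", "q"): "que", ("t1", "xq"): "porque", ("t2", "bn"): "bien"}
system = {("t1", "q"): "que", ("t1", "xq"): "porqué", ("t2", "bn"): "bien"}

def accuracy(gold, system):
    # A missing system entry counts as wrong, like an unproposed correction.
    correct = sum(system.get(key) == ref for key, ref in gold.items())
    return correct / len(gold)

print(accuracy(gold, system))  # 2 of 3 pairs match -> 0.666...
```

Scoring only the annotated OOV tokens, rather than every token, keeps the metric focused on the normalization decisions themselves.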
Workshop Proceedings of the 12th edition of the KONVENS conference
The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies that such interaction, cooperation and integrated views can produce. This topic, at the crossroads of different research traditions that deal with natural language as a container of knowledge and with methods to extract and manage linguistically represented knowledge, is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute's research topics, and it has received even more attention over the last few years.
- …