Search CORE

7 research outputs found

On the performance of phonetic algorithms in microtext normalization

Author: Doval Mosquera Yerai
Vilares Ferro Jesús
Vilares Ferro Manuel
Publication venue: COmputational LEarnig
Publication date: 05/12/2023
Field of study

User–generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making its processing by traditional intelligent systems very difficult. As an answer, microtext normalization consists in transforming those non–standard microtexts into standard well–written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non–standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so–called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that taking as input non–standard terms to be normalized allow us to obtain as output the smallest possible sets of normalization candidates which still contain the corresponding target standard words. As it will be stated, the choice of the phonetic algorithm will depend heavily on the capabilities of the candidate selection mechanism which we usually find at the end of a microtext normalization pipeline. The faster it can make the right choices among big enough sets of candidates, the more we can sacrifice on the precision of the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization systemAgencia Estatal de Investigación | Ref. TIN2017-85160-C2-1-RAgencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-RMinisterio de Economía y Competitividad | Ref. FFI2014-51978-C2-1-RMinisterio de Economía y Competitividad | Ref. FFI2014-51978-C2-2-RXunta de Galicia | Ref. ED431D-2017/12Xunta de Galicia | Ref. ED431B2017/01Xunta de Galicia | Ref. ED431D R2016/046Ministerio de Economía y Competitividad | Ref. BES-2015-07376

Investigo

On the performance of phonetic algorithms in microtext normalization

Author: Doval Yerai
Vilares Ferro Manuel
Vilares Jesús
Publication venue: Elsevier
Publication date: 15/12/2018
Field of study

© 2018. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article: Doval, Y., Vilares, M. and Vilares, J. (2018) ‘On the performance of phonetic algorithms in microtext normalization’ has been accepted for publication in: Expert Systems with Applications, 113, pp. 213–222. The Version of Record is available online at: https://doi.org/10.1016/j.eswa.2018.07.016[Abstract]: User–generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making its processing by traditional intelligent systems very difficult. As an answer, microtext normalization consists in transforming those non–standard microtexts into standard well–written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non–standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so–called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that taking as input non–standard terms to be normalized allow us to obtain as output the smallest possible sets of normalization candidates which still contain the corresponding target standard words. As it will be stated, the choice of the phonetic algorithm will depend heavily on the capabilities of the candidate selection mechanism which we usually find at the end of a microtext normalization pipeline. The faster it can make the right choices among big enough sets of candidates, the more we can sacrifice on the precision of the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization system.This research has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) through projects TIN2017-85160-C2-1-R, TIN2017-85160-C2-2-R, FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R, and by the Autonomous Government of Galicia through projects ED431D-2017/12, ED431B-2017/01 and ED431D R2016/046. Moreover, Yerai Doval is funded by the Spanish State Secretariat for Research, Development and Innovation (which belongs to MINECO) and by the European Social Fund (ESF) under a FPI fellowship (BES-2015-073768) associated to project FFI2014-51978-C2-1-R.Xunta de Galicia; ED431D-2017/12Xunta de Galicia; ED431B-2017/01Xunta de Galicia; ED431D R2016/04

Repositorio da Universidade da Coruña

On the performance of phonetic algorithms in microtext normalization

Author: Doval Yerai
Vilares Jesús
Vilares Manuel
Publication venue
Publication date: 04/02/2024
Field of study

User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making its processing by traditional intelligent systems very difficult. As an answer, microtext normalization consists in transforming those non-standard microtexts into standard well-written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non-standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so-called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that taking as input non-standard terms to be normalized allow us to obtain as output the smallest possible sets of normalization candidates which still contain the corresponding target standard words. As it will be stated, the choice of the phonetic algorithm will depend heavily on the capabilities of the candidate selection mechanism which we usually find at the end of a microtext normalization pipeline. The faster it can make the right choices among big enough sets of candidates, the more we can sacrifice on the precision of the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization system. KEYWORDS: microtext normalization; phonetic algorithm; fuzzy matching; Twitter; textingComment: Accepted for publication in journal Expert Systems with Application

arXiv.org e-Print Archive

Introducción a la tarea compartida Tweet-Norm 2013: Normalización léxica de tuits en español

Author: Alegria Iñaki
Aranberri Nora
Fresno Víctor
Padró Lluís
Samallo Pablo
San Vicente Iñaki
Turmo Borras Jorge
Zubiaga Arkaitz
Publication venue
Publication date: 01/01/2013
Field of study

En este artículo se presenta una introducción a la tarea Tweet-Norm 2013 : descripción, corpora, anotación, preproceso, sistemas presentados y resultados obtenidos.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

TweetNorm: a benchmark for lexical normalization of spanish tweets

Author: Alegria Iñaki
Aranberri Nora
Comas Umbert Pere Ramon
Fresno Víctor
Gamallo Pablo
Padró Lluís
San Vicente Roncal Iñaki
Turmo Borras Jorge
Zubiaga Arkaitz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

The language used in social media is often characterized by the abundance of informal and non-standard writing. The normalization of this non-standard language can be crucial to facilitate the subsequent textual processing and to consequently help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish language. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework, as well as an annotated corpus of Spanish tweets-TweetNorm_es-, which we make publicly available. The creation of this benchmark and the evaluation has brought to light the types of words that submitted systems did best with, and posits the main shortcomings to be addressed in future work.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

Warwick Research Archives Portal Repository

Multilingual sentiment analysis in social media.

Author: San Vicente Roncal Iñaki
Publication venue
Publication date: 11/03/2019
Field of study

252 p.This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language a tool providing only analysis for Basque would not be enough for a real world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish.The thesis addresses the following challenges to build such a system:- Analysing methods for creating Sentiment lexicons, suitable for less resourced languages.- Analysis of social media (specifically Twitter): Tweets pose several challenges in order to understand and extract opinions from such messages. Language identification and microtext normalization are addressed.- Research the state of the art in polarity classification, and develop a supervised classifier that is tested against well known social media benchmarks.- Develop a social media monitor capable of analysing sentiment with respect to specific events, products or organizations

Archivo Digital para la Docencia y la Investigación

Multilingual sentiment analysis in social media.

Author: San Vicente Roncal Iñaki
Publication venue
Publication date: 01/01/2019
Field of study

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital para la Docencia y la Investigación