68 research outputs found
El modelo probabilístico: características y modelos derivados
A review of the state of the art of the probabilistic family of Information Retrieval models is presented. Starting from the basic principles underlying these models, different specific models are analysed: the Binary Independence Retrieval (BIR) model, the most basic one; the now classic BM25; and finally the DFR (Divergence From Randomness) models, one of the latest developments.
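For reference, the classic BM25 function reviewed here can be stated compactly: a document is scored against a query by summing, over query terms, an IDF factor times a saturating term-frequency factor. The following sketch uses a standard formulation with the usual k1 and b parameters and a toy corpus of our own; it is not code from the paper.

# A minimal sketch of the classic BM25 ranking function (standard
# formulation; the corpus below is a toy example of our own).
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N     # average document length
    score = 0.0
    for term in query:
        n = sum(term in d for d in docs)      # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        tf = doc.count(term)                  # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["probabilistic", "retrieval", "model"],
        ["binary", "independence", "model"],
        ["divergence", "from", "randomness"]]
query = ["probabilistic", "model"]
for d in docs:
    print(" ".join(d), "->", round(bm25_score(query, d, docs), 3))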
On the performance of phonetic algorithms in microtext normalization
User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, which makes their processing by traditional intelligent systems very difficult. In response, microtext normalization consists of transforming those non-standard microtexts into standard, well-written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non-standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so-called phonetic algorithms.
In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that, taking non-standard terms as input, yield the smallest possible sets of normalization candidates that still contain the corresponding standard target words. As we will show, the choice of phonetic algorithm depends heavily on the capabilities of the candidate selection mechanism usually found at the end of a microtext normalization pipeline. The faster it can make the right choices among sufficiently large sets of candidates, the more we can sacrifice the precision of the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization system.
KEYWORDS: microtext normalization; phonetic algorithm; fuzzy matching; Twitter; texting
Comment: Accepted for publication in the journal Expert Systems with Applications.
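To make the candidate-generation setting concrete, the following sketch indexes a standard vocabulary by phonetic key and retrieves the candidate set for each non-standard input term. Soundex stands in here for the much wider family of algorithms compared in the paper, and the vocabulary and input terms are toy examples of our own.

# A toy Soundex-based candidate generator for microtext normalization.
from collections import defaultdict

SOUNDEX_CODES = {c: d for letters, d in [
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6")] for c in letters}

def soundex(word: str) -> str:
    """Classic 4-character Soundex key: first letter plus three digits."""
    word = word.upper()
    digits, prev = [], SOUNDEX_CODES.get(word[0], "")
    for c in word[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "HW":  # H and W do not break runs of equal codes
            prev = code
    return (word[0] + "".join(digits) + "000")[:4]

# Index a (toy) standard vocabulary by phonetic key ...
vocabulary = ["tomorrow", "tonight", "great", "night", "mate"]
index = defaultdict(set)
for w in vocabulary:
    index[soundex(w)].add(w)

# ... and generate normalization candidates for non-standard terms.
# Note that "nite" misses its target ("nite" -> N300, "night" -> N230):
# exactly the precision/coverage trade-off the study quantifies.
for term in ["tomoro", "grate", "nite"]:
    print(term, "->", index.get(soundex(term), set()))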
Misspelled queries in cross-language IR: analysis and management
This paper studies the impact of misspelled queries on the performance of Cross-Language Information Retrieval systems and proposes two strategies for dealing with them: the use of automatic spelling correction techniques, and the use of character n-grams both as index terms and as translation units, thus allowing us to take advantage of their inherent robustness. Our results demonstrate the sensitivity of these systems to such errors, as well as the effectiveness of the proposed solutions. To the best of our knowledge, there is no similar work in the cross-language field.
This work was partially funded by the Ministerio de Economía y Competitividad and FEDER (projects TIN2010-18552-C03-01 and TIN2010-18552-C03-02) and by the Xunta de Galicia (grants CN 2012/008, CN 2012/317 and CN 2012/319).
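As an illustration of why character n-grams are robust to misspellings, the sketch below decomposes terms into overlapping n-grams and scores overlap with the Dice coefficient: a misspelled query term still shares most of its n-grams with the intended index term, so scores degrade gracefully. The toy terms and the use of Dice here are our own illustrative choices, not the paper's actual retrieval setup.

# Character n-grams as robust matching units (toy example; the actual
# systems use n-grams as index terms inside a retrieval engine).
def char_ngrams(term: str, n: int = 3) -> set:
    """Overlapping character n-grams, padded to mark word boundaries."""
    padded = f"_{term}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a: set, b: set) -> float:
    """Dice coefficient between two n-gram sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

index_terms = ["elephant", "elegant", "relevant"]
query = "elephnat"  # transposition error for "elephant"
for term in sorted(index_terms,
                   key=lambda t: dice(char_ngrams(query), char_ngrams(t)),
                   reverse=True):
    print(term, round(dice(char_ngrams(query), char_ngrams(term)), 2))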
Corrupted queries in text retrieval
In this paper, we propose two different alternatives for dealing with degraded queries in Spanish Information Retrieval applications. The first is based on character n-grams and has no dependence on the linguistic knowledge and resources available. As the second alternative, we propose two spelling correction techniques, one of which integrates a stochastic model that must be previously trained on a PoS-tagged corpus. In order to study their validity, a testing framework has been designed, on which both approaches have been evaluated.
This work was partially funded by the Ministerio de Educación y Ciencia and FEDER (through research projects HUM2007-66607-C04-02 and HUM2007-66607-C04-03) and by the Xunta de Galicia (through projects 05PXIC30501PN, 07SIN005206PR, INCITE07PXI104119ES and the "Red Gallega de PLN y RI").
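The contextless core of the first spelling-correction technique above can be sketched as picking the lexicon entry at minimum edit distance from the corrupted query term; the stochastic, PoS-based ranking of the second technique is not reproduced here. Lexicon and query are hypothetical.

# A contextless spelling-correction baseline: choose the lexicon word
# at minimum Levenshtein distance from the corrupted query term.
def levenshtein(s: str, t: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

lexicon = ["retrieval", "removal", "reversal"]   # toy lexicon
query = "retreival"                              # corrupted query term
best = min(lexicon, key=lambda w: levenshtein(query, w))
print(best, levenshtein(query, best))            # -> retrieval 2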
Adaptive scheduling for adaptive sampling in POS taggers construction
We introduce an adaptive scheduling for adaptive sampling as a novel machine learning approach to the construction of part-of-speech taggers. The goal is to speed up training on large data sets, without significant loss of performance with regard to an optimal configuration. In contrast to previous methods, which use a random, fixed or regularly rising spacing between the instances, ours analyzes the shape of the learning curve geometrically, in conjunction with a functional model, to increase or decrease that spacing at any time. The algorithm proves to be formally correct with regard to our working hypotheses: namely, given a case, the following one is the nearest that ensures a net gain of learning ability over the former, and the level of requirement for this condition can be modulated. We also improve the robustness of sampling by paying greater attention to those regions of the training database subject to a temporary inflation in performance, thus preventing the learning from stopping prematurely. The proposal has been evaluated on the basis of its reliability in identifying the convergence of models, corroborating our expectations. While a concrete halting condition is used for testing, users can choose any condition whatsoever to suit their own specific needs.
Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-1-R
Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-R
Xunta de Galicia | Ref. ED431C 2018/50
Xunta de Galicia | Ref. ED431D 2017/1
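The following sketch illustrates the general idea of the abstract above under illustrative assumptions: fit a saturating functional model to the accuracies observed so far and grow the spacing while the fitted curve still predicts a net gain above a threshold. The power-law form, the doubling policy and the threshold are our own choices, not the paper's exact model or halting condition.

# Sketch: couple a functional model of the learning curve with an
# adaptive step for choosing the next training-set size.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Saturating learning curve: accuracy as a function of set size."""
    return a - b * n ** (-c)

# Observed (training size, accuracy) points so far; toy numbers.
sizes = np.array([1000, 2000, 4000, 8000])
accs = np.array([0.88, 0.91, 0.93, 0.942])
(a, b, c), _ = curve_fit(power_law, sizes, accs, p0=(0.97, 5.0, 0.5))

def next_size(current: int, step: int, min_gain: float = 1e-3) -> int:
    """Grow the step while the fitted curve still predicts a net gain."""
    while power_law(current + 2 * step, a, b, c) \
            - power_law(current + step, a, b, c) > min_gain:
        step *= 2  # curve still steep enough: take larger jumps
    return current + step

print(next_size(8000, 1000))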
TIR over Egyptian Hieroglyphs
[Abstract] This work presents an Information Retrieval system specifically designed to manage Ancient Egyptian hieroglyphic texts, taking into account their peculiarities both at the lexical and at the encoding level, for its application in Egyptology and Digital Heritage. The tool has been made freely available to the research community under a free license and, to the best of our knowledge, it is the first tool of its kind.
We would like to thank Dr. Josep Cervelló Autuori, Director of the Institut d'Estudis del Pròxim Orient Antic (IEPOA) of the Universitat Autònoma de Barcelona, for introducing us to Egyptian; and Dr. Serge Rosmorduc, Associate of the Conservatoire National des Arts et Métiers (CNAM), for his support with JSesh.
Ministerio de Economía y Competitividad; FFI2014-51978-C2-2-
Sentiment Analysis for Fake News Detection
[Abstract] In recent years, we have witnessed a rise in fake news, i.e., provably false pieces of information created with the intention of deception. The dissemination of this type of news poses a serious threat to social cohesion and well-being, since it fosters political polarization and people's distrust of their leaders. The huge amount of news that is disseminated through social media makes manual verification unfeasible, which has promoted the design and implementation of automatic systems for fake news detection. The creators of fake news use various stylistic tricks to promote the success of their creations, one of them being to excite the sentiments of the recipients. This has led to sentiment analysis, the part of text analytics in charge of determining the polarity and strength of the sentiments expressed in a text, being used in fake news detection approaches, either as a basis of the system or as a complementary element. In this article, we study the different uses of sentiment analysis in the detection of fake news, with a discussion of the most relevant elements and shortcomings, and the requirements that should be met in the near future, such as multilingualism, explainability, mitigation of biases, or treatment of multimedia elements.
Xunta de Galicia; ED431G 2019/01
Xunta de Galicia; ED431C 2020/11
This work has been funded by FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación through the ANSWERASAP project (TIN2017-85160-C2-1-R); and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as a Research Center of the Galician University System, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER), with 80% from the Galicia ERDF 2014-20 Operational Programme and the remaining 20% from the Secretaría Xeral de Universidades (ref. ED431G 2019/01). David Vilares is also supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation. Carlos Gómez-Rodríguez has also received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant No. 714150).
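As a minimal illustration of the most common integration pattern discussed in the article, the sketch below computes a lexicon-based polarity score and uses it as one feature of a hypothetical fake news detector; the tiny lexicon, the feature set and the sample headline are all illustrative assumptions, not material from the article.

# Sentiment polarity as one feature among others for fake news
# detection (toy lexicon and data; real systems use far richer models).
POLARITY = {"shocking": -0.8, "outrage": -0.9, "disaster": -0.7,
            "calm": 0.4, "confirmed": 0.3, "official": 0.2}

def sentiment_score(text: str) -> float:
    """Average lexicon polarity of the words in the text."""
    hits = [POLARITY[w] for w in text.lower().split() if w in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

def features(text: str) -> list:
    """Polarity and sentiment strength next to a simple length feature."""
    s = sentiment_score(text)
    return [s, abs(s), len(text.split())]

headline = "Shocking disaster covered up, officials silent"
print(features(headline))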
Any papyrus about "a hand over a stool and a bread loaf, followed by a boat"? Dealing with hieroglyphic texts in IR
[Abstract] Digital Heritage deals with the use of computing and information technologies for the preservation and study of the human cultural legacy. Within this context, we present here a Text Retrieval system developed specifically to work with Egyptian hieroglyphic texts, for use by Egyptologists and linguists in the study and preservation of Ancient Egyptian scripts. We intend to make it freely available to the Egyptology research community. To the best of our knowledge, this is the first tool of its kind.
We would like to thank Dr. Josep Cervelló Autuori, Director of the Institut d'Estudis del Pròxim Orient Antic (IEPOA) of the Universitat Autònoma de Barcelona, for introducing us to the Ancient Egyptian language and acting as our fictional client. We would also like to thank Dr. Serge Rosmorduc, Associate of the Conservatoire National des Arts et Métiers (CNAM), for all his support when working with JSesh.
Ministerio de Economía y Competitividad; FFI2014-51978-C2-2-