68 research outputs found

    Adaptive scheduling for adaptive sampling in pos taggers construction

    We introduce adaptive scheduling for adaptive sampling as a novel machine learning approach to the construction of part-of-speech taggers. The goal is to speed up training on large data sets without significant loss of performance with regard to an optimal configuration. In contrast to previous methods, which use a random, fixed or regularly rising spacing between the instances, ours analyzes the shape of the learning curve geometrically, in conjunction with a functional model, to increase or decrease that spacing at any time. The algorithm proves to be formally correct with respect to our working hypotheses. Namely, given a case, the next one selected is the nearest that ensures a net gain in learning ability over the former, and the stringency of this condition can be adjusted. We also improve the robustness of sampling by paying greater attention to those regions of the training data base subject to a temporary inflation in performance, thus preventing the learning from stopping prematurely. The proposal has been evaluated on the basis of its reliability in identifying the convergence of models, corroborating our expectations. While a concrete halting condition is used for testing, users can choose any condition that suits their own specific needs.
    Funding: Agencia Estatal de Investigación (Ref. TIN2017-85160-C2-1-R, TIN2017-85160-C2-2-R); Xunta de Galicia (Ref. ED431C 2018/50, ED431D 2017/1)

    On the performance of phonetic algorithms in microtext normalization

    User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, which makes their processing by traditional intelligent systems very difficult. In response, microtext normalization transforms those non-standard microtexts into standard well-written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non-standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so-called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that, taking as input the non-standard terms to be normalized, yield as output the smallest possible sets of normalization candidates that still contain the corresponding target standard words. As will be shown, the choice of phonetic algorithm depends heavily on the capabilities of the candidate selection mechanism usually found at the end of a microtext normalization pipeline. The faster it can make the right choices among sufficiently large sets of candidates, the more precision we can sacrifice in the phonetic algorithms in favour of coverage, in order to increase the overall performance of the normalization system.
    Funding: Agencia Estatal de Investigación (Ref. TIN2017-85160-C2-1-R, TIN2017-85160-C2-2-R); Ministerio de Economía y Competitividad (Ref. FFI2014-51978-C2-1-R, FFI2014-51978-C2-2-R, BES-2015-07376); Xunta de Galicia (Ref. ED431D-2017/12, ED431B2017/01, ED431D R2016/046)
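    As a rough illustration of phonetic candidate generation, the sketch below uses classic American Soundex, chosen here only as a familiar example of a phonetic algorithm; the abstract does not say which algorithms were evaluated. The lexicon, the helper names `build_index` and `candidates`, and the sample terms are hypothetical.

```python
from collections import defaultdict

# Classic American Soundex consonant codes (vowels, H, W and Y have no code).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    key = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (key + "000")[:4]


def build_index(lexicon):
    """Group standard words by phonetic key (hypothetical helper)."""
    index = defaultdict(set)
    for word in lexicon:
        index[soundex(word)].add(word)
    return index


def candidates(term, index):
    """Normalization candidates: standard words sharing the term's key."""
    return index.get(soundex(term), set())


index = build_index(["love", "leave", "live", "later", "tomorrow"])
print(candidates("luv", index))   # the set holding love, leave and live
print(candidates("tmrw", index))  # the set holding tomorrow
```

    The first query returns three candidates and the second only one, which mirrors the trade-off described in the abstract: a coarser phonetic key gives better coverage at the cost of larger candidate sets for the selection stage to rank.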

    Corrupted queries in text retrieval

    In this paper, we propose two alternatives for dealing with degraded queries in Spanish Information Retrieval applications. The first is based on character n-grams and does not depend on the linguistic knowledge and resources available. In the second, we propose two spelling correction techniques, one of which relies on a stochastic model that must be trained beforehand from a PoS-tagged corpus. In order to study their validity, a testing framework has been designed and both approaches have been evaluated on it.
    Funding: This work has been partially funded by the Spanish Ministry of Education and Science and FEDER (through research projects HUM2007-66607-C04-02 and HUM2007-66607-C04-03), and by the Xunta de Galicia (through projects 05PXIC30501PN, 07SIN005206PR, INCITE07PXI104119ES and the "Red Gallega de PLN y RI").
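    The idea behind the character n-gram alternative can be sketched as follows. This is a minimal illustration, assuming padded character trigrams and a Dice overlap score; the abstract does not specify the n-gram size, the matching function or the indexing scheme, and the function names and toy vocabulary are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams, with one space of padding per side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_match(query, vocabulary, n=3):
    """Vocabulary term sharing the most n-grams with a possibly corrupted query."""
    return max(vocabulary, key=lambda term: dice(query, term, n))


# Hypothetical toy vocabulary; the query contains a transposition error.
vocabulary = ["informacion", "recuperacion", "consulta"]
print(best_match("infromacion", vocabulary))
```

    Because a typing error only disturbs the few n-grams that span it, most of the remaining n-grams still match the intended term, and no dictionary or tagger is needed.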

    On the performance of phonetic algorithms in microtext normalization

    © 2018. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article, Doval, Y., Vilares, M. and Vilares, J. (2018) 'On the performance of phonetic algorithms in microtext normalization', has been accepted for publication in Expert Systems with Applications, 113, pp. 213–222. The Version of Record is available online at https://doi.org/10.1016/j.eswa.2018.07.016
    [Abstract] User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, which makes their processing by traditional intelligent systems very difficult. In response, microtext normalization transforms those non-standard microtexts into standard well-written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non-standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so-called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that, taking as input the non-standard terms to be normalized, yield as output the smallest possible sets of normalization candidates that still contain the corresponding target standard words. As will be shown, the choice of phonetic algorithm depends heavily on the capabilities of the candidate selection mechanism usually found at the end of a microtext normalization pipeline. The faster it can make the right choices among sufficiently large sets of candidates, the more precision we can sacrifice in the phonetic algorithms in favour of coverage, in order to increase the overall performance of the normalization system.
    Funding: This research has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) through projects TIN2017-85160-C2-1-R, TIN2017-85160-C2-2-R, FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R, and by the Autonomous Government of Galicia through projects ED431D-2017/12, ED431B-2017/01 and ED431D R2016/046. Moreover, Yerai Doval is funded by the Spanish State Secretariat for Research, Development and Innovation (which belongs to MINECO) and by the European Social Fund (ESF) under an FPI fellowship (BES-2015-073768) associated with project FFI2014-51978-C2-1-R.

    Modeling of learning curves with applications to POS tagging

    We introduce an algorithm that estimates the evolution of learning curves over the whole of a training data base from the results obtained on a portion of it, using a functional strategy. We iteratively approximate the sought value at the desired time, independently of the learning technique used, once a point in the process called the prediction level has been passed. The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of stopping criterion and seems to be effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second is the comparison of efficiency between systems at training time, with the objective of completing training only for the one that best suits our requirements. The third is the customization of systems, for which the prediction of accuracy is a valuable item of information, since we can estimate in advance the impact of settings on both performance and development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations.
    Funding: Ministerio de Economía y Competitividad (Ref. FFI2014-51978-C2-1-)
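    To make the notion of extrapolating a learning curve from a portion of the training data concrete, here is a minimal sketch. It assumes, purely for illustration, that the error rate decays as a power law of the training set size, error(n) = b * n^(-c), and fits it by least squares in log-log space. The abstract does not state which functional model the authors use, and the function names and synthetic observations below are hypothetical.

```python
import math


def fit_power_law(sizes, accuracies):
    """Fit error(n) = b * n**(-c) by linear regression on log-log values.

    Assumes every accuracy is strictly below 1.0 and at least two distinct
    training sizes are given. Returns the pair (b, c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(1.0 - acc) for acc in accuracies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_accuracy(n, b, c):
    """Extrapolated accuracy at training size n under the fitted model."""
    return 1.0 - b * n ** (-c)


# Synthetic observations generated from a known curve (not real results).
sizes = [1000, 2000, 4000, 8000]
observed = [1.0 - 0.5 * n ** -0.3 for n in sizes]

b, c = fit_power_law(sizes, observed)
print(round(b, 3), round(c, 3))
print(predict_accuracy(100000, b, c))
```

    With a fitted curve of this kind, one can estimate how much additional data would be needed to reach a target accuracy, or compare two learners before training either of them on the full data base.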

    GALENA: tabular DCG parsing for natural languages

    [Abstract] We present a definite clause based parsing environment for natural languages, whose operational model is the dynamic interpretation of logical push-down automata. We attempt to briefly explain our design decisions in terms of a set of properties that practical natural language processing systems should incorporate. The aim is to show both the advantages and the drawbacks of our approach.
    Funding: España. Gobierno (HF96-36); Xunta de Galicia (XUGA10505B96, XUGA20402B9)

    Early stopping by correlating online indicators in neural networks

    In order to minimize the generalization error in neural networks, a novel technique to identify overfitting phenomena when training the learner is formally introduced. This supports a reliable and trustworthy early stopping condition, thus improving the predictive power of that type of modeling. Our proposal exploits the correlation over time in a collection of online indicators, namely characteristic functions indicating whether a set of hypotheses are met, associated with a range of independent stopping conditions built from a canary judgment to evaluate the presence of overfitting. In this way, we provide a formal basis for deciding when to interrupt the learning process. As opposed to previous approaches focused on a single criterion, we take advantage of subsidiarities between independent assessments, thus seeking both a wider operating range and greater diagnostic reliability. To illustrate the effectiveness of the halting condition described, we choose to work in the sphere of natural language processing, an operational continuum increasingly based on machine learning. As a case study, we focus on parser generation, one of the most demanding and complex tasks in the domain. The selection of cross-validation as a canary function enables an actual comparison with the most representative early stopping conditions based on overfitting identification, pointing to a promising start toward an optimal bias and variance control.
    Funding: Open access publication funded by Universidade de Vigo/CISUG. Agencia Estatal de Investigación (Ref. TIN2017-85160-C2-2-R, "AVANCES EN NUEVOS SISTEMAS DE EXTRACCION DE RESPUESTAS CON ANALISIS SEMANTICO Y APRENDIZAJE PROFUNDO"; Ref. PID2020-113230RB-C22, "SEQUENCE LABELING MULTITASK MODELS FOR LINGUISTICALLY ENRICHED NER: SEMANTICS AND DOMAIN ADAPTATION (SCANNER-UVIGO)"); Xunta de Galicia (Ref. ED431C 2018/5)
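    The general idea of combining several independent stopping indicators, rather than relying on a single criterion, can be sketched as below. This is a deliberate simplification: it replaces the correlation analysis described in the abstract with a plain quorum vote over two common validation-loss criteria. The indicator definitions, thresholds and loss values are all hypothetical, and validation losses are assumed to be positive.

```python
def generalization_loss(history, alpha=0.05):
    """1 if the latest validation loss exceeds the best so far by more than alpha."""
    return int(history[-1] > min(history) * (1 + alpha))


def no_improvement(history, patience=3):
    """1 if the last `patience` epochs failed to beat the earlier best loss."""
    if len(history) <= patience:
        return 0
    return int(min(history[-patience:]) >= min(history[:-patience]))


def should_stop(history, indicators, quorum):
    """Stop when at least `quorum` indicators fire on the current history."""
    return sum(indicator(history) for indicator in indicators) >= quorum


def stopping_epoch(val_losses, indicators, quorum):
    """First 1-based epoch at which the combined condition holds, else None."""
    for epoch in range(1, len(val_losses) + 1):
        if should_stop(val_losses[:epoch], indicators, quorum):
            return epoch
    return None


# Hypothetical validation losses: improvement, then overfitting sets in.
losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.70, 0.75, 0.80]
indicators = [generalization_loss, no_improvement]

print(stopping_epoch(losses, indicators, quorum=1))  # first single alarm
print(stopping_epoch(losses, indicators, quorum=2))  # both indicators agree
```

    Requiring agreement between indicators delays the halt by one epoch in this toy run, trading a little extra training for protection against a single noisy criterion.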

    Surfing the modeling of pos taggers in low-resource scenarios

    The recent trend toward the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to be still competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so for domains where training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of PoS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.
    Funding: Ministerio de Ciencia e Innovación (Ref. PID2020-113230RB-C21, PID2020-113230RB-C22); Xunta de Galicia (Ref. ED431C 2020/1)

    A tagger environment for Galician

    [Abstract] In this paper, we introduce a tagger environment for Galician, the native language of Galicia. Galician belongs to the group of Romance languages which developed from the Latin imposed on the north-west of the Iberian Peninsula by the Romans, with additions from the languages of the peoples living there before the colonization, as well as contributions from other languages subsequent to the break-up of the Roman Empire. Various historical circumstances prevented it from becoming a State language, and although it was relegated to informal usage, our vernacular managed to survive well into the twentieth century when, parallel to the recovery of the institutions of self-government, Galician was once again granted the status of official language of Galicia, together with Spanish. From an operational point of view, our proposal is based on the notion of finite automaton, separating the execution strategy from the implementation of the tagging interpreter. This facilitates maintenance while ensuring the robustness of the architecture. Empirical tests prove the validity of our approach for dealing with a language whose morphology is non-trivial.
    Funding: España. Gobierno (HF97-223); Xunta de Galicia (XUGA10505B96, XUGA20402B97)

    Une approche formelle pour la génération d'analyseurs de langages naturels

    [Abstract] An efficient parsing and tagging process is decisive in building analysis structures for natural languages. This paper introduces a programming environment for implementing the formal support of natural languages from two points of view, parsing and tagging. The parsing problem is posed in the domain of unrestricted context-free grammars, and the tagging problem in the context of non-deterministic finite automata. The parser takes as input an arbitrary text, following the structure specified by a context-free grammar. The structure of the resulting shared forest is studied with respect to the optimization of syntactic sharing, so as to favour the elimination of ambiguities during semantic processing. Finite-state automata are used as the operational formalism to tag corpora efficiently, especially for languages other than English, for which morphological analysis is of greater relevance. Both activities, parsing and tagging, are integrated in a single tool named Galena (for Generador de Analizadores para Lenguages Naturales), which provides incrementality as a feature favouring the reusability of components from a software engineering point of view.
    Funding: Xunta de Galicia (XUGA10501A9)