    Surfing the modeling of POS taggers in low-resource scenarios

    The recent trend toward the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to remain competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task for boosting performance at reasonable cost, all the more so in domains where training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of POS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.
    Funding: Ministerio de Ciencia e Innovación | Ref. PID2020-113230RB-C21; Ministerio de Ciencia e Innovación | Ref. PID2020-113230RB-C22; Xunta de Galicia | Ref. ED431C 2020/1

    Exploiting Transitivity in Probabilistic Models for Ontology Learning

    Capturing word meaning is one of the challenges of natural language processing (NLP). Formal models of meaning such as ontologies are knowledge repositories used in a variety of applications. To be used effectively, these ontologies have to be large or, at least, adapted to specific domains. Our main goal is to contribute practically to research on ontology learning models by covering different aspects of the task. We propose probabilistic models for learning ontologies that expand existing ontologies, taking into account both corpus-extracted evidence and the structure of the generated ontologies. The models exploit structural properties of target relations, such as transitivity, during learning. We then propose two extensions of our probabilistic models: a model for learning from a generic domain that can be exploited to extract new information in a specific domain, and an incremental ontology learning system that puts human validation in the learning loop. The latter provides a graphical user interface and a human-computer interaction workflow supporting the incremental learning loop.
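
    To make the role of transitivity concrete, the following is a minimal sketch and not the authors' model. It assumes that each candidate is-a pair has a direct probability estimated from corpus evidence, and that evidence from two-step paths is combined with the direct estimate by a noisy-or. The function name `induced_probability`, the noisy-or combination, and the toy probabilities are all illustrative assumptions.

```python
def induced_probability(direct, i, j, nodes):
    """Combine the direct estimate for (i, j) with two-step paths i -> k -> j.

    Assumption (not from the abstract): sources of evidence are treated as
    independent and merged with a noisy-or.
    """
    # Probability that the direct evidence does NOT support the relation.
    miss = 1.0 - direct.get((i, j), 0.0)
    for k in nodes:
        if k == i or k == j:
            continue
        # A path through k supports (i, j) only if both of its links hold.
        path = direct.get((i, k), 0.0) * direct.get((k, j), 0.0)
        miss *= 1.0 - path
    return 1.0 - miss


# Hypothetical direct probabilities for a tiny is-a fragment.
direct = {
    ("dog", "mammal"): 0.9,
    ("mammal", "animal"): 0.8,
    ("dog", "animal"): 0.3,
}
nodes = ["dog", "mammal", "animal"]
p_dog_animal = induced_probability(direct, "dog", "animal", nodes)
```

    In this toy case the weak direct estimate for ("dog", "animal") is raised by the strong path through "mammal", which is the kind of structural evidence a transitive relation makes available.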

    Investigating the Effectiveness of Semantic Tagging in Sense Disambiguation of Specialized Homographs from the Perspective of F-Measure in Retrieving Scientific Texts

    The aim of this study was to explain how text corpus tagging can be applied to the sense disambiguation of specialized homographs and to increasing the retrieval F-measure of scientific texts containing such homographs. This is an experimental study. Specialized homographs were identified by direct observation and morphological analysis of the words. The research sample consisted of 442 scientific articles divided into an experimental group and a control group. The control group comprised 221 full-text articles without tags, and the experimental group comprised the same 221 articles with tags; both were tested in the information retrieval system to measure the effectiveness of tagging in the word sense disambiguation of specialized homographs. The Wilcoxon signed-rank test showed that the F-measure of the retrieval results for specialized homographs differed significantly after the tagged specialized text corpus was used in the information retrieval system. Examination of the negative and positive ranks showed that the F-measure increased significantly after tagging and reached its maximum value of 1. The findings show that recall and precision are not necessarily inversely related and that both can reach their maximum value of 1. The improved efficiency of the retrieval system under this approach stems from its greater ability to distinguish between specialized homographs and to identify their semantic roles, using semantic tags as training data in both the training and test sets. Embedding the training set in the structure of the retrieval system gives the system additional information for distinguishing between the various meanings of specialized homographs. This tool is one of the elements that improve retrieval quality and move the information retrieval system from word-driven to content-driven retrieval when retrieving texts that contain specialized homographs.
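
    As a reminder of how the F-measure relates precision and recall, here is a short sketch using the standard F-beta definition. The query counts are hypothetical and are not taken from the study; they only illustrate how removing irrelevant homograph matches can raise precision without lowering recall.

```python
def f_measure(relevant_retrieved, retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) from raw retrieval counts."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical query on a homograph with 4 truly relevant articles.
# Untagged corpus: 10 articles retrieved, 4 of them relevant.
before = f_measure(relevant_retrieved=4, retrieved=10, relevant=4)
# Tagged corpus: only the 4 relevant articles are retrieved.
after = f_measure(relevant_retrieved=4, retrieved=4, relevant=4)
```

    With these made-up counts, recall is 1 in both cases, while precision and therefore the F-measure reach 1 only once the off-sense matches are filtered out.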

    A literature survey of active machine learning in the context of natural language processing

    Active learning is a supervised machine learning technique in which the learner is in control of the data used for learning. The learner uses that control to ask an oracle, typically a human with extensive knowledge of the domain at hand, about the classes of the instances for which the model learned so far makes unreliable predictions. The active learning process takes as input a set of labeled examples, as well as a larger set of unlabeled examples, and produces a classifier and a relatively small set of newly labeled data. The overall goal is to create as good a classifier as possible without having to mark up and supply the learner with more data than necessary. The learning process aims to keep the human annotation effort to a minimum, asking for advice only where the training utility of the result of such a query is high. Active learning has been successfully applied to a number of natural language processing tasks, such as information extraction, named entity recognition, text categorization, part-of-speech tagging, parsing, and word sense disambiguation. This report is a literature survey of active learning from the perspective of natural language processing.
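
    The loop described above can be sketched as pool-based uncertainty sampling, one common query strategy among the several that such surveys cover. The example is a toy under stated assumptions: instances are numbers, the learner is a one-dimensional threshold classifier, and the oracle is a hypothetical function standing in for the human annotator.

```python
def fit_threshold(labeled):
    """Place the decision threshold midway between the two classes seen so far."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learn(seed, pool, oracle, budget):
    """Query the oracle only for the instance the current model is least sure of."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        # Uncertainty sampling: the instance closest to the decision boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), labeled


def oracle(x):
    # Hypothetical annotator: the true class boundary is at 0.62.
    return 1 if x >= 0.62 else 0


seed = [(0.0, 0), (1.0, 1)]              # small initial labeled set
pool = [i / 100 for i in range(1, 100)]  # larger unlabeled pool
threshold, labeled = active_learn(seed, pool, oracle, budget=10)
```

    With only ten queries out of 99 unlabeled instances, the learned threshold falls between the two pool items that straddle the true boundary, which illustrates why asking only about uncertain instances keeps annotation effort low.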

    Modeling of learning curves with applications to POS tagging

    An algorithm is introduced to estimate the evolution of learning curves over the whole of a training database, based on the results obtained from a portion of it and using a functional strategy. We iteratively approximate the sought value at the desired time, independently of the learning technique used, once a point in the process called the prediction level has been passed. The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of stopping criterion and seems to be effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second relates to the comparison of efficiency between systems at training time, with the objective of completing this task only for the one that best suits our requirements. Third, the prediction of accuracy is also a valuable item of information for customizing systems, since we can estimate in advance the impact of settings on both the performance and the development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations.
    Funding: Ministerio de Economía y Competitividad | Ref. FFI2014-51978-C2-1-
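
    The general idea of extrapolating a learning curve from a partial run can be illustrated with a short sketch. It is not the algorithm of the paper: the abstract does not state the functional form, so an inverse power law acc(n) = c - a * n^(-b) is assumed here, fitted by a grid search over the asymptote c combined with a least-squares line in log-log space. The function names, the grid, and the synthetic observations are all illustrative.

```python
import math


def fit_power_law(sizes, accs, c_grid):
    """Fit acc(n) ~ c - a * n**(-b) to observed (size, accuracy) pairs.

    For each candidate asymptote c, log(c - acc) is linear in log(n),
    so a and b follow from ordinary least squares.
    """
    best = None
    xs = [math.log(n) for n in sizes]
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    for c in c_grid:
        if c <= max(accs):
            continue  # the asymptote must lie above every observation
        ys = [math.log(c - acc) for acc in accs]
        my = sum(ys) / len(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        a, b = math.exp(my - slope * mx), -slope
        err = sum((c - a * n ** (-b) - acc) ** 2 for n, acc in zip(sizes, accs))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(params, n):
    """Accuracy predicted by the fitted curve at training size n."""
    a, b, c = params
    return c - a * n ** (-b)


# Synthetic observations from the first 16,000 training examples only.
true_params = (0.9, 0.5, 0.97)
sizes = [1000, 2000, 4000, 8000, 16000]
accs = [predict(true_params, n) for n in sizes]

c_grid = [0.90 + 0.001 * i for i in range(101)]
params = fit_power_law(sizes, accs, c_grid)
estimate = predict(params, 100000)  # extrapolated accuracy on the full set
```

    Comparing such extrapolated values across candidate taggers, or checking how close the current accuracy is to the fitted asymptote c, corresponds to the uses outlined above: anticipating accuracy gain, comparing systems during training, and deciding when further training is no longer worthwhile.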

    Doctor of Philosophy

    Domain adaptation of natural language processing systems is challenging because it requires human expertise. While manual effort is effective in creating a high quality knowledge base, it is expensive and time consuming. Clinical text adds another layer of complexity to the task due to privacy and confidentiality restrictions that hinder the ability to share training corpora among different research groups. Semantic ambiguity is a major barrier for effective and accurate concept recognition by natural language processing systems. In my research I propose an automated domain adaptation method that utilizes sublanguage semantic schema for all-word word sense disambiguation of clinical narrative. According to the sublanguage theory developed by Zellig Harris, domain-specific language is characterized by a relatively small set of semantic classes that combine into a small number of sentence types. Previous research relied on manual analysis to create language models that could be used for more effective natural language processing. Building on previous semantic type disambiguation research, I propose a method of resolving semantic ambiguity utilizing automatically acquired semantic type disambiguation rules applied on clinical text ambiguously mapped to a standard set of concepts. This research aims to provide an automatic method to acquire Sublanguage Semantic Schema (S3) and apply this model to disambiguate terms that map to more than one concept with different semantic types. The research is conducted using unmodified MetaMap version 2009, a concept recognition system provided by the National Library of Medicine, applied on a large set of clinical text. The project includes creating and comparing models, which are based on unambiguous concept mappings found in seventeen clinical note types. The effectiveness of the final application was validated through a manual review of a subset of processed clinical notes using recall, precision and F-score metrics.