    Domain adaptation with minimal training

    The performance of a machine learning model trained on labeled data from one (source) domain degrades severely when it is tested on a different (target) domain. Traditional approaches deal with this problem by training a new model for every target domain. In natural language processing, top-performing systems often use multiple interconnected models, so training all of them for every target domain is computationally expensive. Moreover, retraining the model for the target domain requires access to the labeled data from the source domain, which may not be available to end users due to copyright issues. This thesis studies how to adapt to a target domain using the system trained on the source domain, avoiding both the cost of retraining and the need for access to the source labeled data. It identifies two key ingredients for adaptation without training: broad-coverage resources and constraints. We show how resources such as Wikipedia, VerbNet and WordNet, which offer comprehensive coverage of entities, semantic roles and words in English, can help a model adapt to the target domain. For the task of semantic role labeling, we show that in the decision phase we can replace a linguistic unit (e.g. a verb or word) with an equivalent linguistic unit from the same cluster defined in these resources (e.g. VerbNet, WordNet), so that after replacement the text becomes more like the text on which the model was trained. We show that the model's output is more accurate on the transformed text than on the original text. In another instance, we show how to use a system that links mentions to Wikipedia concepts to adapt a named entity recognition system. Since Wikipedia has broad domain coverage, the linking system is robust across domain variations. Therefore, jointly performing entity recognition and linking improves the accuracy of entity recognition on the target domain without requiring a new system to be trained for the new domain.
In all cases, we show how to use intuitive constraints to guide the model into making coherent predictions. We show how incorporating prior knowledge about a new domain as declarative constraints in the decision phase can improve the performance of a model on the new domain. When such prior knowledge is unavailable, we show how to acquire knowledge automatically from unlabeled text from the new domain and from domains similar to both the source and target domains.
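
The replacement idea described in this abstract can be illustrated with a minimal sketch. It is not the thesis's actual method: the clusters below are toy stand-ins for VerbNet or WordNet classes, and the function names (`build_cluster_index`, `transform`) and the "most frequent seen cluster-mate" selection rule are assumptions made purely for illustration.

```python
from collections import Counter

# Toy clusters standing in for VerbNet/WordNet classes (hypothetical, not real entries).
CLUSTERS = [
    {"buy", "purchase", "acquire"},
    {"say", "state", "declare"},
]

def build_cluster_index(clusters):
    """Map each word to the cluster (a frozenset) that contains it."""
    index = {}
    for cluster in clusters:
        frozen = frozenset(cluster)
        for word in frozen:
            index[word] = frozen
    return index

def transform(tokens, source_counts, cluster_index):
    """Replace tokens unseen in the source domain with a seen cluster-mate."""
    result = []
    for token in tokens:
        seen = source_counts.get(token, 0) > 0
        if seen or token not in cluster_index:
            result.append(token)  # keep tokens that are known or have no cluster
            continue
        # Candidates: cluster members that occur in the source-domain text.
        candidates = [w for w in cluster_index[token] if source_counts.get(w, 0) > 0]
        if candidates:
            # Assumed rule: most frequent candidate, ties broken alphabetically.
            result.append(min(candidates, key=lambda w: (-source_counts[w], w)))
        else:
            result.append(token)
    return result

# Word counts from a (toy) source-domain corpus.
source_counts = Counter("investors buy shares and analysts say prices rise".split())
index = build_cluster_index(CLUSTERS)
print(transform("funds acquire shares".split(), source_counts, index))
# prints ['funds', 'buy', 'shares']
```

Here "acquire" is unseen in the toy source corpus, so it is swapped for its cluster-mate "buy" before the source-trained model would be applied; "funds" has no cluster and is left unchanged.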

    Semi-supervised method for biomedical event extraction

    Introduction. In Colombia, malaria represents a serious public health problem. It is estimated that approximately 60% of the population is at risk of the disease. Objective. To describe the mortality trends for malaria in Colombia from 1979 to 2008. Materials and methods. A descriptive study was carried out to determine trends in malaria mortality. The information sources were the databases of registered deaths and the population projections for 1979 to 2008 from the National Statistics Department (DANE). The indicator used was the mortality rate. The trend was analyzed by joinpoint regression. Results. A total of 6,965 deaths caused by malaria were certified, for an age-adjusted rate of 0.74 deaths per 100,000 inhabitants over the study period. In 74.3% of the deaths, the parasite species was not specified. The mortality rate showed a statistically significant downward trend, which was smaller from the second half of the 1990s than in the 1980s. Conclusions. The magnitude of malaria mortality in Colombia is not high, despite evident underreporting. A marked downward trend was observed between 1979 and 2008. The information obtained from death certificates, together with that from the public health surveillance system, will make it possible to modify recommendations and improve the implementation of preventive and control measures to further reduce malaria mortality.

    Semi-automatic conversion of BioProp semantic annotation to PASBio annotation

    Background. Semantic role labeling (SRL) is an important text analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS). Each PAS is composed of a predicate (verb) and several arguments (noun phrases, adverbial phrases, etc.) with different semantic roles, including main arguments (agent or patient) as well as adjunct arguments (time, manner, or location). PropBank is the most widely used PAS corpus and annotation format in the newswire domain. In the biomedical field, however, more detailed and restrictive PAS annotation formats such as PASBio are popular. Unfortunately, due to the lack of an annotated PASBio corpus, no publicly available machine-learning (ML) SRL systems based on PASBio have been developed. In previous work, we constructed a biomedical corpus based on the PropBank standard, called BioProp, on which we developed an ML-based SRL system, BIOSMILE. In this paper, we aim to build a system to convert BIOSMILE's BioProp annotation output to PASBio annotation. Our system consists of BIOSMILE in combination with a BioProp-PASBio rule-based converter and an additional semi-automatic rule generator. Results. Our first experiment evaluated our rule-based converter's performance independently of BIOSMILE's performance. The converter achieved an F-score of 85.29%. The second experiment evaluated the combined system (BIOSMILE + rule-based converter). The system achieved an F-score of 69.08% for PASBio's 29 verbs. Conclusion. Our approach allows PAS conversion between BioProp and PASBio annotation using BIOSMILE alongside our newly developed semi-automatic rule generator and rule-based converter. Our system can match the performance of other state-of-the-art domain-specific ML-based SRL systems and can be easily customized for PASBio application development.