
    Semi-Supervised Event Extraction with Paraphrase Clusters

    Supervised event extraction systems are limited in their accuracy by the lack of available training data. We present a method for self-training event extraction systems by bootstrapping additional training data, taking advantage of the fact that the same event instances are mentioned repeatedly across newswire articles from multiple sources. If our system can make a high-confidence extraction for some mentions in such a cluster, it can acquire diverse training examples by adding the other mentions as well. Our experiments show significant performance improvements for multiple event extractors on the ACE 2005 and TAC-KBP 2015 datasets. Comment: NAACL 201
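    The bootstrapping loop described above can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical extractor interface (.predict, .retrain), a cluster_mentions helper, and a 0.9 confidence cutoff; none of these come from the paper itself.

```python
# A minimal sketch of the self-training loop, assuming a hypothetical
# extractor interface (.predict, .retrain) and a cluster_mentions
# helper; the 0.9 cutoff is also an assumption, not the paper's value.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9  # assumed "high-confidence" cutoff

@dataclass
class Prediction:
    event_type: str
    confidence: float

def self_train(extractor, articles, cluster_mentions, rounds=3):
    # cluster_mentions(articles) yields groups of mentions that report
    # the same event instance across different newswire sources.
    for _ in range(rounds):
        new_examples = []
        for cluster in cluster_mentions(articles):
            preds = [extractor.predict(m) for m in cluster]
            confident = [p for p in preds
                         if p.confidence >= CONFIDENCE_THRESHOLD]
            if confident:
                # Project the confident label onto every mention in the
                # cluster, harvesting diverse surface forms as new data.
                label = confident[0].event_type
                new_examples.extend((m, label) for m in cluster)
        extractor = extractor.retrain(extra_data=new_examples)
    return extractor
```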

    Slot Filling

    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text and populating a knowledge base (KB) with these facts. Such structured KBs enable applications like structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter of SF system performance. We contribute an analysis of typical SF recall loss and find that a substantial amount of loss occurs early in the SF pipeline, confirming that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique and find that only 39% of results are reachable using a typical feature space. Expecting this graph-based technique to be directly useful for extraction, we frame SF as a label propagation task. We focus on a detailed graph representation of the task that reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, which is a major concern for propagation. While some conflicts are caused by a lack of sufficient disambiguating context (we explore adding contextual features to address this), many are caused by subtle annotation problems. We find that the lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult; applying a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the missing definition of explicitness: annotation schemas do not specify how explicit expressions of relations need to be, leaving large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in annotator world knowledge and on their thresholds for making probabilistic inferences. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
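    As a toy illustration of the label propagation framing (plain scikit-learn propagation, not the thesis's modified process with multiple types of label interaction), a handful of labeled candidate fills can propagate labels to unlabeled ones:

```python
# Toy example: candidate (query, filler) pairs embedded in a feature
# space; -1 marks unlabeled candidates whose labels are to be inferred.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.85, 0.2], [0.5, 0.6]])
y = np.array([1, -1, 0, -1, -1])  # 1 = correct fill, 0 = incorrect

model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y)
print(model.transduction_)  # propagated labels for all candidates
```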

    Novel Event Detection and Classification for Historical Texts

    Event processing is an active area of research in the Natural Language Processing community, but the resources and automatic systems developed so far have mainly addressed contemporary texts. However, recognizing and elaborating events is a crucial step when dealing with historical texts, particularly in the current era of massive digitization of historical sources: research in this domain can lead to methodologies and tools that assist historians in their work, while also having an impact on the field of Natural Language Processing. Our work aims at shedding light on the complex concept of events in historical texts. More specifically, we introduce new annotation guidelines for event mentions and types, categorised into 22 classes. We then annotate a historical corpus accordingly and compare two approaches for automatic event detection and classification following this novel scheme. We believe that this work can foster research in a field of inquiry so far underexplored in the area of Temporal Information Processing. To this end, we release the new annotation guidelines, the corpus and new models for automatic annotation.
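    One plausible way to operationalize event detection and classification under a 22-class scheme is BIO-style token classification over a pretrained transformer. This is only a sketch: the model name and label layout below are assumptions, not the two approaches actually compared in the paper.

```python
# Sketch: 22 event classes encoded as B-/I- tags plus O.
from transformers import AutoTokenizer, AutoModelForTokenClassification

NUM_EVENT_CLASSES = 22
num_labels = 2 * NUM_EVENT_CLASSES + 1  # B-/I- per class, plus O

model_name = "bert-base-multilingual-cased"  # assumed; not from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=num_labels
)

text = "The duke's army besieged the city in 1427."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits   # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)      # per-token label ids (untrained here)
```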

    Improving the Precision of Relation Extraction Systems Using a Generic Machine-Learning Filter

    Relation extraction contributes to enhanced semantic search, that is, search based on understanding the meaning of query terms. While traditional information retrieval is mainly focused on keywords, relation extraction opens a whole range of possibilities by identifying the links between entities, allowing unstructured information to be transformed into structured information. Knowledge bases such as Google Knowledge Graph and DBpedia provide more precise and more direct access to information. Slot filling, which consists in populating a knowledge base from text, has been a very active task in recent years and has been the subject of several evaluation campaigns assessing the ability to automatically extract predefined relations from a document corpus. Despite some progress, the results of these competitions remain modest. In this thesis, we focus on the English slot filling track of the TAC KBP 2013 evaluation campaign. This track targets the extraction of 41 predefined relations based on Wikipedia infoboxes (e.g. title, date of birth, countries of residence) related to specific named entities (persons and organizations). A named entity (the query entity) and a relation are submitted to a system (a relation extractor), which must automatically find, within a corpus of more than two million documents, every entity linked to the query entity by the given relation, and must return a textual segment justifying the relation. This thesis presents a machine learning filter whose main objective is to improve the precision of relation extractors while minimizing the impact on recall. Our approach consists in filtering the output of relation extractors using a binary classifier. The filter is appended to the end of the relation extractor's pipeline, so it can easily be tested and operated on any system. Another objective of this research is to identify the most important features for the filtering step. Our classifier is based on a wide array of features, including statistical, lexical, morphosyntactic, syntactic and semantic features, extracted mostly from the justification sentences submitted by the systems. We also present an efficient method for extracting the most frequent patterns (e.g. part-of-speech categories, syntactic dependencies) between the query and the answer within the justification sentence, from which we derive boolean features indicating the presence of such patterns. The features used to train our classifiers are mostly generic, so our method can be used to classify any predefined relation. We tested the filter on 14 systems that participated in the English slot filling track of the TAC KBP 2013 campaign. The filter improved precision for every tested system. Our results also show that the filter improves the precision of the best system by more than 20 percentage points and improves the F1-score for 20 relations.
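    A minimal sketch of the filtering idea follows: a binary classifier scores each extractor response from features of its justification sentence, and low-scoring responses are dropped. The feature set, response format, and threshold here are illustrative stand-ins for the thesis's much richer feature space.

```python
# A sketch of the filter: a binary classifier over simple features of
# each response's justification sentence. The feature set and response
# format are illustrative stand-ins for the thesis's richer features
# (statistical, lexical, morphosyntactic, syntactic, semantic).
from sklearn.ensemble import RandomForestClassifier

def featurize(response):
    sent = response["justification"]
    return [
        len(sent.split()),                # sentence length
        sent.count(","),                  # rough clause-complexity proxy
        int(response["filler"] in sent),  # filler appears verbatim
        abs(sent.find(response["query"]) - sent.find(response["filler"])),
    ]

def train_filter(labeled_responses):
    # labeled_responses: dicts with query/filler/justification plus a
    # boolean "correct" field from assessor judgments (both classes
    # must be present for predict_proba below to have two columns).
    X = [featurize(r) for r in labeled_responses]
    y = [r["correct"] for r in labeled_responses]
    return RandomForestClassifier(n_estimators=200).fit(X, y)

def apply_filter(clf, responses, threshold=0.5):
    # Drop responses the classifier deems likely wrong, trading a small
    # recall loss for a precision gain.
    return [r for r in responses
            if clf.predict_proba([featurize(r)])[0][1] >= threshold]
```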

    Phylogenetic, Genomic and Morphological Investigations of Three Lance Nematode Species (Hoplolaimus spp.)

    Lance nematodes (Hoplolaimus spp.) are migratory ecto-endoparasites of plants. They are found across much of the world, feed on the roots of a diversity of monocotyledonous and dicotyledonous plants, and cause great agricultural damage. Since more taxonomic knowledge and molecular references are needed for lance nematode phylogeny and population studies, four chapters of research on three lance nematode species are presented here: (1) A new species, Hoplolaimus smokyensis n. sp., was discovered in a mixed forest sample of maple (Acer sp.), hemlock (Tsuga sp.) and silverbell (Halesia carolina) from the Great Smoky Mountains National Park. It is characterized by a lateral field with four incisures, an excretory pore posterior to the hemizonid, esophageal glands with three nuclei, phasmids anterior and posterior to the vulva, and the absence of an epiptygma. Phylogenetic analyses based on ribosomal and mitochondrial gene sequences also suggest that H. smokyensis n. sp. is an independent lineage distinct from all other reported Hoplolaimus species. (2) Additional morphological characteristics of Hoplolaimus columbus were described, with photographs of its esophageal gland cell nuclei, an H. columbus male, and abnormal female tails. (3) The first complete de novo assembly of the mitochondrial genome of H. columbus, produced using whole-genome amplification and Illumina MiSeq sequencing, was reported as a circularized DNA of 25,228 bp. Annotations under two genetic codes were diagnosed and compared, and the phylogenetic relationships, gene content and gene order of 92 nematode taxa, including H. columbus, were analyzed. (4) The phylogenetic informativeness of mitochondrial genes across the phylum Nematoda was analyzed with two quantitative methods using the mitochondrial genomes of 93 nematode species, including H. columbus and H. galeatus. Results from the two methods agree, indicating that nad5 and nad4 are more informative than the other candidate genes, that traditional markers such as cox1 and cytb are of medium informativeness, and that nad4l and nad3 are the least informative of the protein-coding genes. The results also indicate that phylogenetic informativeness is independent of the sequence length of a phylogenetic marker, and that a concatenated multi-gene marker can offer better phylogenetic informativeness if highly informative genes are selected.

    Entity Linking in Low-Annotation Data Settings

    Recent advances in natural language processing have focused on applying and adapting large pretrained language models to specific tasks. These models, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020a), are pretrained on massive amounts of unlabeled text across a variety of domains. The impact of these pretrained models is visible in the task of entity linking, where a mention of an entity in unstructured text is matched to the relevant entry in a knowledge base. State-of-the-art linkers, such as Wu et al. (2020) and De Cao et al. (2021), leverage pretrained models as a foundation for their systems, but they are also trained on large amounts of annotated data, which is crucial to their performance. These large datasets typically consist of domains that are easily annotated, such as Wikipedia or newswire text, yet tailoring NLP tools to such a narrow range of textual domains severely restricts their use in the real world. Many other domains, such as medicine or law, do not have large amounts of entity linking annotations available, even though entity linking, which bridges the gap between massive amounts of unstructured text and structured repositories of knowledge, is equally crucial there. Tools trained on newswire or Wikipedia annotations are unlikely to be well suited for identifying medical conditions mentioned in clinical notes, and since most annotation efforts focus on English, similar challenges arise in building systems for non-English text. Because only a relatively small amount of annotated data is available in these domains, we must often look to other types of domain-specific data, such as unannotated text or highly curated structured knowledge bases. In these settings, it is crucial to translate lessons from tools tailored to high-annotation domains into algorithms suited to low-annotation domains, which requires both leveraging broader types of data and understanding the unique challenges present in each domain.
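    As a toy illustration of the bi-encoder style of linker mentioned above (in the spirit of Wu et al., 2020, though not their implementation), a sentence encoder can embed mentions and KB entries into a shared space and link by nearest neighbour. The model choice and the miniature clinical KB below are assumptions for this sketch.

```python
# Toy bi-encoder linker: embed the mention in context and each KB entry
# in the same space, then link by highest cosine similarity. The model
# name and the miniature clinical KB are assumptions for this sketch.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

kb = {
    "C1": "myocardial infarction: necrosis of heart muscle due to ischemia",
    "C2": "migraine: recurrent moderate-to-severe headache disorder",
}
entity_vecs = model.encode(list(kb.values()), normalize_embeddings=True)

mention = "The patient was admitted after a suspected heart attack."
mention_vec = model.encode([mention], normalize_embeddings=True)

scores = mention_vec @ entity_vecs.T            # cosine similarities
best_id = list(kb.keys())[int(np.argmax(scores))]
print(best_id)  # expected: C1
```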