
    MIsA: multilingual 'IsA' extraction from Corpora


    Mistakes in medical ontologies: Where do they come from and how can they be detected?

    We present the details of a methodology for quality assurance in large medical terminologies and describe three algorithms that can help terminology developers and users identify potential mistakes. The methodology is based in part on linguistic criteria and in part on logical and ontological principles governing sound classifications. We conclude by outlining the results of applying the methodology in the form of a taxonomy of the different types of errors and potential errors detected in SNOMED-CT.
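    To make the flavour of such linguistic checks concrete, here is a minimal sketch in Python (my own illustration, not one of the paper's three algorithms): it flags IS-A links in which the child's term does not lexically extend the parent's term, a pattern that often signals a questionable classification in terminologies such as SNOMED-CT.

```python
# Illustrative sketch of a linguistic QA heuristic for IS-A links.
# Assumption: a well-formed child term often contains the parent term
# as a substring (e.g. "viral pneumonia" IS-A "pneumonia"); links that
# violate this are merely candidates for manual review, not errors.

def flag_suspect_isa_links(isa_pairs):
    """Yield (child, parent) pairs whose names violate the heuristic.

    isa_pairs: iterable of (child_term, parent_term) strings.
    """
    for child, parent in isa_pairs:
        # Normalize case for a crude lexical comparison.
        if parent.lower() not in child.lower():
            yield child, parent

if __name__ == "__main__":
    pairs = [
        ("viral pneumonia", "pneumonia"),           # passes the heuristic
        ("diabetes mellitus", "disorder of foot"),  # flagged for review
    ]
    for child, parent in flag_suspect_isa_links(pairs):
        print(f"review: '{child}' IS-A '{parent}'")
```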

    Detecting Dissimilar Classes of Source Code Defects

    Software maintenance accounts for the majority of software development cost and effort, with its major activities focused on the detection, location, analysis, and removal of defects present in the software. Although software defects can originate, and be present, at any phase of the software development life-cycle, the implementation (i.e., source code) contains more than three-fourths of the total defects. Due to the diverse nature of these defects, their detection and analysis have to be carried out by equally diverse tools, often necessitating the application of multiple tools for reasonable defect coverage, which directly increases maintenance overhead. Unified detection tools are known to combine different specialized techniques into a single, massive core, resulting in operational difficulty and increased maintenance cost. The objective of this research was to find a technique that can detect dissimilar defects using a simplified model and a single methodology, both of which should contribute to an easy-to-acquire solution. Following this goal, a ‘Supervised Automation Framework’ named FlexTax was developed for semi-automatic defect mapping and taxonomy generation; it was then applied to a large-scale real-world defect dataset to generate a comprehensive defect taxonomy, which was verified using machine learning classifiers and manual verification. This taxonomy, along with an extensive literature survey, was used to understand the properties of different classes of defects and to develop defect similarity metrics. The taxonomy and the similarity metrics were then used to develop a defect detection model and associated techniques, collectively named Symbolic Range Tuple Analysis (SRTA). SRTA relies on symbolic analysis, path summarization, and range propagation to detect dissimilar classes of defects using a simplified set of operations. To verify the effectiveness of the technique, SRTA was evaluated by processing multiple real-world open-source systems, by direct comparison with three state-of-the-art tools, by a controlled experiment, by using an established benchmark, by comparison with other tools through secondary data, and by a large-scale fault-injection experiment conducted using a mutation-injection framework, which relied on the taxonomy developed earlier for the definition of mutation rules. Experimental results confirmed SRTA’s practicality, generality, scalability, and accuracy, and demonstrated its applicability as a new defect detection technique.
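    The range-propagation idea at SRTA's core can be illustrated with a small sketch (a simplification under my own assumptions; the thesis's actual range tuples and path summaries are richer): each variable is tracked as an interval along a path, intervals are updated by interval arithmetic, and an array access is flagged when the index interval can fall outside the array bounds.

```python
# Minimal sketch of interval-based range propagation along one path.
# Assumption: variables map to (lo, hi) bounds; this is my simplified
# illustration of the general technique, not the thesis's exact model.

def check_index(ranges, var, length):
    """Return a warning if `var` may index outside an array of `length`."""
    lo, hi = ranges[var]
    if lo < 0 or hi >= length:
        return f"possible out-of-bounds: {var} in [{lo}, {hi}], array length {length}"
    return None

# Simulated path summary: i starts in [0, 9], then the path executes i += 5.
ranges = {"i": (0, 9)}
lo, hi = ranges["i"]
ranges["i"] = (lo + 5, hi + 5)   # interval arithmetic for i += 5

warning = check_index(ranges, "i", 10)  # access a[i] where len(a) == 10
if warning:
    print(warning)   # i may reach 14 >= 10, so the access is flagged
```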

    Knowledge extraction from fictional texts

    Knowledge extraction from text is a key task in natural language processing, involving many sub-tasks such as taxonomy induction, named entity recognition and typing, relation extraction, and knowledge canonicalization. By constructing structured knowledge from natural language text, knowledge extraction becomes a key asset for search engines, question answering, and other downstream applications. However, current knowledge extraction methods mostly focus on prominent real-world entities, with Wikipedia and mainstream news articles as sources. The constructed knowledge bases therefore lack information about long-tail domains, with fiction and fantasy as archetypes. Fiction and fantasy are core parts of human culture, spanning literature, movies, TV series, comics, and video games. With thousands of fictional universes in existence, knowledge from fictional domains is the subject of search-engine queries, by fans as well as by cultural analysts. Unlike the real-world domain, knowledge extraction in specific domains such as fiction and fantasy has to tackle several key challenges:
    - Training data: Sources for fictional domains mostly come from books and fan-built content, which is sparse and noisy, and contains difficult text structures such as dialogues and quotes. Training data for key tasks such as taxonomy induction, named entity typing, or relation extraction is also not available.
    - Domain characteristics and diversity: Fictional universes can be highly sophisticated, containing entities, social structures, and sometimes languages that are completely different from the real world. State-of-the-art methods for knowledge extraction make assumptions about entity-class, subclass, and entity-entity relations that are often invalid for fictional domains. With different genres of fictional domains, another requirement is to transfer models across domains.
    - Long fictional texts: While state-of-the-art models have limitations on input sequence length, it is essential to develop methods that can deal with very long texts (e.g., entire books) to capture multiple contexts and leverage widely spread cues.
    This dissertation addresses the above challenges by developing new methodologies that advance the state of the art in knowledge extraction from fictional domains:
    - The first contribution is a method, called TiFi, for constructing type systems (taxonomy induction) for fictional domains. By tapping noisy fan-built content from online communities such as Wikia, TiFi induces taxonomies through three main steps: category cleaning, edge cleaning, and top-level construction. Exploiting a variety of features from the original input, TiFi is able to construct taxonomies for a diverse range of fictional domains with high precision.
    - The second contribution is a comprehensive approach, called ENTYFI, for named entity recognition and typing in long fictional texts. Built on 205 automatically induced high-quality type systems for popular fictional domains, ENTYFI exploits the overlap and reuse of these fictional domains on unseen texts. By combining different typing modules with a consolidation stage, ENTYFI is able to perform fine-grained entity typing in long fictional texts with high precision and recall.
    - The third contribution is an end-to-end system, called KnowFi, for extracting relations between entities in very long texts such as entire books.
    KnowFi leverages background knowledge from 142 popular fictional domains to identify interesting relations and to collect distant training samples. KnowFi devises a similarity-based ranking technique to reduce false positives in the training samples and to select candidate text passages that contain seed pairs of entities. By training a hierarchical neural network for all relations, KnowFi is able to infer relations between entity pairs across long fictional texts, and achieves gains over the best prior methods for relation extraction.
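    The distant-supervision step described for KnowFi can be sketched as follows (a hedged simplification; the function and sample data below are hypothetical): given seed entity pairs for a relation drawn from background knowledge, collect passages in which both entities co-occur as candidate training samples, to be filtered later by the similarity-based ranking.

```python
# Sketch of distant-supervision candidate collection for relation
# extraction. Assumption: crude case-insensitive substring matching
# stands in for real entity mention detection; names are illustrative.

def collect_candidate_passages(passages, seed_pairs):
    """Return (passage, subject, object) triples where both entities co-occur."""
    samples = []
    for passage in passages:
        text = passage.lower()
        for subj, obj in seed_pairs:
            if subj.lower() in text and obj.lower() in text:
                samples.append((passage, subj, obj))
    return samples

book_passages = [
    "Frodo inherited Bag End from Bilbo after the party.",
    "Gandalf travelled to Minas Tirith.",
]
seeds = [("Frodo", "Bilbo")]   # e.g. seed pairs for a 'relative' relation
for passage, s, o in collect_candidate_passages(book_passages, seeds):
    print(f"candidate for ({s}, {o}): {passage}")
```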

    Automatic taxonomy evaluation

    This thesis would not have been possible without the generous support of IATA. Taxonomies are an essential knowledge representation and play an important role in classification and numerous knowledge-rich applications, yet quantitative taxonomy evaluation remains overlooked and leaves much to be desired. While studies focus on automatic taxonomy construction (ATC) for extracting meaningful structures and semantics from large corpora, their evaluation is usually manual and subject to bias and low reproducibility. Companies wishing to improve their domain-focused taxonomies also suffer from a lack of ground truths. In fact, manual taxonomy evaluation requires substantial labour and expert knowledge. As a result, we argue in this thesis that automatic taxonomy evaluation (ATE) is just as important as taxonomy construction. We propose two novel taxonomy evaluation methods for automatic taxonomy scoring: a supervised classifier for taxonomies extracted from labelled corpora, and an unsupervised language model that serves as a knowledge source for evaluating hypernymy relations in unlabelled data. We show that our evaluation proxies correlate well with human judgments and that language models can imitate human experts on knowledge-rich tasks.
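    The language-model-as-knowledge-source idea can be sketched with an off-the-shelf masked language model (my illustration of the general idea, not the thesis's exact method): probe the model with a Hearst-style IS-A pattern and read off the probability it assigns to a candidate hypernym.

```python
# Sketch of scoring a hypernymy edge with a masked LM via the Hugging
# Face fill-mask pipeline. Assumption: the prompt template and the use
# of bert-base-uncased are illustrative choices, not the thesis's setup.

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def hypernym_score(hyponym, hypernym):
    """Probability the LM assigns to `hypernym` in an IS-A pattern."""
    prompt = f"A {hyponym} is a type of [MASK]."
    # `targets` restricts scoring to the candidate word.
    result = fill(prompt, targets=[hypernym])
    return result[0]["score"]

print(hypernym_score("dog", "animal"))    # expected: relatively high
print(hypernym_score("dog", "vehicle"))   # expected: much lower
```

    A taxonomy edge whose hypernym scores far below alternatives under such a probe would then be a candidate for low quality, which is one way an unsupervised model can approximate human judgment.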
    • 

    corecore