10 research outputs found

    Semi-automated ontology generation within OBO-Edit

    Motivation: Ontologies and taxonomies have proven highly beneficial for biocuration. The Open Biomedical Ontology (OBO) Foundry alone lists over 90 ontologies mainly built with OBO-Edit. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process. Automating parts of it is of great importance for the further development of ontologies and for biocuration

    An Automatic Approach for Bilingual Tuberculosis Ontology Based on Ontology Design Patterns (ODPs)

    An ontology is a representation used to describe a domain of knowledge. Manual ontology development is currently considered complex, requiring a lot of time and effort. This research proposes methods to automatically build a bilingual (Indonesian and English) domain ontology for tuberculosis from a text corpus and ontology design patterns (ODPs). The methods combine ontology learning from text with ontology design patterns to decrease the role of expert knowledge. They consist of six stages: term and relation extraction, matching with a tuberculosis glossary, matching with ODPs, computation of similarity scores between the terms and relations and the ODPs, ontology building, and ontology evaluation. The constructed ontology contained 362 terms and 44 relations, of which 260 terms were newly added. The accuracy of the ontology construction was 71%. Automatic ontology construction had higher complexity, took less time, and reduced the role of expert knowledge, which suggests that automatic ontology construction is better than manual ontology construction

    Automatic Ontology Construction Using Text Corpora and Ontology Design Patterns (ODPs) in Alzheimer's Disease

    An ontology is defined as an explicit specification of a conceptualization, and it is an important tool for modeling, sharing and reusing domain knowledge. However, constructing an ontology by hand is a complex and time-consuming task. This research presents a fully automatic method to build a bilingual domain ontology from text corpora and ontology design patterns (ODPs) in Alzheimer's disease. The method combines two approaches: ontology learning from texts and matching with ODPs. It consists of six steps: (i) term and relation extraction, (ii) matching with an Alzheimer glossary, (iii) matching with ontology design patterns, (iv) computation of similarity scores between the terms and relations and the ODPs, (v) ontology building, and (vi) ontology evaluation. The resulting ontology comprises 381 terms and 184 relations, of which 200 terms and 42 relations are new. Compared with manual ontology construction, fully automatic ontology construction has higher complexity, takes less time, and reduces the role of expert knowledge in evaluating the ontology. The proposed method is sufficiently flexible to be applied to other domains

    Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?

    Author-supplied citations are a fraction of the related literature for a paper. The “related citations” list on PubMed is typically dozens or hundreds of results long, and does not offer hints as to why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to more transparently navigate to PubMed updates through search terms that can associate a paper with its citations. The algorithm to generate these search terms involved automatically extracting noun phrases from the paper using natural language processing tools, and ranking them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we define the query as CV-S; when they were written by different authors, we define it as CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D versus 65% for the top 20 PubMed “related citations.” We hypothesize that these quantities, computed for the 20 million papers on PubMed, would be within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten items per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper
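
The citation-validation definitions in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify_query`, the `authors_of` lookup table, and the rule that "same authors" means sharing at least one author with the paper are all assumptions made for illustration.

```python
def classify_query(top_results, cited_ids, paper_authors, authors_of, k=20):
    """Label one search query as CV-S and/or CV-D (hypothetical helper).

    top_results   -- ranked list of document IDs returned for the query
    cited_ids     -- set of document IDs cited by the paper's authors
    paper_authors -- authors of the paper itself
    authors_of    -- dict mapping document ID -> list of its authors
    k             -- how many top results to inspect (20 in the abstract)
    """
    labels = set()
    own_authors = set(paper_authors)
    for doc_id in top_results[:k]:
        if doc_id not in cited_ids:
            continue  # no overlap with the author-supplied citations
        # Assumption: "same authors" means at least one shared author.
        if own_authors & set(authors_of.get(doc_id, ())):
            labels.add("CV-S")
        else:
            labels.add("CV-D")
    return labels  # an empty set means the query is not citation validated


if __name__ == "__main__":
    authors_of = {"a": ["Smith"], "b": ["Jones"]}
    print(classify_query(["b", "a", "x"], {"a", "b"}, ["Smith"], authors_of))
```

Returning a set rather than a single label lets one query count as both CV-S and CV-D when its top results overlap with citations of both kinds.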

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies.
Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org

    Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

    The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn’t even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, but they risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. 
In order to gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences concern the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments to the average patent compared to the number of MeSH terms assigned to PubMed documents. As a result of these differences, problems are caused both for unexperienced patent searchers and professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant for their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, therefore limiting the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: First, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search. For the automated assignment of additional patent classes, we adapt an approach to the patent domain that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers. 
Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user’s initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help unexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible

    Automated extension of biomedical ontologies

    Developing and extending a biomedical ontology is a very demanding process, particularly because biomedical knowledge is diverse, complex and continuously changing and growing. Existing automated and semi-automated techniques are not tailored to handling the issues in extending biomedical ontologies. This thesis advances the state of the art in semi-automated ontology extension by presenting a framework as well as methods and methodologies for automating ontology extension specifically designed to address the features of biomedical ontologies. The overall strategy is based on first predicting the areas of the ontology that are in need of extension and then applying ontology learning and ontology matching techniques to extend them. A novel machine learning approach for predicting these areas based on features of past ontology versions was developed and successfully applied to the Gene Ontology. Methods and techniques were also specifically designed for matching biomedical ontologies and retrieving relevant biomedical concepts from text, which were shown to be successful in several applications. Fundação para a Ciência e a Tecnologia

    Developing Ontological Background Knowledge for Biomedicine

    Biomedicine is an impressively fast developing, interdisciplinary field of research. To control the growing volumes of biomedical data, ontologies are increasingly used as common organization structures. Biomedical ontologies describe domain knowledge in a formal, computationally accessible way. They serve as controlled vocabularies and background knowledge in applications dealing with the integration, analysis and retrieval of heterogeneous types of data. The development of biomedical ontologies, however, is hampered by specific challenges. They include the lack of quality standards, resulting in very heterogeneous resources, and the decentralized development of biomedical ontologies, causing the increasing fragmentation of domain knowledge across them. In the first part of this thesis, a life cycle model for biomedical ontologies is developed, which is intended to cope with these challenges. It comprises the stages "requirements analysis", "design and implementation", "evaluation", "documentation and release" and "maintenance". For each stage, associated subtasks and activities are specified. To promote quality standards for biomedical ontology development, an emphasis is set on the evaluation stage. As part of it, comprehensive evaluation procedures are specified, which make it possible to assess the quality of ontologies on various levels. To tackle the issue of knowledge fragmentation, the life cycle model is extended to also cover ontology alignments. Ontology alignments specify mappings between related elements of different ontologies. By making potential overlaps and similarities between ontologies explicit, they support the integration of ontologies and help reduce the fragmentation of knowledge. In the second part of this thesis, the life cycle model for biomedical ontologies and alignments is validated by means of five case studies. The case studies confirm that the model is effective.
Four of the case studies demonstrate that it is able to support the development of useful new ontologies and alignments. The latter facilitate novel natural language processing and bioinformatics applications, and in one case constitute the basis of a task of the "BioNLP shared task 2013", an international challenge on biomedical information extraction. The fifth case study shows that the presented evaluation procedures are an effective means to check and improve the quality of ontology alignments. Hence, they support the crucial task of quality assurance of alignments, which are themselves increasingly used as reference standards in evaluations of automatic ontology alignment systems. Both the presented life cycle model and the ontologies and alignments that resulted from its validation improve information and knowledge management in biomedicine and thus promote biomedical research