Semi-automated ontology generation within OBO-Edit
Motivation: Ontologies and taxonomies have proven highly beneficial for biocuration. The Open Biomedical Ontology (OBO) Foundry alone lists over 90 ontologies mainly built with OBO-Edit. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process. Automating parts of it is of great importance for the further development of ontologies and for biocuration
An Automatic Approach for Bilingual Tuberculosis Ontology Based on Ontology Design Patterns (ODPs)
An ontology is a representation used to describe a domain of knowledge. Manual ontology development is currently considered complex, requiring a lot of time and effort. This research proposes methods for automatically building a bilingual (Indonesian and English) domain ontology for tuberculosis using a text corpus and ontology design patterns (ODPs). The methods combine ontology learning from text with ontology design patterns to reduce the role of expert knowledge. They consist of six stages: term and relation extraction, matching with a tuberculosis glossary, matching with ODPs, computing similarity scores between the terms and relations and the ODPs, ontology building, and ontology evaluation. The resulting ontology contained 362 terms and 44 relations, including 260 added terms. The accuracy of the ontology construction was 71%. Automatic ontology construction had higher complexity, took less time, and reduced the role of expert knowledge, which suggests that the automatic approach is better than manual ontology construction
Automatic Ontology Construction Using Text Corpora and Ontology Design Patterns (ODPs) in Alzheimer's Disease
An ontology is defined as an explicit specification of a conceptualization, and is an important tool for modeling, sharing and reusing domain knowledge. However, constructing an ontology by hand is a complex and time-consuming task. This research presents a fully automatic method for building a bilingual domain ontology from text corpora and ontology design patterns (ODPs) for Alzheimer's disease. The method combines two approaches: ontology learning from texts and matching with ODPs. It consists of six steps: (i) term and relation extraction, (ii) matching with an Alzheimer glossary, (iii) matching with ontology design patterns, (iv) computing similarity scores between the terms and relations and the ODPs, (v) ontology building, and (vi) ontology evaluation. The resulting ontology comprises 381 terms and 184 relations, including 200 new terms and 42 new relations. Compared with manual ontology construction, fully automatic ontology construction has higher complexity, takes less time, and reduces the role of expert knowledge in evaluating the ontology. The proposed method is sufficiently flexible to be applied to other domains
Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?
Author-supplied citations are a fraction of the related literature for a paper. The "related citations" list on PubMed is typically dozens or hundreds of results long, and offers no hints as to why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to navigate more transparently to PubMed updates through search terms that can associate a paper with its citations. The algorithm to generate these search terms involves automatically extracting noun phrases from the paper using natural language processing tools, and ranking them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we define the query as CV-S; when they were written by different authors, we define it as CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D, versus 65% for the top 20 PubMed "related citations." We hypothesize that these quantities, computed for the 20 million papers on PubMed, would be within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper
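The citation-validation definitions in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the data layout (document IDs and author-name sets), and the reading of "same authors" as "shares at least one author with the paper" are all assumptions made for illustration.

```python
def citation_validation(results, citation_authors, paper_authors, top_k=20):
    """Label one search query as CV-S and/or CV-D.

    results          -- ranked list of document IDs returned by the query
    citation_authors -- dict: cited document ID -> set of author names
                        (only the paper's author-supplied citations)
    paper_authors    -- set of author names of the paper itself
    Returns a set of labels; an empty set means "not citation validated".
    """
    labels = set()
    for doc_id in results[:top_k]:
        if doc_id not in citation_authors:
            continue  # result is not one of the author-supplied citations
        # Assumption: "same authors" means at least one shared author.
        shared = citation_authors[doc_id] & paper_authors
        labels.add("CV-S" if shared else "CV-D")
    return labels


# Hypothetical toy data, not taken from the study.
paper_authors = {"Smith", "Jones"}
citations = {"p2": {"Smith", "Lee"}, "p7": {"Nguyen"}}
print(citation_validation(["p1", "p2", "p3"], citations, paper_authors))
print(citation_validation(["p7", "p4"], citations, paper_authors))
```

Returning a set rather than a single label avoids deciding which label takes precedence when a query overlaps with both self-citations and citations by other authors; the abstract does not say how that case is handled.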
Semi-automated Ontology Generation for Biocuration and Semantic Search
Background:
In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies (controlled, hierarchical vocabularies) are being developed.
Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing.
Motivation:
The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences.
Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods.
Results:
The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results.
To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org
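Ontology-based query expansion of the kind described for Go3R can be sketched as follows. This is only an illustration under stated assumptions, not the Go3R implementation: it assumes that "structure" means the descendant terms in the ontology's directed acyclic graph and that "terminology" means synonyms, and the function name and toy terms are hypothetical.

```python
from collections import deque


def expand_query(term, children, synonyms):
    """Expand a query term with its descendants and their synonyms.

    children -- dict: term -> list of direct child terms (a DAG)
    synonyms -- dict: term -> list of synonym strings
    """
    seen = {term}
    queue = deque([term])
    while queue:  # breadth-first walk over all descendants
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:  # a DAG node may be reached twice
                seen.add(child)
                queue.append(child)
    expanded = set(seen)
    for t in seen:
        expanded.update(synonyms.get(t, ()))
    return expanded


# Hypothetical toy ontology, for illustration only.
children = {
    "toxicity test": ["in vitro test", "in vivo test"],
    "in vitro test": ["cell culture assay"],
}
synonyms = {"in vitro test": ["in vitro assay"]}
print(sorted(expand_query("in vitro test", children, synonyms)))
```

A document matching any of the expanded terms would then count as a hit for the original query, which is how expansion can raise recall over a pure keyword search.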
Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed
The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn't even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, but they risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all.
Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way as the MeSH annotations in PubMed.
In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. In order to gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences concern the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments to the average patent compared to the number of MeSH terms assigned to PubMed documents.
These differences cause problems both for inexperienced patent searchers and for professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant for their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, therefore limiting the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search.
For the automated assignment of additional patent classes, we adapt an approach to the patent domain that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers. Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem.
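For readers unfamiliar with the precision/recall figures quoted for the individual classifiers, the following Python sketch shows how the two measures are computed for a single binary classifier, that is, for one IPC class. The function name and the document IDs are hypothetical, and the numbers are toy values rather than results from the thesis.

```python
def precision_recall(predicted, actual):
    """Precision and recall of one binary classifier (one IPC class).

    predicted -- set of document IDs the classifier assigned to the class
    actual    -- set of document IDs that truly belong to the class
    """
    true_positives = len(predicted & actual)
    # Precision: share of assigned documents that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the class's documents that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy data: 4 documents assigned, 5 truly in the class, 3 in common.
predicted = {"d1", "d2", "d3", "d4"}
actual = {"d1", "d2", "d3", "d5", "d6"}
print(precision_recall(predicted, actual))
```

In this toy case one of the four assigned documents is a false positive, which is the kind of error that makes assigning additional classes to specific documents risky when precision stays below 1.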
For the guided patent search, we demonstrate various methods to expand a user's initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible
Automated extension of biomedical ontologies
Developing and extending a biomedical ontology is a very demanding
process, particularly because biomedical knowledge is diverse, complex
and continuously changing and growing. Existing automated
and semi-automated techniques are not tailored to handling the issues
in extending biomedical ontologies.
This thesis advances the state of the art in semi-automated ontology
extension by presenting a framework as well as methods and
methodologies for automating ontology extension specifically designed
to address the features of biomedical ontologies. The overall strategy is
based on first predicting the areas of the ontology that are in need of
extension and then applying ontology learning and ontology matching
techniques to extend them. A novel machine learning approach for
predicting these areas based on features of past ontology versions was
developed and successfully applied to the Gene Ontology. Methods
and techniques were also specifically designed for matching biomedical
ontologies and retrieving relevant biomedical concepts from text,
which were shown to be successful in several applications.
Fundação para a Ciência e a Tecnologia
Developing Ontological Background Knowledge for Biomedicine
Biomedicine is an impressively fast developing, interdisciplinary field of
research. To control the growing volumes of biomedical data, ontologies are
increasingly used as common organization structures. Biomedical ontologies
describe domain knowledge in a formal, computationally accessible way. They
serve as controlled vocabularies and background knowledge in applications
dealing with the integration, analysis and retrieval of heterogeneous types
of data. The development of biomedical ontologies, however, is hampered by
specific challenges. They include the lack of quality standards, resulting
in very heterogeneous resources, and the decentralized development of
biomedical ontologies, causing the increasing fragmentation of domain
knowledge across them.
In the first part of this thesis, a life cycle model for biomedical
ontologies is developed, which is intended to cope with these challenges.
It comprises the stages "requirements analysis", "design and
implementation", "evaluation", "documentation and release" and
"maintenance". For each stage, associated subtasks and activities are
specified. To promote quality standards for biomedical ontology
development, an emphasis is set on the evaluation stage. As part of it,
comprehensive evaluation procedures are specified, which allow the quality
of ontologies to be assessed on various levels. To tackle the issue of
knowledge fragmentation, the life cycle model is extended to also cover
ontology alignments. Ontology alignments specify mappings between related
elements of different ontologies. By making potential overlaps and
similarities between ontologies explicit, they support the integration of
ontologies and help reduce the fragmentation of knowledge.
In the second part of this thesis, the life cycle model for biomedical
ontologies and alignments is validated by means of five case studies. The
results confirm that the model is effective. Four of the case studies
demonstrate that it is able to support the development of useful new
ontologies and alignments. The latter facilitate novel natural language
processing and bioinformatics applications, and in one case constitute the
basis of a task of the "BioNLP shared task 2013", an international
challenge on biomedical information extraction. The fifth case study shows
that the presented evaluation procedures are an effective means to check
and improve the quality of ontology alignments. Hence, they support the
crucial task of quality assurance of alignments, which are themselves
increasingly used as reference standards in evaluations of automatic
ontology alignment systems. Both the presented life cycle model and the
ontologies and alignments that have resulted from its validation improve
information and knowledge management in biomedicine and thus promote
biomedical research