Ontology-based Assisted Curation of Biomedical Data
Manual curation of biomedical data is highly accurate but time-consuming, and does not scale with the ever-increasing growth of biomedical literature. Text mining, as a high-throughput computational technique, scales well but requires human expertise to produce highly accurate results. Ontologies can help organize large quantities of unstructured information. Here we present three systems, namely GoGene, GoPubMed and GoWeb, that employ biomedical ontologies, and show how they can assist manual curation of biomedical data.

GoGene associates all genes from different model organisms with concepts of the Gene Ontology (GO) and the Medical Subject Headings (MeSH). The hierarchical structures of both terminologies support clustering and summarizing long lists of genes. Through the integration of known gene annotations from UniProt and EntrezGene with text-mined annotations from all abstracts in PubMed, GoGene currently contains up to 4,000,000 associations between genes and concepts from GO and MeSH for ten model organisms. The quality of all associations can be verified by following the links to their origin, that is, literature or database entries.
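How a term hierarchy supports summarizing a gene list can be illustrated with a minimal sketch. It is not GoGene's implementation: the toy ontology, the gene names, and the functions `ancestors` and `summarize` are all hypothetical, and the term labels are placeholders rather than real GO or MeSH identifiers. The idea shown is only that each annotation is propagated to all ancestor terms, so broader terms collect the genes of their descendants.

```python
from collections import defaultdict

# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "dna binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical gene-to-term annotations (e.g. from a database or text mining).
ANNOTATIONS = {
    "geneA": ["kinase activity"],
    "geneB": ["dna binding"],
    "geneC": ["kinase activity", "dna binding"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def summarize(annotations, parents):
    """Map each term to the genes annotated to it or to any of its descendants."""
    genes_by_term = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            for t in ancestors(term, parents):
                genes_by_term[t].add(gene)
    return dict(genes_by_term)


summary = summarize(ANNOTATIONS, PARENTS)
# List terms from the most to the least populated.
for term in sorted(summary, key=lambda t: (-len(summary[t]), t)):
    print(term, sorted(summary[term]))
```

Sorting terms by the number of genes they cover gives a coarse-to-fine overview of a long gene list, which is the kind of clustering the hierarchy makes possible.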

GoPubMed aims to reduce the limitations of classical keyword search. It handles inconsistent vocabulary such as synonyms and specialized terminology. It shows the most relevant concepts in GO and MeSH for a search and thus reveals information that otherwise remains buried in the masses of text. This feature, together with the entire bibliography of all authors in PubMed, facilitates comprehensive literature search. GoWeb translates these ideas to the World Wide Web and is thus not limited to PubMed abstracts. GoWeb uses a standard web-search service and organizes search results based on GO, MeSH, and other concepts such as companies and institutions.
Improved mutation tagging with gene identifiers applied to membrane protein stability prediction
Background
The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets.
Results
We developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which achieves an F-measure of 87% for the mutation retrieval task on a benchmark dataset. To link mutations to their proteins, we use a named entity recognition algorithm to identify gene names co-occurring in the abstract, and establish links based on sequence checks. Conversely, we show that gene recognition improved from 77% to 91% F-measure when mutation information given in the text is taken into account. To demonstrate practical relevance, we use mutation information from text to evaluate a novel solvation energy-based model for the prediction of stabilizing regions in membrane proteins. For five G protein-coupled receptors we identified 35 relevant single mutations and associated phenotypes, none of which had been annotated in the UniProt or PDB databases. In 71% of cases, the reported phenotypes were in agreement with the model predictions, supporting a relation between mutations and stability issues in membrane proteins.
Conclusion
We present a reliable approach for the retrieval of protein mutations from PubMed abstracts for any set of genes or proteins of interest. We further demonstrate how amino acid substitution information from text can be used for protein structure stability studies on the basis of a novel energy model.
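The regular-expression side of such a mutation retrieval step, and one plausible reading of the "sequence checks" mentioned in the Results, can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the two patterns cover just the common forms `A123T` and `Ala123Thr`, the function names are hypothetical, and the assumption that a sequence check means comparing the wild-type residue against the candidate protein sequence is ours.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Wild-type residue, position, mutant residue, e.g. "A123T" or "Ala123Thr".
ONE_LETTER = re.compile(rf"\b([{ONE}])(\d+)([{ONE}])\b")
THREE_LETTER = re.compile(rf"\b({THREE})(\d+)({THREE})\b")


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    found = []
    for m in ONE_LETTER.finditer(text):
        wt, pos, mut = m.groups()
        found.append((m.start(), wt, int(pos), mut))
    for m in THREE_LETTER.finditer(text):
        wt, pos, mut = m.groups()
        found.append((m.start(), THREE_TO_ONE[wt], int(pos), THREE_TO_ONE[mut]))
    return [(wt, pos, mut) for _, wt, pos, mut in sorted(found)]


def consistent_with_sequence(mutation, sequence):
    """Assumed sequence check: the wild-type residue must occur at the
    stated 1-based position of the candidate protein sequence."""
    wt, pos, _ = mutation
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == wt


text = "The A123T and Gly45Asp substitutions reduced stability."
print(find_mutations(text))
```

Patterns this simple also match look-alike tokens such as cell-line names (for example `T47D`), which is one reason a check against the protein sequence of a co-occurring gene is useful for filtering and for linking a mutation to the right protein.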
Systematic feature evaluation for gene name recognition
In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine (SVM), combined with pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as prefixes or suffixes, character n-grams, patterns of capitalization, or the classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination (RFE). By systematically reducing the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7% compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
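The kinds of features listed above, extracted over a sliding window of tokens, can be sketched in a few lines. This is a hypothetical illustration, not the feature set of the described system: the function names, the fixed affix length of three, the trigram size, and the window size are all assumptions, and the SVM training and the RFE step themselves are not shown.

```python
import re


def word_features(word):
    """Surface features of a single token, returned as a set of strings."""
    w = word.lower()
    feats = {f"prefix={w[:3]}", f"suffix={w[-3:]}"}
    # Character trigrams of the lowercased token.
    feats.update(f"3gram={w[i:i + 3]}" for i in range(len(w) - 2))
    # Capitalization pattern: uppercase -> A, lowercase -> a, digit -> 0.
    shape = re.sub(r"[A-Z]", "A", word)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    feats.add(f"shape={shape}")
    return feats


def window_features(tokens, index, size=1):
    """Features of the token at `index` and of its neighbours within
    `size` positions, each tagged with its relative offset."""
    feats = set()
    for offset in range(-size, size + 1):
        pos = index + offset
        if 0 <= pos < len(tokens):
            feats.update(f"{offset:+d}:{f}" for f in word_features(tokens[pos]))
        else:
            # Mark window positions that fall outside the sentence.
            feats.add(f"{offset:+d}:<pad>")
    return feats


tokens = ["The", "BRCA1", "gene"]
for feature in sorted(window_features(tokens, 1)):
    print(feature)
```

Each feature string would become one dimension of a sparse binary vector for the classifier. A feature-elimination procedure then repeatedly drops the dimensions that contribute least, which is how the impact of whole feature groups (affixes, n-grams, shapes, context) could be compared.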
GoPubMed: Exploring Pubmed with Ontological Background Knowledge
With the ever-increasing size of the scientific literature, finding relevant documents and answering questions has become even more of a challenge. Recently, ontologies (hierarchical, controlled vocabularies) have been introduced to annotate genomic data. They can also improve question answering and the selection of relevant documents in literature search. Search engines such as GoPubMed.org use ontological background knowledge to give an overview of large query results and to help answer questions. We review the problems and solutions underlying these next-generation intelligent search engines and give examples of the power of this new search paradigm.
Learning Patterns for Information Extraction from Free Text
We describe a general approach to the task of information extraction from free text and propose methods for learning syntax patterns automatically from annotated corpora. We study the application of our approach to the extraction of protein-protein interactions from scientific texts. Based on this evaluation, we find that learned patterns outperform techniques based on handcrafted patterns.