93 research outputs found

    Comparative analysis of five protein-protein interaction corpora

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate.</p> <p>Results</p> <p>We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty.</p> <p>We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora.</p> <p>Conclusions</p> <p>Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at <url>http://mars.cs.utu.fi/PPICorpora</url>.</p

    Mining the Gene Wiki for functional genomic knowledge

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology.</p> <p>Results</p> <p>Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses.</p> <p>Conclusions</p> <p>The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.</p

    An environment for relation mining over richly annotated corpora: the case of GENIA

    Get PDF
    BACKGROUND: The biomedical domain is witnessing a rapid growth of the amount of published scientific results, which makes it increasingly difficult to filter the core information. There is a real need for support tools that 'digest' the published results and extract the most important information. RESULTS: We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. CONCLUSION: The experiments show that our approach described is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation

    GenCLiP: a software program for clustering gene lists by literature profiling and constructing gene co-occurrence networks related to custom keywords

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Biomedical researchers often want to explore pathogenesis and pathways regulated by abnormally expressed genes, such as those identified by microarray analyses. Literature mining is an important way to assist in this task. Many literature mining tools are now available. However, few of them allows the user to make manual adjustments to zero in on what he/she wants to know in particular.</p> <p>Results</p> <p>We present our software program, GenCLiP (Gene Cluster with Literature Profiles), which is based on the methods presented by Chaussabel and Sher (<it>Genome Biol </it>2002, 3(10):RESEARCH0055) that search gene lists to identify functional clusters of genes based on up-to-date literature profiling. Four features were added to this previously described method: the ability to 1) manually curate keywords extracted from the literature, 2) search genes and gene co-occurrence networks related to custom keywords, 3) compare analyzed gene results with negative and positive controls generated by GenCLiP, and 4) calculate probabilities that the resulting genes and gene networks are randomly related. In this paper, we show with a set of differentially expressed genes between keloids and normal control, how implementation of functions in GenCLiP successfully identified keywords related to the pathogenesis of keloids and unknown gene pathways involved in the pathogenesis of keloids.</p> <p>Conclusion</p> <p>With regard to the identification of disease-susceptibility genes, GenCLiP allows one to quickly acquire a primary pathogenesis profile and identify pathways involving abnormally expressed genes not previously associated with the disease.</p

    Extracting causal relations on HIV drug resistance from literature

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In HIV treatment it is critical to have up-to-date resistance data of applicable drugs since HIV has a very high rate of mutation. These data are made available through scientific publications and must be extracted manually by experts in order to be used by virologists and medical doctors. Therefore there is an urgent need for a tool that partially automates this process and is able to retrieve relations between drugs and virus mutations from literature.</p> <p>Results</p> <p>In this work we present a novel method to extract and combine relationships between HIV drugs and mutations in viral genomes. Our extraction method is based on natural language processing (NLP) which produces grammatical relations and applies a set of rules to these relations. We applied our method to a relevant set of PubMed abstracts and obtained 2,434 extracted relations with an estimated performance of 84% for F-score. We then combined the extracted relations using logistic regression to generate resistance values for each <drug, mutation> pair. The results of this relation combination show more than 85% agreement with the Stanford HIVDB for the ten most frequently occurring mutations. The system is used in 5 hospitals from the Virolab project <url>http://www.virolab.org</url> to preselect the most relevant novel resistance data from literature and present those to virologists and medical doctors for further evaluation.</p> <p>Conclusions</p> <p>The proposed relation extraction and combination method has a good performance on extracting HIV drug resistance data. It can be used in large-scale relation extraction experiments. The developed methods can also be applied to extract other type of relations such as gene-protein, gene-disease, and disease-mutation.</p

    Transboundary determinants of avian zoonotic infectious diseases: challenges for strengthening research capacity and connecting surveillance networks

    Get PDF
    As the climate changes, global systems have become increasingly unstable and unpredictable. This is particularly true for many disease systems, including subtypes of highly pathogenic avian influenzas (HPAIs) that are circulating the world. Ecological patterns once thought stable are changing, bringing new populations and organisms into contact with one another. Wild birds continue to be hosts and reservoirs for numerous zoonotic pathogens, and strains of HPAI and other pathogens have been introduced into new regions via migrating birds and transboundary trade of wild birds. With these expanding environmental changes, it is even more crucial that regions or counties that previously did not have surveillance programs develop the appropriate skills to sample wild birds and add to the understanding of pathogens in migratory and breeding birds through research. For example, little is known about wild bird infectious diseases and migration along the Mediterranean and Black Sea Flyway (MBSF), which connects Europe, Asia, and Africa. Focusing on avian influenza and the microbiome in migratory wild birds along the MBSF, this project seeks to understand the determinants of transboundary disease propagation and coinfection in regions that are connected by this flyway. Through the creation of a threat reduction network for avian diseases (Avian Zoonotic Disease Network, AZDN) in three countries along the MBSF (Georgia, Ukraine, and Jordan), this project is strengthening capacities for disease diagnostics; microbiomes; ecoimmunology; field biosafety; proper wildlife capture and handling; experimental design; statistical analysis; and vector sampling and biology. Here, we cover what is required to build a wild bird infectious disease research and surveillance program, which includes learning skills in proper bird capture and handling; biosafety and biosecurity; permits; next generation sequencing; leading-edge bioinformatics and statistical analyses; and vector and environmental sampling. Creating connected networks for avian influenzas and other pathogen surveillance will increase coordination and strengthen biosurveillance globally in wild birds

    Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation

    Get PDF
    BACKGROUND: High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. RESULTS: The experimental validation was done in two steps. We first applied our method on a controlled test set. After this proved to be successful the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients the biological background of the cancer cells is largely unknown. Using our methodology we found an association of these cells to monocytes, which agreed with other experimental evidence. The second data set consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis we put forward a hypothesis about the biological processes induced in these studied cells: secretory lysosomes are involved in the production of prostatic fluid and their development and/or secretion are androgen-regulated processes. CONCLUSION: Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose

    Re-Annotation Is an Essential Step in Systems Biology Modeling of Functional Genomics Data

    Get PDF
    One motivation of systems biology research is to understand gene functions and interactions from functional genomics data such as that derived from microarrays. Up-to-date structural and functional annotations of genes are an essential foundation of systems biology modeling. We propose that the first essential step in any systems biology modeling of functional genomics data, especially for species with recently sequenced genomes, is gene structural and functional re-annotation. To demonstrate the impact of such re-annotation, we structurally and functionally re-annotated a microarray developed, and previously used, as a tool for disease research. We quantified the impact of this re-annotation on the array based on the total numbers of structural- and functional-annotations, the Gene Annotation Quality (GAQ) score, and canonical pathway coverage. We next quantified the impact of re-annotation on systems biology modeling using a previously published experiment that used this microarray. We show that re-annotation improves the quantity and quality of structural- and functional-annotations, allows a more comprehensive Gene Ontology based modeling, and improves pathway coverage for both the whole array and a differentially expressed mRNA subset. Our results also demonstrate that re-annotation can result in a different knowledge outcome derived from previous published research findings. We propose that, because of this, re-annotation should be considered to be an essential first step for deriving value from functional genomics data
    corecore