1,206 research outputs found

    The Interaction Network Ontology-Supported Modeling and Mining of Complex Interactions Represented with Multiple Keywords in Biomedical Literature

    Get PDF
    Background: The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. Methods: This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. Results: The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: ‘has literature mining keywords’ and ‘has keyword dependency pattern’. The keyword dependency patterns were generated via running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. Conclusions: By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords

    Extension of the Interaction Network Ontology for literature mining of gene-gene interaction networks from sentences with multiple interaction keywords

    Get PDF
    Abstract. The Interaction Network Ontology (INO) has been demonstrated to be valuable in providing a structured ontological vocabulary for literature mining of gene-gene interactions from biomedical literature. Our analysis of the Learning Logic in Language (LLL) challenge and vaccine datasets showed that many interactions are signaled with 2 or more interaction keywords used in combination. In this paper, we extend the INO by adding combinatory patterns of two or more literature mining keywords to related INO interaction classes. An INO-based literature mining pipeline was further developed based on SPARQL queries and SciMiner, an in-house literature mining program. The majority of the gene interaction sentences from the LLL and vaccine datasets were found to use multiple keywords to represent interaction types. A comprehensive analysis of the LLL dataset identified 27 gene regulation interaction types each associated with multiple keywords. Special patterns were discovered from the hierarchical structure of these 27 INO types

    Knowledge-based Biomedical Data Science 2019

    Full text link
    Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey the progress in the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Chinese Traditional Medicine and biodiversity.Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 table

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Get PDF
    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes on molecular level. The interpretation of microarray gene expression experiments profits from knowledge on the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the used databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied on the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to specify relation types further; e. g., as activating, direct physical, or gene regulatory relation. Part II deals with gene expression data analysis: Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is in the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment together with literature information that supports interpretation. Finally, in Chapter 11 ideas on how the described methods can contribute to current research and possible future directions are presented

    Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities – particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations.</p> <p>Results</p> <p>Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively.</p> <p>Conclusion</p> <p>Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available at <url>http://pharmspresso.stanford.edu</url>.</p

    Activity of microRNAs and transcription factors in Gene Regulatory Networks

    Get PDF
    In biological research, diverse high-throughput techniques enable the investigation of whole systems at the molecular level. The development of new methods and algorithms is necessary to analyze and interpret measurements of gene and protein expression and of interactions between genes and proteins. One of the challenges is the integrated analysis of gene expression and the associated regulation mechanisms. The two most important types of regulators, transcription factors (TFs) and microRNAs (miRNAs), often cooperate in complex networks at the transcriptional and post-transcriptional level and, thus, enable a combinatorial and highly complex regulation of cellular processes. For instance, TFs activate and inhibit the expression of other genes including other TFs whereas miRNAs can post-transcriptionally induce the degradation of transcribed RNA and impair the translation of mRNA into proteins. The identification of gene regulatory networks (GRNs) is mandatory in order to understand the underlying control mechanisms. The expression of regulators is itself regulated, i.e. activating or inhibiting regulators in varying conditions and perturbations. Thus, measurements of gene expression following targeted perturbations (knockouts or overexpressions) of these regulators are of particular importance. The prediction of the activity states of the regulators and the prediction of the target genes are first important steps towards the construction of GRNs. This thesis deals with these first bioinformatics steps to construct GRNs. Targets of TFs and miRNAs are determined as comprehensively and accurately as possible. The activity state of regulators is predicted for specific high-throughput data and specific contexts using appropriate statistical approaches. Moreover, (parts of) GRNs are inferred, which lead to explanations of given measurements. The thesis describes new approaches for these tasks together with accompanying evaluations and validations. This immediately defines the three main goals of the current thesis: 1. The development of a comprehensive database of regulator-target relation. Regulators and targets are retrieved from public repositories, extracted from the literature via text mining and collected into the miRSel database. In addition, relations can be predicted using various published methods. In order to determine the activity states of regulators (see 2.) and to infer GRNs (3.) comprehensive and accurate regulator-target relations are required. It could be shown that text mining enables the reliable extraction of miRNA, gene, and protein names as well as their relations from scientific free texts. Overall, the miRSel contains about three times more relations for the model organisms human, mouse, and rat as compared to state-of-the-art databases (e.g. TarBase, one of the currently most used resources for miRNA-target relations). 2. The prediction of activity states of regulators based on improved target sets. In order to investigate mechanisms of gene regulation, the experimental contexts have to be determined in which the respective regulators become active. A regulator is predicted as active based on appropriate statistical tests applied to the expression values of its set of target genes. For this task various gene set enrichment (GSE) methods have been proposed. Unfortunately, before an actual experiment it is unknown which genes are affected. The missing standard-of-truth so far has prevented the systematic assessment and evaluation of GSE tests. In contrast, the trigger of gene expression changes is of course known for experiments where a particular regulator has been directly perturbed (i.e. by knockout, transfection, or overexpression). Based on such datasets, we have systematically evaluated 12 current GSE tests. In our analysis ANOVA and the Wilcoxon test performed best. 3. The prediction of regulation cascades. Using gene expression measurements and given regulator-target relations (e.g. from the miRSel database) GRNs are derived. GSE tests are applied to determine TFs and miRNAs that change their activity as cellular response to an overexpressed miRNA. Gene regulatory networks can constructed iteratively. Our models show how miRNAs trigger gene expression changes: either directly or indirectly via cascades of miRNA-TF, miRNA-kinase-TF as well as TF-TF relations. In this thesis we focus on measurements which have been obtained after overexpression of miRNAs. Surprisingly, a number of cancer relevant miRNAs influence a common core of TFs which are involved in processes such as proliferation and apoptosis

    Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

    Get PDF
    We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org
    corecore