15 research outputs found

    GeneRIF indexing: sentence selection based on machine learning


    Extraction of semantic biomedical relations from text using conditional random fields

    Background: The increasing amount of published literature in biomedicine represents an immense source of knowledge, which can only be accessed efficiently by a new generation of automated information extraction tools. Named entity recognition of well-defined objects, such as genes or proteins, has achieved a sufficient level of maturity that it can form the basis for the next step: the extraction of relations that exist between the recognized entities. Whereas most early work focused on the mere detection of relations, the classification of the type of relation is also of great importance and is the focus of this work. In this paper we describe an approach that extracts both the existence of a relation and its type. Our work is based on Conditional Random Fields, which have been applied with much success to the task of named entity recognition.
Results: We benchmark our approach on two different tasks. The first task is the identification of semantic relations between diseases and treatments. The available data set consists of manually annotated PubMed abstracts. The second task is the identification of relations between genes and diseases from a set of concise phrases, so-called GeneRIF (Gene Reference Into Function) phrases. In our experimental setting, we do not assume that the entities are given, as is often the case in previous relation extraction work. Rather, the extraction of the entities is solved as a subproblem. Compared with other state-of-the-art approaches, we achieve very competitive results on both data sets. To demonstrate the scalability of our solution, we apply our approach to the complete human GeneRIF database. The resulting gene-disease network contains 34758 semantic associations between 4939 genes and 1745 diseases. The gene-disease network is publicly available as a machine-readable RDF graph.
Conclusion: We extend the framework of Conditional Random Fields towards the annotation of semantic relations from text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive with leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.

    Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases

    Scientists have long been trying to understand the molecular mechanisms of diseases in order to design preventive and therapeutic strategies. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes; one must also uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such a catalogue is extremely difficult. We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. For the first time, we include mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies to all of them. We furthermore provide a functional analysis of disease-related modules that yields important new biological insights, which might not be discovered when considering each of the gene-disease association repositories independently. Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases.
The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download
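
One simple way to look at the shared genetic origin mentioned in this abstract is to count the genes that pairs of diseases have in common. The sketch below is illustrative only and is not the analysis pipeline of the study: it assumes the associations are available as (gene, disease) pairs, and the function name `shared_gene_counts` and the identifiers in the example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_gene_counts(associations):
    """Count genes shared by each pair of diseases.

    `associations` is an iterable of (gene, disease) pairs (an assumed,
    simplified input format). Returns a dict that maps a sorted disease
    pair to the number of genes associated with both diseases. Pairs
    with no gene in common are omitted.
    """
    genes_by_disease = defaultdict(set)
    for gene, disease in associations:
        genes_by_disease[disease].add(gene)  # sets ignore duplicate pairs

    shared = {}
    for d1, d2 in combinations(sorted(genes_by_disease), 2):
        common = genes_by_disease[d1] & genes_by_disease[d2]
        if common:
            shared[(d1, d2)] = len(common)
    return shared


# Hypothetical toy data: G1 links D1 and D2, G2 links D2 and D3.
example_associations = [
    ("G1", "D1"),
    ("G1", "D2"),
    ("G2", "D2"),
    ("G2", "D3"),
    ("G3", "D3"),
    ("G1", "D1"),  # duplicate record, counted once
]
example_shared = shared_gene_counts(example_associations)
```

The result is the disease-disease projection of the bipartite gene-disease network; a real analysis would additionally need to handle identifier mapping across sources and assess whether the overlap is larger than expected by chance.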

    Systems Integration of Biodefense Omics Data for Analysis of Pathogen-Host Interactions and Identification of Potential Targets

    The NIAID (National Institute of Allergy and Infectious Diseases) Biodefense Proteomics program aims to identify targets for potential vaccines, therapeutics, and diagnostics for agents of concern in bioterrorism, including bacterial, parasitic, and viral pathogens. The program includes seven Proteomics Research Centers, generating diverse types of pathogen-host data, including mass spectrometry, microarray transcriptional profiles, protein interactions, protein structures and biological reagents. The Biodefense Resource Center (www.proteomicsresource.org) has developed a bioinformatics framework, employing a protein-centric approach to integrate and support mining and analysis of the large and heterogeneous data. Underlying this approach is a data warehouse with comprehensive protein and gene identifier and name mappings and annotations extracted from over 100 molecular databases. Value-added annotations are provided for key proteins from experimental findings using controlled vocabulary. The availability of pathogen and host omics data in an integrated framework allows global analysis of the data and comparisons across different experiments and organisms, as illustrated in several case studies presented here. (1) A hypothetical protein was identified with differential gene and protein expression in two host systems (mouse macrophage and human HeLa cells) infected by different bacterial (Bacillus anthracis and Salmonella typhimurium) and viral (orthopox) pathogens, suggesting that this protein can be prioritized for additional analysis and functional characterization. (2) The analysis of a vaccinia-human protein interaction network supplemented with protein accumulation levels led to the identification of human Keratin, type II cytoskeletal 4 protein as a potential therapeutic target.
(3) Comparison of complete genomes from pathogenic variants coupled with experimental information on complete proteomes allowed the identification and prioritization of ten potential diagnostic targets from Bacillus anthracis. The integrative analysis across data sets from multiple centers can reveal potential functional significance and hidden relationships between pathogen and host proteins, thereby providing a systems approach to basic understanding of pathogenicity and target identification.

    Hepatitis C virus infection protein network

    A proteome-wide mapping of interactions between hepatitis C virus (HCV) and human proteins was performed to provide a comprehensive view of the cellular infection. A total of 314 protein–protein interactions between HCV and human proteins were identified by yeast two-hybrid and 170 by literature mining. Integration of this data set into a reconstructed human interactome showed that cellular proteins interacting with HCV are enriched in highly central and interconnected proteins. A global analysis on the basis of functional annotation highlighted the enrichment of cellular pathways targeted by HCV. A network of proteins associated with frequent clinical disorders of chronically infected patients was constructed by connecting the insulin, Jak/STAT and TGFβ pathways with cellular proteins targeted by HCV. The CORE protein appeared as a major perturber of this network. Focal adhesion was identified as a new function affected by HCV, mainly by the NS3 and NS5A proteins.

    From Text to Knowledge

    The global information space provided by the World Wide Web has dramatically changed the way knowledge is shared all over the world. To make this unbelievably huge information space accessible, search engines index the uploaded contents and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines such as Google, Yahoo or Bing are keyword-based, which is indisputably a very powerful tool for addressing information needs centered around documents. However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks when searching for specific knowledge about real-world entities. When asked for advanced facts about entities, today's search engines are not very good at providing accurate answers. Hand-built knowledge bases such as Wikipedia or its structured counterpart DBpedia are excellent sources that provide common facts. However, these knowledge bases are far from complete, and most of the knowledge still lies buried in unstructured documents. Statistical machine learning methods have great potential to help bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of today's World Wide Web into a more structured representation. This thesis is devoted to reducing this gap with Probabilistic Graphical Models. Probabilistic Graphical Models play a crucial role in modern pattern recognition as they merge two important fields of applied mathematics: Graph Theory and Probability Theory. The first part of the thesis presents a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations.
An essential part of the system is a novel algorithm for extracting relations between entity mentions that is based on Conditional Random Fields, which are Undirected Probabilistic Graphical Models. In the second part of the thesis, we use the power of Directed Probabilistic Graphical Models to solve important knowledge discovery tasks in semantically annotated large document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that are able to learn, in an unsupervised way, the statistical semantic dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents might refer to concepts originating from a thesaurus or ontology but also to user-generated informal tags in social tagging systems. These forms of annotations represent a first step towards the conversion to a more structured form of the World Wide Web. In the last part of the thesis, we demonstrate the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical textual repository. The resulting knowledge base contains far more potential disease genes than are currently stored in curated databases. Thus, the proposed system is able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is subject to further analysis with respect to existing curated state-of-the-art databases. We analyze the derived knowledge base quantitatively by comparing it with several curated databases with regard to the size of the databases and the properties of known disease genes, among other things. Our experimental analysis shows that the facts extracted from the literature are of high quality.

    Dual Convolutional Neural Networks With Attention Mechanisms Based Method for Predicting Disease-Related lncRNA Genes

    Many studies have indicated that aberrant expression of long non-coding RNA genes (lncRNAs) is closely related to human diseases. Identifying disease-related lncRNAs (disease lncRNAs) is critical for understanding the pathogenesis and etiology of diseases. Most previous methods focus on prioritizing the potential disease lncRNAs based on shallow learning methods. These methods fail to extract the deep and complex feature representations of lncRNA-disease associations. Furthermore, nearly all of them ignore the discriminative contributions of the similarity, association, and interaction relationships among lncRNAs, diseases, and miRNAs for the association prediction. A method based on dual convolutional neural networks with attention mechanisms is presented for predicting candidate disease lncRNAs, and it is referred to as CNNLDA. CNNLDA deeply integrates multiple data sources: the lncRNA similarities, the disease similarities, the lncRNA-disease associations, the lncRNA-miRNA interactions, and the miRNA-disease associations. The diverse biological premises about lncRNAs, miRNAs, and diseases are combined to construct the feature matrix from the biological perspective. A novel framework based on the dual convolutional neural networks is developed to learn the global and attention representations of the lncRNA-disease associations. The left part of the framework exploits the various information contained in the feature matrix to learn the global representation of lncRNA-disease associations. The different connection relationships among the lncRNA, miRNA, and disease nodes and the different features of these nodes make discriminative contributions to the association prediction. Hence we present attention mechanisms at the relationship level and the feature level respectively, and the right part of the framework learns the attention representation of associations.
The experimental results based on cross validation indicate that CNNLDA outperforms several state-of-the-art methods. Case studies on stomach cancer, lung cancer, and colon cancer further demonstrate CNNLDA's ability to discover potential disease lncRNAs.

    Ontology-based methods for disease similarity estimation and drug repositioning

    Title from PDF of title page, viewed on October 2, 2012. Dissertation advisor: Deendayal Dinakarpandian. Vita. Includes bibliographic references (p. 174-181). Thesis (Ph.D.)--School of Computing and Engineering and Dept. of Mathematics and Statistics. University of Missouri--Kansas City, 2012.
Human genome sequencing and new biological data generation techniques have provided an opportunity to uncover mechanisms in human disease. Using gene-disease data, recent research has increasingly shown that many seemingly dissimilar diseases have similar or common molecular mechanisms. Understanding similarity between diseases aids in early disease diagnosis and development of new drugs. The growing collection of gene-function and gene-disease data has created a need for formal knowledge representation in order to extract information. Ontologies have been successfully applied to represent such knowledge, and data mining techniques have been applied to them to extract information. Informatics methods can be used with ontologies to find similarity between diseases, which can yield insight into how they are caused. This can lead to therapies which can actually cure diseases rather than merely treating symptoms. Estimating disease similarity solely on the basis of shared genes can be misleading, as variable combinations of genes may be associated with similar diseases, especially for complex diseases. This deficiency can potentially be overcome by looking for common or similar biological processes rather than only explicit gene matches between diseases. The use of semantic similarity between biological processes to estimate disease similarity could enhance the identification and characterization of disease similarity, besides identifying novel biological processes involved in the diseases. Also, if diseases have similar molecular mechanisms, then drugs that are currently being used could potentially be used against diseases beyond their original indication. This can greatly benefit patients with diseases that do not have adequate therapies, especially people with rare diseases. This can also drastically reduce healthcare costs, as development of new drugs is far more expensive than re-using existing ones. In this research we present functions to measure similarity between terms in an ontology, and between entities annotated with terms drawn from the ontology, based on both co-occurrence and information content. The new similarity measure is shown to outperform existing methods using biological pathways. The similarity measure is then used to estimate similarity among diseases using the biological processes involved in them and is evaluated using a manually curated dataset and external datasets with known disease similarities. Further, we use ontologies to encode diseases, drugs and biological processes and demonstrate a method that uses a network-based algorithm to combine biological data about diseases with drug information to find new uses for existing drugs. The effectiveness of the method is demonstrated by comparing the predicted new disease-drug pairs with existing drug-related clinical trials.
Introduction and motivation -- Ontologies in biomedical domain -- Methods to compute ontological similarity -- Proposed approach for ontological term similarity -- Augmentation of vocabulary and annotation in ontologies -- Estimation of disease similarity -- Use of ontologies for drug repositioning -- Future directions-perspective from pharmaceutical industry -- Appendix 1. Table for the ontological similarity scores -- Appendix 2. Test set of 200 records for evaluating mapping of disease text to Disease Ontology -- Appendix 3. Curated set of disease similarities used as the benchmark set -- Appendix 4. F-scores for different combinations of Score-Pvalues and GO-Process-Pvalues for PSB estimates of disease similarity -- Appendix 5. Test set formed from opinions of medical residents http://rxinformatics.umn.edu/SemanticRelatednessResources.html -- Appendix 6. Drug repositioning candidate

    A dynamic epigenetic network

    3D chromatin contacts can affect diseases by interacting with different variants, inside or outside genes (enhancers, promoters of other genes, etc.), especially within chromatin loops. The aim of this project is to integrate loop and variant information for a better characterization of CLL (chronic lymphocytic leukemia). We explore whether 3D chromatin contacts help to explain mechanistically the effect of CLL-associated variants by integrating 2D data from association studies in the GWAS Catalog and DISGENET with 3D data, specifically Hi-C data (an all-vs-all method for capturing chromosome conformation), obtained from BAM files of aligned sequences. We converted these files to the appropriate format in order to run two different loop callers: PEAKACHU and MUSTACHE. We then kept the most significant loops from both loop callers, filtered the variants from GWAS and DISGENET, and intersected the two types of data, 2D and 3D. Finally, we annotated the chromatin regions in contact with the SNPs, some of which involve more than one loop, and analyzed the genomic context of the variants and of the loop in contact with each variant. The analysis, which takes into account the timing of disease progression, reveals the effect of SNPs on chromatin loops and vice versa in CLL, and B-cell differentiation in the genes that are affected by the variants at chromatin loop contacts.
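
The intersection of variants (2D) with chromatin loops (3D) described in this abstract can be pictured as an interval lookup. The following is a minimal sketch under stated assumptions, not the project's actual pipeline: loops are assumed to be intra-chromosomal with two anchors given as half-open intervals, variants are single positions, and the `Loop` and `Variant` types, the function names, and all identifiers and coordinates in the example are hypothetical.

```python
from typing import NamedTuple


class Loop(NamedTuple):
    """A chromatin loop with two anchors on one chromosome (assumed format)."""
    chrom: str
    start1: int
    end1: int
    start2: int
    end2: int


class Variant(NamedTuple):
    """A single-nucleotide variant (assumed format)."""
    snp_id: str
    chrom: str
    pos: int


def anchors_hit(variant, loop):
    """Return the names of the loop anchors that contain the variant."""
    if variant.chrom != loop.chrom:
        return []
    hits = []
    # Half-open intervals [start, end) are an assumption of this sketch.
    if loop.start1 <= variant.pos < loop.end1:
        hits.append("anchor1")
    if loop.start2 <= variant.pos < loop.end2:
        hits.append("anchor2")
    return hits


def intersect_variants_with_loops(variants, loops):
    """Map each SNP id to the list of (loop, anchors) it falls into."""
    result = {}
    for variant in variants:
        for loop in loops:
            hits = anchors_hit(variant, loop)
            if hits:
                result.setdefault(variant.snp_id, []).append((loop, hits))
    return result


# Hypothetical toy data: rsA lies in the first anchor of two loops.
example_loops = [
    Loop("chr1", 1000, 2000, 50000, 51000),
    Loop("chr1", 1500, 2500, 90000, 91000),
]
example_variants = [
    Variant("rsA", "chr1", 1600),
    Variant("rsB", "chr2", 1600),
    Variant("rsC", "chr1", 50500),
]
example_hits = intersect_variants_with_loops(example_variants, example_loops)
```

The nested loop is quadratic and only suitable for small inputs; at genome scale an interval tree or a dedicated interval tool would be the more usual choice.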