
    Finding disease similarity based on implicit semantic similarity

    Genomics has contributed to a growing collection of gene–function and gene–disease annotations that can be exploited by informatics to study similarity between diseases. This can yield insight into disease etiology, reveal common pathophysiology and/or suggest treatment that can be appropriated from one disease to another. Estimating disease similarity solely on the basis of shared genes can be misleading, as variable combinations of genes may be associated with similar diseases, especially for complex diseases. This deficiency can potentially be overcome by looking for common biological processes rather than only explicit gene matches between diseases. The use of semantic similarity between biological processes to estimate disease similarity could enhance the identification and characterization of disease similarity. We present functions to measure similarity between terms in an ontology, and between entities annotated with terms drawn from the ontology, based on both co-occurrence and information content. The similarity measure is shown to outperform other measures used to detect similarity. A manually curated dataset with known disease similarities was used as a benchmark to compare the estimation of disease similarity based on gene-based and Gene Ontology (GO) process-based comparisons. The detection of disease similarity based on semantic similarity between GO processes (Recall=55%, Precision=60%) performed better than using exact matches between GO processes (Recall=29%, Precision=58%) or gene overlap (Recall=88%, Precision=16%). The GO-process-based disease similarity scores on an external test set show a statistically significant Pearson correlation (0.73) with numeric scores provided by medical residents. GO processes associated with similar diseases were found to be significantly regulated in gene expression microarray datasets of related diseases.
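    The abstract mentions term similarity based on information content, but does not reproduce the paper's functions. As a rough illustration only, the sketch below computes a Lin-style information-content similarity over a toy GO-like hierarchy; the term names, parent relations and annotation counts are invented assumptions, not the paper's data.

```python
# Minimal sketch of an information-content (IC) based term similarity,
# in the spirit of the measures described above (the paper's exact
# functions are not reproduced here). The toy ontology and counts are
# illustrative assumptions.
import math

# Hypothetical parent relations of a tiny GO-like ontology (child -> parents).
PARENTS = {
    "apoptotic_process": {"cell_death"},
    "cell_death": {"biological_process"},
    "inflammatory_response": {"immune_response"},
    "immune_response": {"biological_process"},
    "biological_process": set(),
}

# Hypothetical annotation counts used to estimate term probabilities.
ANNOTATION_COUNTS = {
    "apoptotic_process": 40,
    "cell_death": 60,
    "inflammatory_response": 30,
    "immune_response": 80,
    "biological_process": 200,
}
TOTAL = ANNOTATION_COUNTS["biological_process"]

def ancestors(term):
    """All ancestors of a term, including the term itself."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ic(term):
    """Information content: -log of the term's annotation probability."""
    return -math.log(ANNOTATION_COUNTS[term] / TOTAL)

def lin_similarity(t1, t2):
    """Lin-style similarity: 2*IC(most informative common ancestor) / (IC(t1) + IC(t2))."""
    common = ancestors(t1) & ancestors(t2)
    ic_mica = max(ic(t) for t in common)
    denom = ic(t1) + ic(t2)
    return 2 * ic_mica / denom if denom else 0.0

print(lin_similarity("apoptotic_process", "cell_death"))             # related terms -> high
print(lin_similarity("apoptotic_process", "inflammatory_response"))  # only share the root -> 0.0
```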

    Hybrid approach for disease comorbidity and disease gene prediction using heterogeneous dataset

    High-throughput analysis and large-scale integration of biological data have driven much recent research in bioinformatics. Recent years have witnessed the development of various methods for disease-associated gene prediction and disease comorbidity prediction. Most existing techniques use network-based or similarity-based approaches for these predictions. Although network-based approaches perform better, they rely on text data from OMIM records and PubMed abstracts. In this work, a novel algorithm (HDCDGP) is proposed for disease comorbidity prediction and disease-associated gene prediction. A disease comorbidity network and a disease–gene network were constructed using data from the Gene Ontology (GO), the Human Phenotype Ontology (HPO), protein–protein interactions (PPI) and pathway datasets. A modified random walk with restart algorithm was applied to these networks to extract novel disease–gene associations. Experimental results showed that the hybrid approach performs better than existing systems, with an overall accuracy of around 85%.
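    The abstract names a modified random walk with restart (RWR), but the modification is not specified. The sketch below is plain RWR on a small assumed network, shown only to make the ranking idea concrete; the adjacency matrix, seed genes and restart probability are illustrative assumptions.

```python
# Minimal sketch of random walk with restart (RWR) on a gene network;
# the "modified" variant used by HDCDGP is not specified in the abstract,
# so this is the standard formulation under assumed inputs.
import numpy as np

def random_walk_with_restart(adjacency, seed_indices, restart_prob=0.7,
                             tol=1e-6, max_iter=1000):
    """Steady-state visiting probabilities of an RWR started from the seed
    nodes (e.g. genes already linked to a disease)."""
    # Column-normalize the adjacency matrix to get transition probabilities.
    col_sums = adjacency.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0      # avoid division by zero for isolated nodes
    transition = adjacency / col_sums

    n = adjacency.shape[0]
    restart = np.zeros(n)
    restart[list(seed_indices)] = 1.0 / len(seed_indices)

    p = restart.copy()
    for _ in range(max_iter):
        p_next = (1 - restart_prob) * transition @ p + restart_prob * restart
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

# Toy 5-node network; nodes 0 and 1 play the role of known disease genes (assumed).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
scores = random_walk_with_restart(A, seed_indices=[0, 1])
print(np.argsort(-scores))  # nodes ranked by proximity to the seed genes
```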

    Evaluating Wikipedia as a source of information for disease understanding

    The increasing availability of biological data is improving our understanding of diseases and providing new insight into their underlying relationships. Thanks to improvements in both text mining techniques and computational capacity, the combination of biological data with semantic information obtained from medical publications has proven to be a very promising path. However, limited access to these data and their lack of structure pose challenges to this approach. In this document we propose the use of Wikipedia, the free online encyclopedia, as a source of accessible textual information for disease understanding research. To check its validity, we compare its performance in determining relationships between diseases with that of PubMed, one of the most consulted sources of medical texts. The results suggest that the information extracted from Wikipedia is as relevant as that obtained from PubMed abstracts (i.e. the freely accessible portion of its articles), although further research is proposed to verify its reliability for medical studies. (6 pages, 5 figures, 5 tables; published at the 2018 IEEE 31st International Symposium on Computer-Based Medical Systems, CBMS 2018.)
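    The abstract does not detail how relationships between diseases are derived from the text sources. As a generic illustration of one common approach, the sketch below scores disease–disease relatedness by TF-IDF cosine similarity over per-disease text; the disease texts are placeholders, and this is not claimed to be the paper's pipeline.

```python
# Minimal sketch of estimating disease-disease relatedness from free text
# (e.g. Wikipedia pages or PubMed abstracts); the example texts below are
# placeholders, not real article content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

disease_pages = {
    "type 2 diabetes": "insulin resistance hyperglycemia obesity metformin glucose",
    "obesity": "body mass index insulin resistance adipose tissue weight",
    "asthma": "airway inflammation bronchoconstriction wheezing inhaler",
}

names = list(disease_pages)
tfidf = TfidfVectorizer().fit_transform([disease_pages[n] for n in names])
similarity = cosine_similarity(tfidf)

for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} ~ {b}: {similarity[i, j]:.2f}")
```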

    Towards optimize-ESA for text semantic similarity: A case study of biomedical text

    Explicit Semantic Analysis (ESA) is an approach to measuring the semantic relatedness between terms or documents based on their similarity to documents of a reference corpus, usually Wikipedia. ESA has received tremendous attention in the fields of natural language processing (NLP) and information retrieval. However, ESA relies on a huge Wikipedia index matrix in its interpretation step, multiplying a large matrix by a term vector to produce a high-dimensional vector. Consequently, the interpretation and similarity steps of ESA are expensive, and much time is lost in unnecessary operations. This paper proposes an enhancement to ESA, called optimize-ESA, that reduces the dimensionality at the interpretation stage by computing semantic similarity within a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full version of the ESA approach.
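    To make the interpretation step concrete, the sketch below shows standard ESA (term vector multiplied by a term-by-concept index matrix) plus a crude domain filter over the concept dimensions, in the spirit of the reduction described above. The vocabulary, index matrix and domain mask are toy assumptions, not the paper's actual optimization.

```python
# Minimal sketch of ESA-style similarity with a domain restriction on the
# concept space; the index matrix and domain concept list are assumptions.
import numpy as np

VOCAB = ["gene", "protein", "disease", "football", "election"]
CONCEPTS = ["Genetics", "Oncology", "Sports", "Politics"]

# Rows = vocabulary terms, columns = Wikipedia concepts (assumed TF-IDF weights).
INDEX = np.array([
    [0.9, 0.4, 0.0, 0.0],   # gene
    [0.8, 0.5, 0.0, 0.0],   # protein
    [0.3, 0.9, 0.0, 0.1],   # disease
    [0.0, 0.0, 0.9, 0.1],   # football
    [0.0, 0.0, 0.1, 0.9],   # election
])

def interpret(text, concept_mask=None):
    """Map a text to an ESA concept vector; optionally keep only the
    concepts of a chosen domain (the reduction step sketched here)."""
    term_vec = np.array([text.lower().split().count(t) for t in VOCAB], dtype=float)
    concept_vec = INDEX.T @ term_vec
    if concept_mask is not None:
        concept_vec = concept_vec * concept_mask
    norm = np.linalg.norm(concept_vec)
    return concept_vec / norm if norm else concept_vec

def esa_similarity(a, b, concept_mask=None):
    """Cosine similarity of the (optionally domain-restricted) concept vectors."""
    return float(interpret(a, concept_mask) @ interpret(b, concept_mask))

# Restrict interpretation to biomedical concepts only (assumed domain filter).
biomed_mask = np.array([1.0, 1.0, 0.0, 0.0])
print(esa_similarity("gene mutation causes disease",
                     "protein linked to disease",
                     concept_mask=biomed_mask))
```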

    Robust Selection of Semantic Similarity Measures from Uncertain Expert Data [Sélection Robuste de Mesures de Similarité Sémantique à partir de Données Incertaines d'Expertise]

    Knowledge-based semantic measures are the cornerstone of exploiting ontologies not only for exact inference and retrieval processes, but also for data analysis and inexact search. Abstract theoretical frameworks have recently been proposed in order to study the large diversity of measures available; they demonstrate that groups of measures are particular instantiations of general parameterized functions. In this paper, we study how such frameworks can be used to support the selection and design of measures. Based on (i) a theoretical framework unifying the measures, (ii) a software solution implementing this framework and (iii) a domain-specific benchmark, we define a semi-supervised learning technique to identify the best measures for a concrete application. Next, considering uncertainty in both the experts' judgments and the measure selection process, we extend this proposal to a robust selection of the semantic measures that best resist these uncertainties. We illustrate our approach through a real use case in the biomedical domain. Keywords: unifying framework, robustness of measures, expert uncertainty, semantic similarity measures, ontologies.
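    The abstract does not spell out how the best and most robust measure is chosen. As one plausible reading, the sketch below scores candidate measures by rank correlation with expert ratings and prefers the measure whose correlation degrades least when the uncertain ratings are perturbed; the measures, benchmark pairs, ratings and noise model are all toy assumptions, not the paper's framework or software.

```python
# Minimal sketch of selecting a semantic measure against uncertain expert
# ratings: nominal fit = Spearman correlation with the ratings; robustness =
# average correlation under random perturbation of those ratings.
import random
from scipy.stats import spearmanr

# Hypothetical benchmark: expert similarity ratings for six concept pairs.
expert_ratings = [4.5, 3.0, 1.0, 4.0, 0.5, 2.5]

# Hypothetical scores produced by two candidate measures on the same pairs.
measure_scores = {
    "measure_A": [0.92, 0.60, 0.15, 0.85, 0.10, 0.55],
    "measure_B": [0.80, 0.75, 0.30, 0.70, 0.20, 0.40],
}

def robustness(scores, ratings, noise=0.5, trials=200, seed=0):
    """Average Spearman correlation with perturbed ratings, modelling
    uncertainty in the experts' judgments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [r + rng.uniform(-noise, noise) for r in ratings]
        total += spearmanr(scores, noisy)[0]
    return total / trials

for name, scores in measure_scores.items():
    nominal = spearmanr(scores, expert_ratings)[0]
    print(name,
          "nominal:", round(nominal, 3),
          "robust:", round(robustness(scores, expert_ratings), 3))
```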

    Evaluating the Consistency of Gene Sets Used in the Analysis of Bacterial Gene Expression Data

    Background: Statistical analyses of whole-genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed. Results: We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size. Conclusions: Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses of such data. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve the statistical power and utility of expression data.
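    The paper's specific consistency metrics are not given in the abstract. As a rough stand-in, the sketch below computes one plausible metric, the mean pairwise correlation of expression profiles for genes in a set, and compares it with random sets of the same size; the expression data are simulated, not the 3581 arrays analyzed in the paper.

```python
# Minimal sketch of one plausible gene-set "consistency" metric on simulated
# expression data; the paper's actual metrics are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_arrays = 500, 40
expression = rng.normal(size=(n_genes, n_arrays))

# Make a "coherent" gene set by adding a shared signal to 20 genes (assumed).
coherent_set = np.arange(20)
expression[coherent_set] += rng.normal(size=n_arrays)

def set_consistency(expr, gene_indices):
    """Mean pairwise Pearson correlation among the genes of a set."""
    corr = np.corrcoef(expr[gene_indices])
    upper = corr[np.triu_indices_from(corr, k=1)]
    return upper.mean()

observed = set_consistency(expression, coherent_set)
random_scores = [set_consistency(expression, rng.choice(n_genes, 20, replace=False))
                 for _ in range(100)]
print("observed:", round(float(observed), 3),
      "random mean:", round(float(np.mean(random_scores)), 3))
```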

    An Effective Method to Measure Disease Similarity Using Gene and Phenotype Associations

    Motivation: In order to create controlled vocabularies for shared use across biomedical domains, a large number of biomedical ontologies, such as the Disease Ontology (DO) and the Human Phenotype Ontology (HPO), have been created in the bioinformatics community. Quantitative measures of the associations among diseases can help researchers gain deep insight into human diseases, since similar diseases usually share molecular origins or phenotypes; this helps reveal the common attributes of diseases and improve the corresponding diagnoses and treatment plans. Several methods have been proposed in recent years to measure disease similarity using a particular biomedical ontology, but for a newly discovered disease, or a disease with little related genetic information in the Disease Ontology (i.e., few disease–gene associations), these approaches fall short because they do not jointly compute disease similarity by integrating gene and phenotype associations. Results: In this paper we propose a novel method called GPSim to effectively deduce the semantic similarity of diseases. In particular, GPSim calculates similarity by jointly utilizing gene, disease and phenotype associations extracted from multiple biomedical ontologies and databases. We also explore the phenotypic factors, such as the depth of HPO terms and the number of phenotypic associations, that affect evaluation performance. A final experimental evaluation assesses the performance of GPSim and shows its advantages over previous approaches.
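    GPSim's actual formula is not given in the abstract. The sketch below only illustrates the general idea of joint computation: a weighted combination of gene-overlap and phenotype-overlap similarity, so that a disease with few known genes can still be scored through its phenotypes. The annotation sets, identifiers and mixing weight are illustrative assumptions.

```python
# Minimal sketch of combining gene- and phenotype-based disease similarity;
# not GPSim's actual formula. Sets and weights below are toy assumptions.
def jaccard(a, b):
    """Jaccard overlap of two annotation sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def disease_similarity(genes1, genes2, phenos1, phenos2, alpha=0.5):
    """Weighted combination of gene-overlap and phenotype-overlap similarity.
    When a disease has few known genes, the phenotype term keeps the score informative."""
    return alpha * jaccard(genes1, genes2) + (1 - alpha) * jaccard(phenos1, phenos2)

# Toy example: two diseases with little gene overlap but shared phenotypes.
genes_a, genes_b = {"BRCA1", "TP53"}, {"TP53", "EGFR", "KRAS"}
phenos_a = {"HP:0002664", "HP:0001945"}                 # illustrative HPO term IDs
phenos_b = {"HP:0002664", "HP:0001945", "HP:0004325"}

print(disease_similarity(genes_a, genes_b, phenos_a, phenos_b))
```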

    Gene2DisCo : gene to disease using disease commonalities

    OBJECTIVE: Finding the human genes that co-cause complex diseases, also known as "disease-genes", is one of the emerging and challenging tasks in biomedicine. This process, termed gene prioritization (GP), is characterized by a scarcity of known disease-genes for most diseases, and by a vast amount of heterogeneous data, usually encoded into networks describing different types of functional relationships between genes. In addition, different diseases may share common profiles (e.g. genetic or therapeutic profiles), and exploiting disease commonalities may significantly enhance the performance of GP methods. This work aims to provide a systematic comparison of several disease similarity measures, and to embed disease similarities and heterogeneous data into a flexible framework for gene prioritization that specifically handles the lack of known disease-genes. METHODS: We present a novel network-based method, Gene2DisCo, based on generalized linear models (GLMs), which effectively prioritizes genes by exploiting data on disease-genes, gene interaction networks and disease similarities. The scarcity of disease-genes is addressed by applying an efficient negative selection procedure together with imbalance-aware GLMs. Gene2DisCo is a flexible framework, in the sense that it does not depend on specific types of data or on specific disease ontologies. RESULTS: On a benchmark dataset composed of nine human networks and 708 Medical Subject Headings (MeSH) diseases, Gene2DisCo largely outperformed the best benchmark algorithm, kernelized score functions, in terms of both area under the ROC curve (0.94 against 0.86) and precision at given recall levels (for recall levels from 0.1 to 1 in steps of 0.1). Furthermore, we enriched and extended the benchmark data to the whole human genome and provide the top-ranked unannotated candidate genes even for MeSH disease terms without known annotations.
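    To illustrate the imbalance-aware GLM idea in general terms (not Gene2DisCo's actual pipeline), the sketch below fits a class-weighted logistic regression on simulated network-derived gene features with very few positives and ranks the unlabeled genes; the features, labels and the stand-in for negative selection are all assumptions.

```python
# Minimal sketch of class-weighted logistic regression for gene prioritization
# under strong label imbalance; data are simulated, not Gene2DisCo's benchmark.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_genes, n_features = 1000, 10         # e.g. one feature per gene network (assumed)
X = rng.normal(size=(n_genes, n_features))

# Very few known disease-genes (positives), as is typical in gene prioritization.
positives = rng.choice(n_genes, 15, replace=False)
y = np.zeros(n_genes, dtype=int)
y[positives] = 1
X[positives] += 1.0                    # give positives a detectable signal (simulated)

# Crude stand-in for negative selection: treat all non-positives as negatives here.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)

scores = model.predict_proba(X)[:, 1]
unlabeled = np.setdiff1d(np.arange(n_genes), positives)
top_candidates = unlabeled[np.argsort(-scores[unlabeled])][:10]
print("top-ranked candidate genes:", top_candidates)
```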