430 research outputs found

    Ontology-driven indexing of public datasets for translational bioinformatics

    The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT

    Annotation and query of tissue microarray data using the NCI Thesaurus

    Background: The Stanford Tissue Microarray Database (TMAD) is a repository of data serving a consortium of pathologists and biomedical researchers. The tissue samples in TMAD are annotated with multiple free-text fields, specifying the pathological diagnoses for each sample. These text annotations are not structured according to any ontology, making future integration of this resource with other biological and clinical data difficult. Results: We developed methods to map these annotations to the NCI thesaurus. Using the NCI-T we can effectively represent annotations for about 86% of the samples. We demonstrate how this mapping enables ontology driven integration and querying of tissue microarray data. We have deployed the mapping and ontology driven querying tools at the TMAD site for general use. Conclusion: We have demonstrated that we can effectively map the diagnosis-related terms describing a sample in TMAD to the NCI-T. The NCI thesaurus terms have a wide coverage and provide terms for about 86% of the samples. In our opinion the NCI thesaurus can facilitate integration of this resource with other biological data.

    Finding disease similarity based on implicit semantic similarity

    Genomics has contributed to a growing collection of gene–function and gene–disease annotations that can be exploited by informatics to study similarity between diseases. This can yield insight into disease etiology, reveal common pathophysiology and/or suggest treatment that can be appropriated from one disease to another. Estimating disease similarity solely on the basis of shared genes can be misleading as variable combinations of genes may be associated with similar diseases, especially for complex diseases. This deficiency can be potentially overcome by looking for common biological processes rather than only explicit gene matches between diseases. The use of semantic similarity between biological processes to estimate disease similarity could enhance the identification and characterization of disease similarity. We present functions to measure similarity between terms in an ontology, and between entities annotated with terms drawn from the ontology, based on both co-occurrence and information content. The similarity measure is shown to outperform other measures used to detect similarity. A manually curated dataset with known disease similarities was used as a benchmark to compare the estimation of disease similarity based on gene-based and Gene Ontology (GO) process-based comparisons. The detection of disease similarity based on semantic similarity between GO Processes (Recall=55%, Precision=60%) performed better than using exact matches between GO Processes (Recall=29%, Precision=58%) or gene overlap (Recall=88% and Precision=16%). The GO-Process based disease similarity scores on an external test set show statistically significant Pearson correlation (0.73) with numeric scores provided by medical residents. GO-Processes associated with similar diseases were found to be significantly regulated in gene expression microarray datasets of related diseases
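
As a rough illustration of information-content-based similarity between ontology terms, the sketch below implements a Resnik-style measure on a toy hierarchy. This is a stand-in technique and not the measure defined in the paper, which also uses co-occurrence. The term names, the `PARENTS` hierarchy, and the annotation counts are all hypothetical.

```python
import math

# Toy is_a hierarchy (child -> parents). Hypothetical terms, not real GO IDs.
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "signaling": ["process"],
    "lipid metabolism": ["metabolism"],
    "glucose metabolism": ["metabolism"],
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {
    "process": 0,
    "metabolism": 2,
    "signaling": 8,
    "lipid metabolism": 5,
    "glucose metabolism": 5,
}


def ancestors(term):
    """Return the term itself plus every term reachable through PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = log(total / covered), where covered counts annotations
    made to t or to any of its descendants."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(c for t, c in DIRECT_COUNTS.items() if term in ancestors(t))
    return math.log(total / covered)


def resnik_similarity(a, b):
    """Similarity = IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)
```

In this toy data, two terms that share only the root score zero, while siblings under a more specific parent score higher, which is the intuition behind comparing diseases through related processes rather than exact matches.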

    Advances in Gene Ontology Utilization Improve Statistical Power of Annotation Enrichment

    Gene-annotation enrichment is a common method for utilizing ontology-based annotations in gene and gene-product centric knowledgebases. Effective utilization of these annotations requires inferring semantic linkages by tracing paths through edges in the ontological graph, referred to as relations. However, some relations are semantically problematic with respect to scope, necessitating their omission or modification lest erroneous term mappings occur. To address these issues, we created the Gene Ontology Categorization Suite, or GOcats, a novel tool that organizes the Gene Ontology into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. Here, we demonstrate the improvements in annotation enrichment by re-interpreting edges that would otherwise be omitted by traditional ancestor path-tracing methods. Specifically, we show that GOcats' unique handling of relations improves enrichment over conventional methods in the analysis of two different gene-expression datasets: a breast cancer microarray dataset and several horse cartilage development RNAseq datasets. With the breast cancer microarray dataset, we observed significant improvement (one-sided binomial test p-value = 1.86E-25) in 182 of 217 significantly enriched GO terms identified from the conventional path traversal method when GOcats' path traversal was used. We also found new significantly enriched terms using GOcats, whose biological relevancy has been experimentally demonstrated elsewhere. Likewise, on the horse RNAseq datasets, we observed a significant improvement in GO term enrichment when using GOcats' path traversal: one-sided binomial test p-values range from 1.32E-03 to 2.58E-44
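
For readers unfamiliar with the one-sided binomial test mentioned above, here is a minimal sketch of the exact upper-tail computation. The abstract does not state the null proportion; the value 0.5 (improvement and non-improvement equally likely) is an assumption made here, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p_null=0.5):
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(trials, p_null).

    p_null = 0.5 is an assumed null hypothesis, not taken from the paper.
    """
    return sum(
        comb(trials, i) * p_null**i * (1 - p_null) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# 182 improved terms out of 217, as reported in the abstract.
p_value = binomial_upper_tail(182, 217)
```

A library routine such as SciPy's `scipy.stats.binomtest` with `alternative="greater"` performs the same kind of test; the hand-written sum is shown only to make the calculation explicit.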

    Doctor of Philosophy

    Gene expression data repositories provide large and ever-increasing amounts of data for secondary use by translational informatics methods. For example, Gene Expression Omnibus (GEO) houses over 37,000 experiments with the goal of supporting further research. To use these published results in a larger meta-analysis, consolidation of the data is needed; however, the data are largely unstructured, thus hindering data integration efforts. Here, I propose the use of a novel pipeline, Ontology Based Data Integration (OBDI), which uses an ontological approach to combine the samples across multiple GEO experiments. The OBDI pipeline uses machine learning algorithms that permit researchers to consolidate and analyze data across GEO experiments. Here, I demonstrate how using an ontological approach to integrate samples across experiments can be used to explore the immune response at a molecular level. As part of this process, a Web Ontology Language (OWL) ontology was developed for each data platform used. OWL serves as a core component in successfully processing different sample types. Immunological experiments from GEO were consolidated to evaluate this methodology. The experiments included samples analyzed on expression arrays, BeadChips, and sequencing technologies. The integration of a complex biological system and the incorporation of different biological data types will validate the potential of OBDI. The nature of biological data is highly dimensional. OBDI incorporates tools and techniques that can handle the analysis of various biological data. The machine learning analysis performed within the OBDI pipeline successfully evaluated the newly annotated experiments and provides insights that can be further explored. The OBDI pipeline can help researchers annotate experiments using ontologies and analyze the annotated experiments.
    To successfully build the pipeline, ontologies served as the backbone of integrating samples from GEO Series records into machine learning experiments using ML-Flex. By using the OBDI pipeline, researchers can access the uncurated experiments from GEO (GEO Data Series) and annotate the data using the terms in the ontologies. This mechanism allows for the organization of data sets in relationship to new experiments independent of GEO's GDS curation process. The OBDI system allows ontologies to grow organically around a cluster of experiments. These experiments are then further analyzed in ML-Flex using machine learning algorithms. The curated experiments are analyzed in silico and the computational analyses are supported by the OBDI ontological system

    Comparison of automated literature based gene-disease association using gene set enrichment analysis

    Cancer is a leading cause of death in Australia: more than 43,000 people have been estimated to have died from cancer in 2010. However, the genetic causes of cancer remain elusive despite voluminous genetic data in the public domain. Our goal is to identify genes in order to understand the molecular mechanisms of cancer so that diagnosis, prognosis and treatment can be optimized. Microarrays measure gene expression levels in disease tissue relative to normal tissue. However, microarray data are noisy and computational methods are required to associate aberrant gene expression with disease. Subramanian et al. (2005) developed an approach called Gene Set Enrichment Analysis (GSEA) that annotates microarray data with functional terms from a background ontology. The enriched gene sets have been shown to improve the quality of microarray annotation compared to single gene annotation. Nevertheless, GSEA falls short when used to predict disease-gene associations. We hypothesized that GSEA's shortfall is caused by limited knowledge embedded in its ontology. Thus we have proposed a novel method, which automatically constructs ontologies for use in GSEA directly from the biomedical literature and then associates genes with diseases. This thesis tests this hypothesis. My results show that using knowledge derived automatically from biomedical literature outperforms GSEA's default catalogues and achieves high area under the receiver operating characteristic curve (AUC) scores when tested on breast and colorectal cancer samples. The results indicate that the automated literature-based approach is a promising method for discovering novel gene-disease associations. In conclusion, I have shown that literature-based generated catalogues are accurate and viable for prediction of gene-disease associations

    Prioritizing human cancer microRNAs based on genes' functional consistency between microRNA and cancer

    The identification of human cancer-related microRNAs (miRNAs) is important for cancer biology research. Although several identification methods have achieved remarkable success, they have overlooked the functional information associated with miRNAs. We present a computational framework that can be used to prioritize human cancer miRNAs by measuring the association between cancer and miRNAs based on the functional consistency score (FCS) of the miRNA target genes and the cancer-related genes. This approach proved successful in identifying the validated cancer miRNAs for 11 common human cancers with area under ROC curve (AUC) ranging from 71.15% to 96.36%. The FCS method had a significant advantage over miRNA differential expression analysis when identifying cancer-related miRNAs with a fine regulatory mechanism, such as miR-27a in colorectal cancer. Furthermore, a case study examining thyroid cancer showed that the FCS method can uncover novel cancer-related miRNAs such as miR-27a/b, which were shown to be significantly upregulated in thyroid cancer samples by qRT-PCR analysis. Our method is available through a web-based server, CMP (cancer miRNA prioritization), which is freely accessible at http://bioinfo.hrbmu.edu.cn/CMP. This time- and cost-effective computational framework can be a valuable complement to experimental studies and can assist with future studies of miRNA involvement in the pathogenesis of cancers
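
The AUC figures quoted above summarize how well a ranking separates known positives from negatives. The sketch below computes AUC from its pairwise definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The function name and the example scores are illustrative assumptions and are unrelated to the FCS values in the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: prioritization scores (higher = more likely positive).
    labels: 1 for a validated positive, 0 for a negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive correctly ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four candidates, two of them validated.
example_auc = auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

The quadratic loop is fine for small candidate lists; for large ones a rank-based formulation or an existing library routine would be the more practical choice.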

    Post-transcriptional knowledge in pathway analysis increases the accuracy of phenotypes classification

    Motivation: Prediction of phenotypes from high-dimensional data is a crucial task in precision biology and medicine. Many technologies employ genomic biomarkers to characterize phenotypes. However, such elements are not sufficient to explain the underlying biology. To improve this, pathway analysis techniques have been proposed. Nevertheless, such methods have shown a lack of accuracy in phenotype classification. Results: Here we propose a novel methodology called MITHrIL (Mirna enrIched paTHway Impact anaLysis) for the analysis of signaling pathways, which builds on the work of Tarca et al. (2009). MITHrIL extends pathways by adding missing regulatory elements, such as microRNAs, and their interactions with genes. The method takes as input the expression values of genes and/or microRNAs and returns a list of pathways sorted according to their deregulation degree, together with the corresponding statistical significance (p-values). Our analysis shows that MITHrIL outperforms its competitors even in the worst case. In addition, our method is able to correctly classify sets of tumor samples drawn from TCGA. Availability: MITHrIL is freely available at the following URL: http://alpha.dmi.unict.it/mithril