151 research outputs found

    Accurate and Reliable Cancer Classification Based on Pathway-Markers and Subnetwork-Markers

    Finding reliable gene markers for accurate disease classification is very challenging for a number of reasons, including the small sample size of typical clinical data, high noise in gene expression measurements, and the heterogeneity across patients. In fact, gene markers identified in independent studies often do not coincide with each other, suggesting that many of the predicted markers may have no biological significance and may simply be artifacts of the analyzed dataset. To find more reliable and reproducible diagnostic markers, several studies proposed to analyze the gene expression data at the level of groups of functionally related genes, such as pathways. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes and use the pathway activities for classification. One practical problem of the pathway-based approach is the limited coverage of genes by currently known pathways. As a result, potentially important genes that play critical roles in cancer development may be excluded. In this thesis, we first propose a probabilistic model to infer pathway/subnetwork activities. We then develop a novel method for identifying reliable subnetwork markers in a human protein-protein interaction (PPI) network based on probabilistic inference of subnetwork activities. We tested the proposed methods on two independent breast cancer datasets. The proposed method can efficiently find reliable subnetwork markers that outperform the gene-based and pathway-based markers in terms of discriminative power, reproducibility and classification performance. The identified subnetwork markers are highly enriched in common GO terms, and they can more accurately classify breast cancer metastasis compared to markers found by a previous method.
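
The pathway-summarization step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simplest possible summary, the mean of per-gene z-scores for each sample, which is not the probabilistic model the thesis proposes. The function names (`zscore_rows`, `pathway_activity`) and the toy gene and pathway identifiers are hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """Standardize each gene's expression values across samples."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # A gene with no variation carries no signal; map it to zeros.
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out


def pathway_activity(expr, pathways):
    """Return one activity value per sample for each pathway.

    expr:     dict gene -> list of expression values (one per sample)
    pathways: dict pathway name -> list of member genes
    Assumption: activity is the mean z-score of the measured member genes.
    """
    z = zscore_rows(expr)
    n_samples = len(next(iter(expr.values())))
    activity = {}
    for name, genes in pathways.items():
        members = [z[g] for g in genes if g in z]
        if not members:
            # No member gene was measured, so the pathway is not covered.
            continue
        activity[name] = [
            mean(m[i] for m in members) for i in range(n_samples)
        ]
    return activity


# Toy data: three samples, three genes, hypothetical pathway definitions.
toy_expr = {"G1": [1, 2, 3], "G2": [2, 4, 6], "G3": [5, 5, 5]}
toy_pathways = {"P1": ["G1", "G2"], "P2": ["G3", "GX"], "P3": ["GX"]}
toy_activity = pathway_activity(toy_expr, toy_pathways)
```

The resulting per-sample activities could then serve as classifier features in place of individual gene values. Pathways with no measured member genes are dropped here, which mirrors the coverage limitation the abstract describes.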

    Identifying noncoding risk variants using disease-relevant gene regulatory networks.

    Identifying noncoding risk variants remains a challenging task. Because noncoding variants exert their effects in the context of a gene regulatory network (GRN), we hypothesize that explicit use of disease-relevant GRNs can significantly improve the inference accuracy of noncoding risk variants. We describe Annotation of Regulatory Variants using Integrated Networks (ARVIN), a general computational framework for predicting causal noncoding variants. It employs a set of novel regulatory network-based features, combined with sequence-based features, to infer noncoding risk variants. Using known causal variants in gene promoters and enhancers in a number of diseases, we show that ARVIN outperforms state-of-the-art methods that use sequence-based features alone. Additional experimental validation using a reporter assay further demonstrates the accuracy of ARVIN. Application of ARVIN to seven autoimmune diseases provides a holistic view of the gene subnetwork perturbed by the combinatorial action of the entire set of risk noncoding mutations. Nat Commun 2018 Feb 16; 9(1):702

    Computational approaches to find transcriptomic and epigenomic signatures of latent TB in HIV patients

    HIV infection promotes the progression of latent Mtb infection to active disease, and the primary diagnostic challenge is the development of efficient and sensitive methods to detect latent TB in HIV infected individuals. Previous studies have identified transcriptional signatures for active TB along with signatures predicting the risk of active TB disease in latent TB infected individuals or those with other diseases. Existing studies have also identified characteristic genes for active TB in HIV infected patients. However, no studies have identified predictive transcriptional signatures that discriminate latent TB from active TB disease in HIV positive persons, or epigenetic mechanisms associated with latent TB/HIV coinfection. The aim of this study was to develop a computational pipeline using statistical modelling and machine learning (ML) methods to identify a transcriptomic signature associated with latent TB in HIV positive patients and to identify candidate epigenetic modifications for future studies. A novel pipeline that leverages statistical differential expression analyses (OPLS-DA) and supervised ML and feature selection methods was applied to an existing transcriptomic dataset (NCBI GEO repository accession number GSE37250), and the outcomes of the two methodologies were integrated to define a gene signature characterising the progression of latent to active TB in HIV infected patients. Enrichment analysis was performed on the transcriptomic panel of genes to predict candidate epigenetic marks in latent TB/HIV coinfection. An 11-gene minimal signature was identified whose expression levels discriminate between latent TB and active TB in HIV positive patients. A broader analysis of DEGs identified by the ML and OPLS-DA revealed enrichment of pathways related to T- and B-cell receptor signalling, metabolic processes, insulin signalling, endocrine resistance and ATP-binding. Candidate epigenetic alterations associated with latent TB in the HIV positive cohort were identified using transcription factor (TF), histone modification (HM) and miRNA enrichment analyses. This novel integrative approach to identifying a discriminative latent TB gene signature provided new insights into the response mechanism of HIV co-infection with Mtb, and pathways that merit further investigation were identified. The genes of interest identified may provide novel diagnostic and therapeutic targets for latent TB in patients who are HIV positive. M.Sc. (Biochemistry)

    Identification and Functional Annotation of Alternatively Spliced Isoforms

    Alternative splicing is a key mechanism for increasing the complexity of the transcriptome and proteome in eukaryotic cells. A large portion of multi-exon genes in humans undergo alternative splicing, and this can have significant functional consequences, as the proteins translated from alternatively spliced mRNA might have different amino acid sequences and structures. The study of alternative splicing events has been accelerated by next-generation sequencing technology. However, reconstruction of transcripts from short-read RNA sequencing is not sufficiently accurate. Recent progress in single-molecule long-read sequencing has provided researchers with alternative ways to help solve this problem. With the help of both short and long RNA sequencing technologies, tens of thousands of splice isoforms have been catalogued in humans and other species, but relatively few of the protein products of splice isoforms have been characterized functionally, structurally and biochemically. The scope of this dissertation includes using short and long RNA sequencing reads together for transcript reconstruction, and using high-throughput RNA-sequencing data and gene-level gene ontology functional annotations to predict functions for alternatively spliced isoforms in mouse and human. In the first chapter, I introduce alternative splicing and discuss the existing studies where next-generation sequencing is used for transcript identification. Then, I define the isoform function prediction problem and explain how it differs from the better-known gene function prediction problem. In the second chapter of this dissertation, I describe our study in which the overall transcriptome of the kidney is studied using both long reads from the PacBio platform and RNA-seq short reads from the Illumina platform. 
We used short reads to validate full-length transcripts found by long PacBio reads, and generated two high-quality sets of transcript isoforms that are expressed in glomerular and tubulointerstitial compartments. In the third chapter, I describe our generic framework, in which we implemented and evaluated several related algorithms for isoform function prediction for mouse isoforms. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm is the first effort to predict and differentiate isoform functions through large-scale genomic data integration. In the fourth chapter, I present the extension of the isoform function prediction study to the protein coding isoforms in human. We used a similar multiple instance learning (MIL)-based approach for predicting the function of protein coding splice variants in human. We evaluated our predictions using literature evidence for the ADAM15, LMNA/C, and DMXL2 genes. In the fifth and final chapter, I summarize the previous chapters and outline future directions for alternatively spliced isoform reconstruction and function prediction studies. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144017/1/ridvan_1.pd

    GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning

    Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources. Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.

    Improving the understanding of cancer in a descriptive way: An emerging pattern mining-based approach

    This paper presents an approach based on emerging pattern mining to analyse cancer through genomic data. Unlike existing approaches, which mainly focus on prediction, the proposal aims to improve the understanding of cancer descriptively, without requiring any prior knowledge or hypothesis to be validated. Additionally, it allows high-order relationships to be considered, so that not only essential genes related to the disease are taken into account, but also the combined effect of various secondary genes that can influence different pathways directly or indirectly related to the disease. The prime hypothesis is that splitting genomic cancer data into two subsets, that is, cases and controls, will allow us to determine which genes, and their expressions, are associated with different cancer types. The possibilities of the proposal are demonstrated by analyzing RNA-Seq data for six different types of cancer: breast, colon, lung, thyroid, prostate, and kidney. Some of the extracted insights were already described in the related literature as good cancer bio-markers, while others have not been described yet, mainly because existing techniques are biased by prior knowledge provided by biological databases.
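
The case/control split described in this abstract is the setting for the standard growth-rate measure from the emerging pattern literature: a pattern's support among cases divided by its support among controls. The sketch below shows only that measure, assuming each sample has already been turned into a set of discretized items such as `"GENE_A:high"`. The abstract does not specify the discretization or the mining algorithm, so the item labels and function names here are hypothetical.

```python
def support(pattern, transactions):
    """Fraction of transactions (samples) that contain every item of pattern."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)


def growth_rate(pattern, cases, controls):
    """Support in cases divided by support in controls.

    Returns infinity for a pattern seen only in cases (a so-called
    jumping emerging pattern) and 0.0 for a pattern seen in neither group.
    """
    s_case = support(pattern, cases)
    s_ctrl = support(pattern, controls)
    if s_ctrl == 0:
        return float("inf") if s_case > 0 else 0.0
    return s_case / s_ctrl


# Toy data: each sample is a set of hypothetical discretized expression items.
toy_cases = [
    {"GENE_A:high", "GENE_B:low"},
    {"GENE_A:high", "GENE_B:low"},
    {"GENE_A:high"},
    {"GENE_C:high"},
]
toy_controls = [
    {"GENE_A:high"},
    {"GENE_B:low"},
    {"GENE_C:high"},
    {"GENE_A:high", "GENE_B:low"},
]
toy_pattern = frozenset({"GENE_A:high", "GENE_B:low"})
toy_rate = growth_rate(toy_pattern, toy_cases, toy_controls)
```

A pattern combining two genes can have a higher growth rate than either gene alone, which is one way to read the abstract's point about high-order relationships among secondary genes.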

    MACHINE LEARNING APPROACHES FOR BIOMARKER IDENTIFICATION AND SUBGROUP DISCOVERY FOR POST-TRAUMATIC STRESS DISORDER

    Post-traumatic stress disorder (PTSD) is a psychiatric disorder caused by environmental and genetic factors resulting from alterations in genetic variation, epigenetic changes and neuroimaging characteristics. There is a pressing need to identify reliable molecular and physiological biomarkers for accurate diagnosis, prognosis, and treatment, as well as to deepen the understanding of PTSD pathophysiology. Machine learning methods are widely used to infer patterns from biological data, identify biomarkers, and make predictions. The objective of this research is to apply machine learning methods for the accurate classification of human diseases from genome-scale datasets, focusing primarily on PTSD. The DoD-funded Systems Biology of PTSD Consortium has recruited combat veterans with and without PTSD for measurement of molecular and physiological data from blood or urine samples, with the goal of identifying accurate and specific PTSD biomarkers. As a member of the Consortium with access to these PTSD multiple omics datasets, we first completed a project titled Clinical Subgroup-Specific PTSD Classification and Biomarker Discovery. We applied machine learning approaches to these data to build classification models consisting of molecular and clinical features to predict PTSD status. We also identified candidate biomarkers for diagnosis, which improves our understanding of PTSD pathogenesis. In a second project, entitled Multi-Omic PTSD Subgroup Identification and Clinical Characterization, we applied methods for integrating multiple omics datasets to investigate the complex, multivariate nature of the biological systems underlying PTSD. Using two different machine learning approaches on 82 PTSD positive samples, we identified two PTSD subgroups as the optimal grouping, and we found that the subgroups exhibited different remitting behavior, as inferred from subjects recalled at a later time point. The results from our association, differential expression, and classification analyses demonstrated the distinct clinical and molecular features characterizing these subgroups. Taken together, our work has advanced our understanding of PTSD biomarkers and subgroups through the use of machine learning approaches. Results from our work should strongly contribute to the precise diagnosis and eventual treatment of PTSD, as well as other diseases. Future work will involve continuing to leverage these results to enable precision medicine for PTSD.

    Application of miRNA-seq in neuropsychiatry: A methodological perspective

    MiRNAs are emerging as key molecules to study neuropsychiatric diseases. However, despite the large number of methodologies and software for miRNA-seq analyses, there is little supporting literature for researchers in this area. This review focuses on evaluating how miRNA-seq has been used to study neuropsychiatric diseases to date, analyzing both the main findings discovered and the bioinformatics workflows and tools used from a methodological perspective. The objective of this review is two-fold: first, to evaluate current miRNA-seq procedures used in neuropsychiatry; and second, to offer comprehensive information that can serve as a guide to new researchers in bioinformatics. After conducting a systematic search (from 2016 to June 30, 2020) of articles using miRNA-seq in neuropsychiatry, we have seen that it has already been used for different types of studies in three main categories: diagnosis, prognosis, and mechanism. We carefully analyzed the bioinformatics workflows of each study, observing a high degree of variability with respect to the tools and methods used and several methodological complexities that are identified and discussed in this review. Instituto de Salud Carlos III | Ref. PI18/01311; Ministerio de Economía y Competitividad | Ref. RYC2014-15246; Xunta de Galicia | Ref. ED431C2018/55-GR