
    New Statistical Learning Approaches with Applications to RNA-seq Data

    This dissertation examines statistical learning problems in both the supervised and unsupervised settings. The dissertation is composed of three major parts. In the first two, we address the important question of significance of clustering, and in the third, we describe a novel framework for unifying hard and soft classification through a spectrum of binary learning problems. In the unsupervised task of clustering, determining whether the identified clusters represent important underlying structure, or are artifacts of natural sampling variation, has been a critical and challenging question. In this dissertation, we introduce two new methods for addressing this question using statistical significance. In the first part of the dissertation, we describe SigFuge, an approach for identifying genomic loci exhibiting differential transcription patterns across many RNA-seq samples. In the second part of this dissertation, we describe statistical Significance of Hierarchical Clustering (SHC), a Monte Carlo based approach for testing significance in hierarchical clustering, and demonstrate the power of the method to identify significant clustering using two cancer gene expression datasets. Both methods were implemented and made available as open source packages in R. In the final part of this dissertation, we propose a spectrum of supervised learning problems which spans the hard and soft classification tasks based on fitting multiple decision rules to a dataset. By doing so, we reveal a novel collection of binary supervised learning problems. We study the problems using the framework of large-margin classification and a class of piecewise linear surrogate losses, for which we derive statistical properties. We evaluate our approach using simulations and a magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Doctor of Philosophy.

    SigFuge: Single gene clustering of RNA-seq reveals differential isoform usage among cancer samples

    High-throughput sequencing technologies, including RNA-seq, have made it possible to move beyond gene expression analysis to study transcriptional events including alternative splicing and gene fusions. Furthermore, recent studies in cancer have suggested the importance of identifying transcriptionally altered loci as biomarkers for improved prognosis and therapy. While many statistical methods have been proposed for identifying novel transcriptional events with RNA-seq, nearly all rely on contrasting known classes of samples, such as tumor and normal. Few tools exist for the unsupervised discovery of such events without class labels. In this paper, we present SigFuge for identifying genomic loci exhibiting differential transcription patterns across many RNA-seq samples. SigFuge combines clustering with hypothesis testing to identify genes exhibiting alternative splicing, or differences in isoform expression. We apply SigFuge to RNA-seq cohorts of 177 lung and 279 head and neck squamous cell carcinoma samples from the Cancer Genome Atlas, and identify several cases of differential isoform usage including CDKN2A, a tumor suppressor gene known to be inactivated in a majority of lung squamous cell tumors. By not restricting attention to known sample stratifications, SigFuge offers a novel approach to unsupervised screening of genetic loci across RNA-seq cohorts. SigFuge is available as an R package through Bioconductor.

    Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma

    We report a comprehensive molecular characterization of pheochromocytomas and paragangliomas (PCCs/PGLs), a rare tumor type. Multi-platform integration revealed that PCCs/PGLs are driven by diverse alterations affecting multiple genes and pathways. Pathogenic germline mutations occurred in eight PCC/PGL susceptibility genes. We identified CSDE1 as a somatically mutated driver gene, complementing four known drivers (HRAS, RET, EPAS1, and NF1). We also discovered fusion genes in PCCs/PGLs, involving MAML3, BRAF, NGFR, and NF1. Integrated analysis classified PCCs/PGLs into four molecularly defined groups: a kinase signaling subtype, a pseudohypoxia subtype, a Wnt-altered subtype, driven by MAML3 and CSDE1, and a cortical admixture subtype. Correlates of metastatic PCCs/PGLs included the MAML3 fusion gene. This integrated molecular characterization provides a comprehensive foundation for developing PCC/PGL precision medicine.

    Reproducible and replicable comparisons of methods controlling false discoveries in computational biology.

    With the advancement of high-throughput technologies, data and computing have become key components of scientific discovery in biology. New computational methods to analyze genomic data are constantly being developed, with several methods often addressing the same biological question. As a result, researchers are now faced with the challenge of deciding between a plethora of tools, each leading to slightly different answers. For several common analyses in computational biology, benchmark comparisons have been published to help users pick an appropriate tool from a subset of alternatives. Despite the popularity of these comparisons, the implementation is often ad hoc, with little consistency across studies. To address this problem, we developed SummarizedBenchmark, an R package and framework for organizing and structuring benchmark comparisons. SummarizedBenchmark defines a general grammar for benchmarking and allows for easier setup and execution of benchmark comparisons, while improving the reproducibility and replicability of such comparisons. Using this framework, we perform a systematic benchmark of several recently developed false discovery rate (FDR)-controlling methods for multiple testing correction. These modern methods have the potential to improve power in biological studies by leveraging additional pieces of information available in the data ("informative covariates") to prioritize, weight, and group hypotheses. We investigate the advantages and limitations of these methods against classical FDR-controlling methods across six biological case studies and various simulation settings. We provide a summary of our findings as a practical guide to aid users in the choice of methods to correct for false discoveries in future studies. Non UBC. Unreviewed. Author affiliation: Dana-Farber Cancer Institute. Postdoctoral.
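
For readers unfamiliar with what a classical FDR-controlling method looks like, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, a standard example of that class. The abstract does not say which classical methods were benchmarked, so this choice is an assumption made for illustration. The function name `benjamini_hochberg` and the sample p-values are hypothetical, and the code is unrelated to the SummarizedBenchmark package.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per input p-value: True if that hypothesis is rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    not taken from the listed paper or from SummarizedBenchmark.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k (step-up rule).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, alpha=0.05))
```

The procedure uses only the p-values themselves. The covariate-aware methods described in the abstract differ in that they also use side information to prioritize, weight, or group hypotheses.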