
    Statistical Methods for Integrative Analysis, Subgroup Identification, and Variable Selection Using Cancer Genomic Data

    In recent years, comprehensive cancer genomics platforms such as The Cancer Genome Atlas (TCGA) have provided access to an enormous amount of high-throughput genomic data for each patient, including gene expression, DNA copy number alteration, DNA methylation, and somatic mutation. Most existing analysis approaches have focused only on gene-level analysis and have suffered from limited interpretability and low reproducibility of findings. Additionally, with the increasing availability of modern compositional data, including immune cellular fraction data and high-dimensional zero-inflated microbiome data, variable selection techniques for compositional data have become of great interest because they allow inference of the key immune cell types (immunology data) and key microbial species (microbiome data) associated with the development and progression of various diseases. In the first dissertation aim, we addressed these challenges by developing a Bayesian sparse latent factor model for pathway-guided integrative genomic data analysis. Specifically, we constructed a unified framework to simultaneously identify cancer patient subgroups (clustering) and key molecular markers (variable selection) based on the joint analysis of continuous, binary, and count data. In addition, we applied Polya-Gamma mixtures of normals for binary and count data to enable exact and fully automatic posterior sampling. Moreover, pathway information was used to improve accuracy and robustness in the identification of cancer patient subgroups and key molecular features. In the second dissertation aim, we developed the R package InGRiD, a comprehensive software package for pathway-guided integrative genomic data analysis. We further implemented the statistical model developed in Aim 1 and provided it as part of this software. The third dissertation aim exploited variable selection in compositional data analysis, with application to immunology data and microbiome data.
Specifically, we identified key immune cell types by applying a stepwise pairwise log-ratio procedure to the immune cellular fraction data, and selected key species in the microbiome data using a zero-inflated Wilcoxon rank sum test. These approaches account for key characteristics specific to these data types, such as compositionality (i.e., the sum-to-one constraint), zero inflation, and high dimensionality, among others. The proposed methods were developed and evaluated on: 1) large-scale, high-dimensional, multi-modal datasets from the TCGA database, including gene expression, DNA copy number alteration, and somatic mutation data (Aim 1); 2) cellular fraction data derived from the Colorectal Adenocarcinoma TCGA Pan-Cancer study (Aim 3); and 3) high-dimensional zero-inflated microbiome data from studies of colorectal cancer (Aim 3).
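To make the log-ratio idea concrete, the following is a minimal Python sketch of the quantity that pairwise log-ratio methods operate on. It is not the stepwise selection procedure from the dissertation; the function name `pairwise_log_ratios`, the `pseudocount` parameter, and the example fractions are assumptions introduced here purely for illustration.

```python
import math
from itertools import combinations


def pairwise_log_ratios(composition, pseudocount=1e-6):
    """Return {(i, j): log(x_i / x_j)} for every pair i < j of parts.

    Assumption: a small pseudocount is added to every part so the log is
    defined when a part is zero. With pseudocount=0.0 the input must not
    contain zeros.
    """
    shifted = [x + pseudocount for x in composition]
    return {
        (i, j): math.log(shifted[i] / shifted[j])
        for i, j in combinations(range(len(shifted)), 2)
    }


# Hypothetical cell fractions for one sample (they sum to one).
fractions = [0.5, 0.25, 0.25]
ratios = pairwise_log_ratios(fractions, pseudocount=0.0)

# The same sample on a different total scale, e.g. raw counts.
rescaled = pairwise_log_ratios([5.0, 2.5, 2.5], pseudocount=0.0)
```

Because each log-ratio depends only on the ratio between two parts, rescaling the whole sample leaves it unchanged, which is why log-ratios are a natural representation for sum-to-one data. Adding a nonzero pseudocount makes this invariance only approximate.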

    Replication in Genome-Wide Association Studies

    Replication helps ensure that a genotype-phenotype association observed in a genome-wide association (GWA) study represents a credible association and is not a chance finding or an artifact due to uncontrolled biases. We discuss prerequisites for exact replication, issues of heterogeneity, advantages and disadvantages of different methods of data synthesis across multiple studies, frequentist vs. Bayesian inference for replication, and challenges that arise from multi-team collaborations. While consistent replication can greatly improve the credibility of a genotype-phenotype association, it may not eliminate spurious associations due to biases shared by many studies. Conversely, lack of replication in well-powered follow-up studies usually invalidates the initially proposed association, although occasionally it may point to differences in linkage disequilibrium or effect modifiers across studies. Comment: Published at http://dx.doi.org/10.1214/09-STS290 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
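As one illustration of data synthesis across studies, the sketch below computes a fixed-effect, inverse-variance weighted pooled estimate together with Cochran's Q, a common heterogeneity statistic. This is a standard textbook approach chosen here as an example, not necessarily the method the paper recommends; the function name `fixed_effect_meta` and the effect sizes are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted pooled effect, its standard error,
    and Cochran's Q heterogeneity statistic."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Q sums the weighted squared deviations of each study from the pooled
    # effect; under homogeneity it is approximately chi-square distributed
    # with k - 1 degrees of freedom.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return pooled, pooled_se, q


# Hypothetical per-study log odds ratios and standard errors
# (illustrative numbers, not taken from any study).
betas = [0.20, 0.10, 0.15]
ses = [0.05, 0.05, 0.05]
pooled, pooled_se, q = fixed_effect_meta(betas, ses)
```

A large Q relative to its degrees of freedom suggests that the studies do not share a single effect, in which case a random-effects model or an investigation of effect modifiers may be more appropriate than the fixed-effect summary.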

    Identification of a biomarker panel for colorectal cancer diagnosis

    Background: Malignancies arising in the large bowel cause the second largest number of cancer deaths in the Western world. Despite progress made during recent decades, colorectal cancer remains one of the most frequent and deadly neoplasias in Western countries. Methods: A genomic study of human colorectal cancer was carried out on a total of 31 tumoral samples, corresponding to different stages of the disease, and 33 non-tumoral samples. The tumour samples were hybridised against a reference pool of non-tumoral samples using Agilent Human 1A 60-mer oligo microarrays, and the results were validated by qRT-PCR. In the subsequent bioinformatics analysis, gene networks were built by means of Bayesian classifiers, variable selection, and bootstrap resampling. The consensus among all the induced models produced a hierarchy of dependences and, thus, of variables. Results: After exhaustive pre-processing to ensure data quality (missing-value imputation, probe quality, data smoothing, and intraclass variability filtering), the final dataset comprised a total of 8,104 probes. Next, a supervised classification approach and data analysis were carried out to obtain the most relevant genes. Two of them are directly involved in cancer progression and, in particular, in colorectal cancer. Finally, a supervised classifier was induced to classify new unseen samples. Conclusions: We have developed a tentative model for the diagnosis of colorectal cancer based on a biomarker panel. Our results indicate that the gene profile described herein can discriminate between non-cancerous and cancerous samples with 94.45% accuracy using different supervised classifiers (AUC values ranging from 0.955 to 0.997).