
    Differential meta-analysis of RNA-seq data from multiple studies

    High-throughput sequencing is now regularly used for studies of the transcriptome (RNA-seq), particularly for comparisons among experimental conditions. For the time being, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential expression. As the cost of these experiments continues to decrease, it is likely that additional follow-up studies will be conducted to re-address the same biological question. We demonstrate how p-value combination techniques previously used for microarray meta-analyses can be used for the differential analysis of RNA-seq data from multiple related studies. These techniques are compared to a negative binomial generalized linear model (GLM) including a fixed study effect on simulated data and real data on human melanoma cell lines. The GLM with fixed study effect performed well for low inter-study variation and small numbers of studies, but was outperformed by the meta-analysis methods for moderate to large inter-study variability and larger numbers of studies. To conclude, the p-value combination techniques illustrated here are a valuable tool to perform differential meta-analyses of RNA-seq data by appropriately accounting for biological and technical variability within studies as well as additional study-specific effects. An R package, metaRNASeq, is available on R-Forge.
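As a rough illustration of what p-value combination means for a single gene, the sketch below implements two classical rules, Fisher's method and the inverse normal (Stouffer) method, using only the Python standard library. The abstract does not state which combination rules or weights the authors used, so the function names, the optional `weights` argument, and the example p-values are all assumptions made for illustration; this is not the metaRNASeq implementation.

```python
import math
from statistics import NormalDist


def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def inverse_normal_combine(pvalues, weights=None):
    """Inverse normal (Stouffer) combination of one-sided p-values.

    `weights` is a hypothetical per-study weight (for example, based on the
    number of replicates); equal weights are used when it is omitted.
    """
    if weights is None:
        weights = [1.0] * len(pvalues)
    norm = NormalDist()
    z = sum(w * norm.inv_cdf(1.0 - p) for w, p in zip(weights, pvalues))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(z)


if __name__ == "__main__":
    # Hypothetical per-study p-values for one gene (illustrative only).
    gene_pvalues = [0.04, 0.10, 0.02]
    print("Fisher:", fisher_combine(gene_pvalues))
    print("Inverse normal:", inverse_normal_combine(gene_pvalues))
```

In a real analysis the combination would be applied gene by gene to the p-values from each study's own differential test, followed by a multiple-testing correction across genes.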

    Integrative approaches for differential analysis of transcriptome data

    Department of Biological Sciences
    High-throughput sequencing technologies have produced a huge amount of omics data. Myriad computational methods have been developed to analyze such data efficiently and accurately. In particular, recently developed single-cell sequencing technologies produce highly sparse and noisy data, further necessitating the development of data analysis methods. As large amounts of omics data accumulate in public repositories, it has become common practice to collect multiple datasets with the same theme (e.g., disease) and integrate them to increase the power of analysis. Because the data from individual studies differ in size, technologies, experimenters, and many other environmental factors, they often exhibit systematic differences in distribution, which are called batch effects. Thus, how to handle batch effects is crucial in integrative omics data analysis. This dissertation investigates computational methods to identify genes differentially expressed between different biological conditions from transcriptome data and how to integrate the analyses across different samples (batches). In Chapter 2, the performance of 12 differential expression (DE) analysis methods for RNA sequencing (RNA-seq) data was compared. These methods include widely used R packages such as edgeR, DESeq2 and limma as well as their recent variants. The benchmark data include RNA spike-in data, simulated read counts, and real RNA-seq data. Extensive conditions such as the proportion of DE genes, sample sizes, presence of random outliers, and mean and dispersion estimates were tested for simulated data. We analyzed the impact of each factor on the overall performance of DE analysis and suggested suitable methods for each test condition. DESeq2, a robust version of edgeR, and voom with TMM normalization exhibited good overall performance. In Chapter 3, two novel meta-analysis methods that are capable of capturing "incomplete association" were proposed.
Incomplete association represents the coexistence of "associated" and "unassociated" statistics in the list of summary statistics obtained from different studies in integrative analysis. Meta-analysis integrates the summary statistics from different individual studies to increase the statistical power. We demonstrated that the power of conventional meta-analysis methods rapidly decreased as the number of unassociated statistics increased. The classical Fisher's method and the newly proposed weighted Fisher's method (wFisher) effectively detected these incomplete associations. Another method, dubbed ordmeta, employed the joint distribution of ordered p-values and also showed superior results in detecting incomplete associations. wFisher and ordmeta exclusively detected genes with high biological relevance from meta-analysis with prostate cancer gene expression data. Lastly, integrative DE analysis methods for single-cell RNA-seq (scRNA-seq) data were compared. In total, 41 computational pipelines that combine batch-effect correction methods, covariate modeling, and DE analysis methods were tested using simulation and real data. In particular, the single-cell RNA-seq data for seven patients with lung adenocarcinoma were analyzed. Remarkably, analysis of epithelial cells in scRNA-seq data outperformed the analysis of large-scale bulk RNA-seq data available from The Cancer Genome Atlas in detecting known lung cancer genes and prognostic genes. Furthermore, GSEA revealed distinct aspects of enriched pathways between epithelial cell and bulk RNA-seq data analyses.
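The effect of incomplete association can be seen with a small worked example: one strongly associated p-value is combined with an increasing number of unassociated ones. The sketch below uses Stouffer's unweighted z-score combination as a stand-in for a conventional method and the classical Fisher's method for comparison. The abstract does not name which conventional methods were tested, and the input values (a signal of 1e-6, nulls fixed at 0.5) are hypothetical; wFisher and ordmeta are not implemented here.

```python
import math
from statistics import NormalDist


def fisher_p(pvalues):
    """Classical Fisher combination (closed-form chi-square tail, df = 2k)."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer_p(pvalues):
    """Unweighted Stouffer z-score combination of one-sided p-values."""
    norm = NormalDist()
    z = sum(norm.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - norm.cdf(z)


def combine_with_nulls(signal_p, n_null, null_p=0.5):
    """Combine one associated p-value with n_null unassociated ones.

    null_p = 0.5 is an illustrative stand-in for a study with no signal.
    Returns (Fisher p-value, Stouffer p-value).
    """
    pvalues = [signal_p] + [null_p] * n_null
    return fisher_p(pvalues), stouffer_p(pvalues)


if __name__ == "__main__":
    for n_null in (0, 3, 9):
        f, s = combine_with_nulls(1e-6, n_null)
        print(f"unassociated studies: {n_null}  Fisher: {f:.3g}  Stouffer: {s:.3g}")
```

With these illustrative inputs and nine unassociated studies, the Fisher result stays below 0.01 while the Stouffer result rises above 0.05, which mirrors the qualitative pattern the abstract describes.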

    Differential expression and feature selection in the analysis of multiple omics studies

    With the rapid advances of high-throughput technologies in the past decades, various kinds of omics data have been generated by many labs and accumulated in the public domain. These studies have been designed for different biological purposes, including the identification of differentially expressed genes, the selection of predictive biomarkers, etc. Effective meta-analysis of omics data from multiple studies can improve on the statistical power, accuracy and reproducibility of a single study. This dissertation covers several methods for differential expression (Chapters 2 and 3) and feature selection (Chapter 4) in the analysis of multiple omics studies. In Chapter 2, we proposed a full Bayesian hierarchical model for RNA-seq meta-analysis by modeling count data, integrating information across genes and across studies, and modeling differential signals across studies via latent variables. A Dirichlet process mixture prior was further applied on the latent variables to provide categorization of detected biomarkers according to their differential expression patterns across studies. We used both simulations and a real application to data from multiple brain regions of HIV-1 transgenic rats to demonstrate improved sensitivity, accuracy and biological findings of our method. In Chapter 3, we extended the previous Bayesian model to jointly integrate transcriptomic data from two platforms: microarray and RNA-seq. In Chapter 4, we considered a general framework for variable screening with multiple omics studies and further proposed a novel two-step screening procedure for high-dimensional regression analysis in this framework. Compared to the one-step procedure and the rank-based sure independence screening procedure, our procedure greatly reduced false negative errors while keeping a low false positive rate.
Theoretically, we showed that our procedure possesses the sure screening property under weaker assumptions on signal strengths and allows the number of features to grow exponentially with the sample size. Public health significance: The proposed methods are useful in detecting important biomarkers that are either differentially expressed or predictive of clinical outcomes. This is essential for searching for potential drug targets and understanding the disease mechanism. Such findings in basic science can be translated into preventive medicine or potential treatment for disease to promote human health and improve the global healthcare system.

    The ROS wheel: refining ROS transcriptional footprints

    In the last decade, microarray studies have delivered extensive inventories of transcriptome-wide changes in messenger RNA levels provoked by various types of oxidative stress in Arabidopsis (Arabidopsis thaliana). Previous cross-study comparisons indicated how different types of reactive oxygen species (ROS) and their subcellular accumulation sites are able to reshape the transcriptome in specific manners. However, these analyses often employed simplistic statistical frameworks that are not compatible with large-scale analyses. Here, we reanalyzed a total of 79 Affymetrix ATH1 microarray studies of redox homeostasis perturbation experiments. To create a hierarchy in such a large number of transcriptomic data sets, all transcriptional profiles were clustered on the extent of overlap of their differentially expressed transcripts. Subsequently, meta-analysis determined a single magnitude of differential expression across studies and identified common transcriptional footprints per cluster. The resulting transcriptional footprints revealed the regulation of various metabolic pathways and gene families. The RESPIRATORY BURST OXIDASE HOMOLOG F-mediated respiratory burst had a major impact and was a converging point among several studies. Conversely, the timing of the oxidative stress response was a determining factor in shaping different transcriptome footprints. Our study emphasizes the need to interpret transcriptomic data sets in a systematic context, where initial, specific stress triggers can converge to common, aspecific transcriptional changes. We believe that these refined transcriptional footprints provide a valuable resource for assessing the involvement of ROS in biological processes in plants.

    Essential guidelines for computational method benchmarking

    In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.