Differential meta-analysis of RNA-seq data from multiple studies
High-throughput sequencing is now regularly used for studies of the
transcriptome (RNA-seq), particularly for comparisons among experimental
conditions. For the time being, a limited number of biological replicates are
typically considered in such experiments, leading to low detection power for
differential expression. As their cost continues to decrease, it is likely that
additional follow-up studies will be conducted to re-address the same
biological question. We demonstrate how p-value combination techniques
previously used for microarray meta-analyses can be used for the differential
analysis of RNA-seq data from multiple related studies. These techniques are
compared to a negative binomial generalized linear model (GLM) including a
fixed study effect on simulated data and real data on human melanoma cell
lines. The GLM with fixed study effect performed well for low inter-study
variation and small numbers of studies, but was outperformed by the
meta-analysis methods for moderate to large inter-study variability and larger
numbers of studies. To conclude, the p-value combination techniques illustrated
here are a valuable tool to perform differential meta-analyses of RNA-seq data
by appropriately accounting for biological and technical variability within
studies as well as additional study-specific effects. An R package metaRNASeq
is available on the R Forge
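As a rough illustration of what combining per-study p-values for a single gene can look like, the sketch below implements two classical techniques, Fisher's method and the inverse normal (Stouffer) method. It is not taken from the metaRNASeq package, and the abstract does not say which techniques or weights were used. The function names, the example p-values, and the use of replicate counts as weights are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(ln p) follows a chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # For even degrees of freedom (2k) the chi-square survival function has a
    # closed form: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def inverse_normal_combine(pvalues, weights=None):
    """Inverse normal (Stouffer) method for one-sided p-values, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(pvalues)
    norm = NormalDist()
    # Convert each p-value to a z-score, then take a weighted, rescaled sum.
    z = sum(w * -norm.inv_cdf(p) for w, p in zip(weights, pvalues))
    z /= math.sqrt(sum(w * w for w in weights))
    return norm.cdf(-z)


# Hypothetical inputs: one gene tested in three studies.
study_pvalues = [0.04, 0.03, 0.06]
replicates = [3, 5, 4]  # assumed weighting scheme, not taken from the abstract

p_fisher = fisher_combine(study_pvalues)
p_invnorm = inverse_normal_combine(study_pvalues, weights=replicates)
print(f"Fisher: {p_fisher:.4f}  inverse normal: {p_invnorm:.4f}")
```

Both functions expect p-values strictly between 0 and 1. In this toy input, three individually modest p-values combine into stronger joint evidence than any single study provides.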
Integrative approaches for differential analysis of transcriptome data
Department of Biological Sciences
High-throughput sequencing technologies have produced a huge amount of omics data. Myriad computational methods have been developed to analyze such data efficiently and accurately. In particular, recently developed single-cell sequencing technologies produce highly sparse and noisy data, further necessitating the development of data analysis methods. As large amounts of omics data accumulate in public repositories, it has become common practice to collect multiple datasets on the same theme (e.g., a disease) and integrate them to increase the power of analysis. Because the data from individual studies differ in size, technology, experimenter, and many other environmental factors, they often exhibit systematic differences in distribution, known as batch effects. Thus, how to handle batch effects is crucial in integrative omics data analysis. This dissertation investigates computational methods to identify genes differentially expressed between biological conditions from transcriptome data and how to integrate the analyses across different samples (batches).
In Chapter 2, the performance of 12 differential expression (DE) analysis methods for RNA sequencing (RNA-seq) data was compared. These methods include widely used R packages such as edgeR, DESeq2 and limma as well as their recent variants. The benchmark data include RNA spike-in data, simulated read counts, and real RNA-seq data. For simulated data, a wide range of conditions was tested, including the proportion of DE genes, sample sizes, the presence of random outliers, and mean and dispersion estimates. We analyzed the impact of each factor on the overall performance of DE analysis and suggested suitable methods for each test condition. DESeq2, a robust version of edgeR, and voom with TMM normalization exhibited good overall performance. In Chapter 3, two novel meta-analysis methods capable of capturing "incomplete association" were proposed. Incomplete association refers to the coexistence of "associated" and "unassociated" statistics in the list of summary statistics obtained from different studies in an integrative analysis. Meta-analysis integrates the summary statistics from individual studies to increase statistical power. We demonstrated that the power of conventional meta-analysis methods decreased rapidly as the number of unassociated statistics increased. The classical Fisher's method and the newly proposed weighted Fisher's method (wFisher) effectively detected these incomplete associations. Another method, dubbed ordmeta, employed the joint distribution of ordered p-values and also performed well in detecting incomplete associations. In a meta-analysis of prostate cancer gene expression data, wFisher and ordmeta exclusively detected genes with high biological relevance.
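The effect of incomplete association can be made concrete with a toy calculation: one study with a strong signal is combined with a growing number of uninformative studies. The sketch below compares Fisher's method with the unweighted Stouffer z-score method, used here only as an assumed stand-in for a "conventional" approach, since the abstract does not name the methods it compared. wFisher and ordmeta are not implemented. The p-values (1e-6 for the associated study, 0.5 for each unassociated one) are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_p(pvalues):
    """Fisher's combined p-value via the closed-form chi-square tail for 2k df."""
    half_stat = -sum(math.log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for i in range(1, len(pvalues)):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer_p(pvalues):
    """Unweighted Stouffer combined p-value for one-sided p-values."""
    norm = NormalDist()
    z = sum(-norm.inv_cdf(p) for p in pvalues) / math.sqrt(len(pvalues))
    return norm.cdf(-z)


def incomplete_association(n_null, p_signal=1e-6, p_null=0.5):
    """Combine one associated study with n_null unassociated studies."""
    pvalues = [p_signal] + [p_null] * n_null
    return fisher_p(pvalues), stouffer_p(pvalues)


for n_null in (0, 4, 9):
    fisher, stouffer = incomplete_association(n_null)
    print(f"{n_null} null studies: Fisher p = {fisher:.2e}, Stouffer p = {stouffer:.2e}")
```

In this toy setting both combined p-values grow as unassociated studies are added, but the Stouffer p-value grows much faster. This matches the qualitative behaviour the abstract describes, without saying anything about the size of the effect in real data.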
Lastly, integrative DE analysis methods for single-cell RNA-seq (scRNA-seq) data were compared. In total, 41 computational pipelines that combine batch-effect correction methods, covariate modeling, and DE analysis methods were tested using simulated and real data. In particular, scRNA-seq data from seven patients with lung adenocarcinoma were analyzed. Remarkably, analysis of epithelial cells in the scRNA-seq data outperformed analysis of the large-scale bulk RNA-seq data available from The Cancer Genome Atlas in detecting known lung cancer genes and prognostic genes. Furthermore, GSEA revealed distinct aspects of enriched pathways between the epithelial cell and bulk RNA-seq data analyses.
Differential expression and feature selection in the analysis of multiple omics studies
With the rapid advances of high-throughput technologies in the past decades, various kinds of omics data have been generated by many labs and accumulated in the public domain. These studies have been designed for different biological purposes, including the identification of differentially expressed genes and the selection of predictive biomarkers. Effective meta-analysis of omics data from multiple studies can improve on the statistical power, accuracy and reproducibility of a single study. This dissertation covers several methods for differential expression (Chapters 2 and 3) and feature selection (Chapter 4) in the analysis of multiple omics studies.
In Chapter 2, we proposed a full Bayesian hierarchical model for RNA-seq meta-analysis by modeling count data, integrating information across genes and across studies, and modeling differential signals across studies via latent variables. A Dirichlet process mixture prior was further applied on the latent variables to provide categorization of detected biomarkers according to their differential expression patterns across studies. We used both simulations and a real application on multiple brain region HIV-1 transgenic rats to demonstrate improved sensitivity, accuracy and biological findings of our method. In Chapter 3, we extended the previous Bayesian model to jointly integrate transcriptomic data from the two platforms: microarray and RNA-seq.
In Chapter 4, we considered a general framework for variable screening with multiple omics studies and further proposed a novel two-step screening procedure for high-dimensional regression analysis in this framework. Compared to the one-step procedure and rank-based sure independence screening procedure, our procedure greatly reduced false negative errors while keeping a low false positive rate. Theoretically, we showed that our procedure possesses the sure screening property with weaker assumptions on signal strengths and allows the number of features to grow at an exponential rate of the sample size.
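For readers unfamiliar with variable screening, the sketch below shows the generic one-step idea that such procedures build on: rank features by the absolute value of their marginal correlation with the response and keep the top few. It is not the two-step, multi-study procedure proposed in the dissertation, whose details the abstract does not give. The function names, the gene names, and the data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def marginal_screen(features, response, keep):
    """Rank features by |marginal correlation| with the response; keep the top `keep`."""
    scores = {name: abs(pearson(values, response)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Hypothetical single-study data: six samples, three candidate features.
response = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "gene_a": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],  # tracks the response closely
    "gene_b": [6.2, 5.1, 3.8, 3.1, 2.0, 0.9],  # strong negative association
    "gene_c": [2.0, 5.0, 1.0, 4.0, 3.0, 2.0],  # only weakly related
}

selected = marginal_screen(features, response, keep=2)
print(selected)
```

Because the score uses the absolute correlation, features with strong negative associations are retained alongside those with strong positive ones.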
Public health significance:
The proposed methods are useful in detecting important biomarkers that are either differentially expressed or predictive of clinical outcomes. This is essential for searching for potential drug targets and understanding the disease mechanism. Such findings in basic science can be translated into preventive medicine or potential treatment for disease to promote human health and improve the global healthcare system
Recommended from our members
An atlas of cortical circular RNA expression in Alzheimer disease brains demonstrates clinical and pathological associations.
Parietal cortex RNA-sequencing (RNA-seq) data were generated from individuals with and without Alzheimer disease (AD; n_control = 13; n_AD = 83) from the Knight Alzheimer Disease Research Center (Knight ADRC). Using this and an independent (Mount Sinai Brain Bank (MSBB)) AD RNA-seq dataset, cortical circular RNA (circRNA) expression was quantified in the context of AD. Significant associations were identified between circRNA expression and AD diagnosis, clinical dementia severity and neuropathological severity. It was demonstrated that most circRNA-AD associations are independent of changes in cognate linear messenger RNA expression or estimated brain cell-type proportions. Evidence was provided for circRNA expression changes occurring early in presymptomatic AD and in autosomal dominant AD. It was also observed that AD-associated circRNAs co-expressed with known AD genes. Finally, potential microRNA-binding sites were identified in AD-associated circRNAs for miRNAs predicted to target AD genes. Together, these results highlight the importance of analyzing non-linear RNAs and support future studies exploring the potential roles of circRNAs in AD pathogenesis.
The ROS wheel: refining ROS transcriptional footprints
In the last decade, microarray studies have delivered extensive inventories of transcriptome-wide changes in messenger RNA levels provoked by various types of oxidative stress in Arabidopsis (Arabidopsis thaliana). Previous cross-study comparisons indicated how different types of reactive oxygen species (ROS) and their subcellular accumulation sites are able to reshape the transcriptome in specific manners. However, these analyses often employed simplistic statistical frameworks that are not compatible with large-scale analyses. Here, we reanalyzed a total of 79 Affymetrix ATH1 microarray studies of redox homeostasis perturbation experiments. To create hierarchy in such a high number of transcriptomic data sets, all transcriptional profiles were clustered on the overlap extent of their differentially expressed transcripts. Subsequently, meta-analysis determined a single magnitude of differential expression across studies and identified common transcriptional footprints per cluster. The resulting transcriptional footprints revealed the regulation of various metabolic pathways and gene families. The RESPIRATORY BURST OXIDASE HOMOLOG F-mediated respiratory burst had a major impact and was a converging point among several studies. Conversely, the timing of the oxidative stress response was a determining factor in shaping different transcriptome footprints. Our study emphasizes the need to interpret transcriptomic data sets in a systematic context, where initial, specific stress triggers can converge to common, aspecific transcriptional changes. We believe that these refined transcriptional footprints provide a valuable resource for assessing the involvement of ROS in biological processes in plants
Allele-specific NKX2-5 binding underlies multiple genetic associations with human electrocardiographic traits.
The cardiac transcription factor (TF) gene NKX2-5 has been associated with electrocardiographic (EKG) traits through genome-wide association studies (GWASs), but the extent to which differential binding of NKX2-5 at common regulatory variants contributes to these traits has not yet been studied. We analyzed transcriptomic and epigenomic data from induced pluripotent stem cell-derived cardiomyocytes from seven related individuals, and identified ~2,000 single-nucleotide variants associated with allele-specific effects (ASE-SNVs) on NKX2-5 binding. NKX2-5 ASE-SNVs were enriched for altered TF motifs, for heart-specific expression quantitative trait loci and for EKG GWAS signals. Using fine-mapping combined with epigenomic data from induced pluripotent stem cell-derived cardiomyocytes, we prioritized candidate causal variants for EKG traits, many of which were NKX2-5 ASE-SNVs. Experimentally characterizing two NKX2-5 ASE-SNVs (rs3807989 and rs590041) showed that they modulate the expression of target genes via differential protein binding in cardiac cells, indicating that they are functional variants underlying EKG GWAS signals. Our results show that differential NKX2-5 binding at numerous regulatory variants across the genome contributes to EKG phenotypes
Essential guidelines for computational method benchmarking
In computational biology and other sciences, researchers are frequently faced
with a choice between several computational methods for performing data
analyses. Benchmarking studies aim to rigorously compare the performance of
different methods using well-characterized benchmark datasets, to determine the
strengths of each method or to provide recommendations regarding suitable
choices of methods for an analysis. However, benchmarking studies must be
carefully designed and implemented to provide accurate, unbiased, and
informative results. Here, we summarize key practical guidelines and
recommendations for performing high-quality benchmarking analyses, based on our
experiences in computational biology.
Common CHD8 Genomic Targets Contrast With Model-Specific Transcriptional Impacts of CHD8 Haploinsufficiency.
The packaging of DNA into chromatin determines the transcriptional potential of cells and is central to eukaryotic gene regulation. Case sequencing studies have revealed mutations to proteins that regulate chromatin state, known as chromatin remodeling factors, with causal roles in neurodevelopmental disorders. Chromodomain helicase DNA binding protein 8 (CHD8) encodes a chromatin remodeling factor with among the highest de novo loss-of-function mutation rates in patients with autism spectrum disorder (ASD). However, mechanisms associated with CHD8 pathology have yet to be elucidated. We analyzed published transcriptomic data across CHD8 in vitro and in vivo knockdown and knockout models and CHD8 binding across published ChIP-seq datasets to identify convergent mechanisms of gene regulation by CHD8. Differentially expressed genes (DEGs) across models varied, but overlap was observed between downregulated genes involved in neuronal development and function, cell cycle, chromatin dynamics, and RNA processing, and between upregulated genes involved in metabolism and immune response. Considering the variability in transcriptional changes and the cells and tissues represented across ChIP-seq analysis, we found a surprisingly consistent set of high-affinity CHD8 genomic interactions. CHD8 was enriched near promoters of genes involved in basic cell functions and gene regulation. Overlap between high-affinity CHD8 targets and DEGs shows that reduced dosage of CHD8 directly relates to decreased expression of cell cycle, chromatin organization, and RNA processing genes, but only in a subset of studies. This meta-analysis verifies CHD8 as a master regulator of gene expression and reveals a consistent set of high-affinity CHD8 targets across human, mouse, and rat in vivo and in vitro studies. These conserved regulatory targets include many genes that are also implicated in ASD. 
Our findings suggest a model where perturbation to dosage-sensitive CHD8 genomic interactions with a highly-conserved set of regulatory targets leads to model-specific downstream transcriptional impacts