Differential expression analysis with global network adjustment
<p>Background: Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene’s expression as a function of other genes thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, and the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.</p>
<p>Results: We demonstrate a type of penalized regression model that can be estimated from large gene expression databases, and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent out-sample. This tends to increase the model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient “over-shrinkage” method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio, allowing more powerful inferences on differential gene expression and leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, a finding that would be impossible to obtain with standard differential expression methods.</p>
<p>Conclusions: By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.</p>
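The two-step idea in the abstract above (heavy ridge shrinkage, then a second regression to offset the resulting bias) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy data, the fixed ridge parameter, and in particular the reading of the "second round of regression" as a least-squares rescaling of the ridge prediction. The abstract does not specify these details, and this is not the authors' implementation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam*I)^-1 X'y; no intercept, data assumed centered."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(X, beta):
    """Linear prediction X @ beta."""
    return [sum(b * v for b, v in zip(beta, row)) for row in X]


def second_stage_slope(pred, y):
    """Least-squares slope (through the origin) of y on the shrunken prediction."""
    return sum(p * yi for p, yi in zip(pred, y)) / sum(p * p for p in pred)


# Toy, centered data (hypothetical): the target depends on two "regulator" genes.
X = [[1, 0], [-1, 0], [0, 1], [0, -1], [1, 1], [-1, -1]]
y = [2 * a - b for a, b in X]

beta = ridge_fit(X, y, lam=20.0)              # deliberately heavy shrinkage
slope = second_stage_slope(predict(X, beta), y)
adjusted = [slope * p for p in predict(X, beta)]  # rescaled prediction
```

With a large ridge parameter the coefficients are pulled strongly toward zero, so the second-stage slope comes out greater than one and stretches the prediction back toward the scale of the observed expression.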
Trajectory-based differential expression analysis for single-cell sequencing data
Trajectory inference has radically enhanced single-cell RNA-seq research by enabling the study of dynamic changes in gene expression. Downstream of trajectory inference, it is vital to discover genes that are (i) associated with the lineages in the trajectory, or (ii) differentially expressed between lineages, to illuminate the underlying biological processes. Current data analysis procedures, however, either fail to exploit the continuous resolution provided by trajectory inference, or fail to pinpoint the exact types of differential expression. We introduce tradeSeq, a powerful generalized additive model framework based on the negative binomial distribution that allows flexible inference of both within-lineage and between-lineage differential expression. By incorporating observation-level weights, the model can additionally account for zero inflation. We evaluate the method on simulated datasets and on real datasets from droplet-based and full-length protocols, and show that it yields biological insights through a clear interpretation of the data. Downstream of trajectory inference for cell lineages based on scRNA-seq data, differential expression analysis yields insight into biological processes. Here, Van den Berge et al. develop tradeSeq, a framework for the inference of within- and between-lineage differential expression, based on negative binomial generalized additive models.
Differential expression analysis for multiple conditions
As high-throughput sequencing has become common practice, the cost of
sequencing large amounts of genetic data has been drastically reduced, leading
to much larger data sets for analysis. One important task is to identify
biological conditions that lead to unusually high or low expression of a
particular gene. Packages such as DESeq implement a simple method for testing
differential signal when exactly two biological conditions are possible. For
more than two conditions, pairwise testing is typically used. Here the DESeq
method is extended so that three or more biological conditions can be assessed
simultaneously. Because the computation time grows exponentially in the number
of conditions, a Monte Carlo approach provides a fast way to approximate the
p-values for the new test. The approach is studied on both simulated data and
a data set of C. jejuni, the bacteria responsible for most food poisoning
in the United States.
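The general idea of approximating a p-value by Monte Carlo can be sketched as follows. This is only a generic illustration: the test statistic (range of condition means), the use of label permutation to generate the null distribution, and all names are assumptions chosen for brevity, and none of them are the extended DESeq test described in the abstract.

```python
import random


def range_of_means(values, labels):
    """Test statistic: largest minus smallest per-condition mean."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    means = [sum(vs) / len(vs) for vs in groups.values()]
    return max(means) - min(means)


def monte_carlo_pvalue(values, labels, n_sim=2000, seed=0):
    """Approximate P(statistic >= observed) under the null by random relabelling."""
    rng = random.Random(seed)
    observed = range_of_means(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if range_of_means(values, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so the estimate is never zero.
    return (hits + 1) / (n_sim + 1)


# Hypothetical counts for one gene under three conditions, three replicates each.
counts = [10, 12, 11, 50, 55, 52, 11, 9, 10]
conditions = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
p_value = monte_carlo_pvalue(counts, conditions)
```

The cost of this approximation grows linearly with the number of simulations rather than with the number of possible outcomes, which is the reason sampling is attractive when exact enumeration becomes too expensive.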
Inference for High-dimensional Differential Correlation Matrices
Motivated by differential co-expression analysis in genomics, we consider in
this paper estimation and testing of high-dimensional differential correlation
matrices. An adaptive thresholding procedure is introduced and theoretical
guarantees are given. Minimax rate of convergence is established and the
proposed estimator is shown to be adaptively rate-optimal over collections of
paired correlation matrices with approximately sparse differences. Simulation
results show that the procedure significantly outperforms two other natural
methods that are based on separate estimation of the individual correlation
matrices. The procedure is also illustrated through an analysis of a breast
cancer dataset, which provides evidence at the gene co-expression level that
several genes, of which a subset has been previously verified, are associated
with breast cancer. Hypothesis testing on the differential correlation
matrices is also considered. A test, which is particularly well suited for
testing against sparse alternatives, is introduced. In addition, other related
problems, including estimation of a single sparse correlation matrix,
estimation of the differential covariance matrices, and estimation of the
differential cross-correlation matrices, are also discussed. Comment: Accepted for publication in Journal of Multivariate Analysis
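As a rough illustration of estimating a sparse differential correlation matrix, the sketch below takes the difference of two sample correlation matrices and zeroes out small entries. It uses a single fixed threshold `tau`, which is a deliberate simplification: the adaptive, entry-specific thresholds of the procedure in the abstract are not reproduced here, and the function names and toy data are hypothetical.

```python
import math


def correlation_matrix(samples):
    """Sample correlation matrix; rows are observations, columns are variables.

    Assumes every variable has non-zero variance.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(p)] for i in range(p)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]


def thresholded_difference(samples_a, samples_b, tau):
    """Difference of correlation matrices with entries of magnitude <= tau set to zero."""
    ra = correlation_matrix(samples_a)
    rb = correlation_matrix(samples_b)
    p = len(ra)
    diff = [[ra[i][j] - rb[i][j] for j in range(p)] for i in range(p)]
    return [[d if abs(d) > tau else 0.0 for d in row] for row in diff]


# Toy data (hypothetical): variables 1 and 2 are positively correlated in
# group A and negatively correlated in group B.
group_a = [[1, 2, 1], [2, 4, -1], [3, 6, 1], [4, 8, -1]]
group_b = [[1, -2, 1], [2, -4, -1], [3, -6, 1], [4, -8, -1]]
sparse_diff = thresholded_difference(group_a, group_b, tau=1.0)
```

Thresholding the difference directly, rather than each matrix separately, keeps pairs whose correlation changes between groups even when neither individual matrix is sparse.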
Differential expression analysis for sequence count data
*Motivation:* High-throughput nucleotide sequencing provides quantitative readouts in assays for RNA expression (RNA-Seq), protein-DNA binding (ChIP-Seq) or cell counting (barcode sequencing). Statistical inference of differential signal in such data requires estimation of their variability throughout the dynamic range. When the number of replicates is small, error modelling is needed to achieve statistical power.

*Results:* We propose an error model that uses the negative binomial distribution, with variance and mean linked by local regression, to model the null distribution of the count data. The method controls type-I error and provides good detection power. 

*Availability:* A free open-source R software package, _DESeq_, is available from the Bioconductor project and from "http://www-huber.embl.de/users/anders/DESeq":http://www-huber.embl.de/users/anders/DESeq
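To make the negative binomial error model more concrete, here is a minimal sketch using the common mean/dispersion parameterization, in which the variance is mu + alpha * mu^2. The helper names and the per-gene method-of-moments dispersion estimate are assumptions for illustration only. The local regression that links variance to mean across genes, as described in the abstract, is not implemented, and none of this is the DESeq package API.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial probability of count k with mean mu and dispersion alpha > 0."""
    r = 1.0 / alpha          # "size" parameter
    p = r / (r + mu)         # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def nb_variance(mu, alpha):
    """Variance implied by the model: Poisson noise plus an overdispersion term."""
    return mu + alpha * mu * mu


def moment_dispersion(counts):
    """Method-of-moments dispersion from replicate counts, floored at zero."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return max((var - mean) / (mean * mean), 0.0)


# Hypothetical replicate counts for one gene in one condition.
replicates = [10, 20, 30]
alpha_hat = moment_dispersion(replicates)
mu_hat = sum(replicates) / len(replicates)
prob_of_25 = nb_pmf(25, mu_hat, alpha_hat)
```

With only a handful of replicates, a per-gene estimate like `alpha_hat` is very noisy, which is the motivation for pooling information across genes with similar mean expression.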
Cost effective, experimentally robust differential-expression analysis for human/mammalian, pathogen and dual-species transcriptomics.
As sequencing read length has increased, researchers have quickly adopted longer reads for their experiments. Here, we examine 14 pathogen or host-pathogen differential gene expression data sets to assess whether using longer reads is warranted. A variety of data sets was used to assess what genomic attributes might affect the outcome of differential gene expression analysis including: gene density, operons, gene length, number of introns/exons and intron length. No genome attribute was found to influence the data in principal components analysis, hierarchical clustering with bootstrap support, or regression analyses of pairwise comparisons that were undertaken on the same reads, looking at all combinations of paired and unpaired reads trimmed to 36, 54, 72 and 101 bp. Read pairing had the greatest effect when there was little variation in the samples from different conditions or in their replicates (e.g. little differential gene expression). But overall, 54 and 72 bp reads were typically most similar. Given differences in costs and mapping percentages, we recommend 54 bp reads for organisms with no or few introns and 72 bp reads for all others. In a third of the data sets, read pairing had absolutely no effect, despite paired reads having twice as much data. Therefore, single-end reads seem robust for differential-expression analyses, but in eukaryotes paired-end reads are likely desired to analyse splice variants and should be preferred for data sets that are acquired with the intent to be community resources that might be used in secondary data analyses
