Filtering, FDR and power
Background: In high-dimensional data analysis such as differential gene expression analysis, people often use filtering methods like fold-change or variance filters in an attempt to reduce the multiple testing penalty and improve power. However, filtering may introduce a bias on the multiple testing correction. The precise amount of bias depends on many quantities, such as fraction of probes filtered out, filter statistic and test statistic used. Results: We show that a biased multiple testing correction results if non-differentially expressed probes are not filtered out with equal probability from the entire range of p-values. We illustrate our results using both a simulation study and an experimental dataset, where the FDR is shown to be biased mostly by filters that are associated with the hypothesis being tested, such as the fold change. Filters that induce little bias on the FDR yield less additional power of detecting differentially expressed genes. Finally, we propose a statistical test that can be used in practice to determine whether any chosen filter introduces bias on the FDR estimate used, given a general experimental setup. Conclusions: Filtering out of probes must be used with care as it may bias the multiple testing correction. Researchers can use our test for FDR bias to guide their choice of filter and amount of filtering in practice.
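The interplay between filtering and the multiple testing penalty can be sketched in a few lines. The example below is an illustration only, not the authors' test for FDR bias: it assumes the Benjamini-Hochberg procedure as the FDR correction (the abstract does not name one), and the function names `bh_adjust` and `count_discoveries`, the p-values and the filter mask are all hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_discoveries(pvalues, keep, alpha=0.10):
    """Apply BH only to probes that pass the filter (keep[i] is True)."""
    kept = [p for p, k in zip(pvalues, keep) if k]
    return sum(q <= alpha for q in bh_adjust(kept))


# Hypothetical p-values for four probes.
p = [0.03, 0.06, 0.5, 0.9]
print(count_discoveries(p, [True, True, True, True]))    # no filtering: 0
print(count_discoveries(p, [True, True, False, False]))  # two probes removed: 2
```

Removing two probes halves the correction factor, so both remaining probes pass the 10% threshold. Note that this hand-picked mask keeps exactly the probes with small p-values, which is the kind of filter associated with the hypothesis being tested that the abstract warns may bias the FDR.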
Testing for association between RNA-Seq and high-dimensional data
Background: Testing for association between RNA-Seq and other genomic data is challenging due to high variability of the former and high dimensionality of the latter. Results: Using the negative binomial distribution and a random-effects model, we develop an omnibus test that overcomes both difficulties. It may be conceptualised as a test of overall significance in regression analysis, where the response variable is overdispersed and the number of explanatory variables exceeds the sample size. Conclusions: The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression. The R package globalSeq is available from Bioconductor.
Sparse classification with paired covariates
Funder: Department of Epidemiology and Biostatistics, Amsterdam UMC, VU University Amsterdam. Abstract: This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same data are available. It is often unknown which of the two covariate sets leads to better predictions, or whether the two covariate sets complement each other. The paired lasso addresses this problem by weighting the covariates to improve the selection from the covariate sets and the covariate pairs. It thereby combines information from both covariate sets and accounts for the paired structure. We tested the paired lasso on more than 2000 classification problems with experimental genomics data, and found that for estimating sparse but predictive models, the paired lasso outperforms the standard and the adaptive lasso. The R package is available from CRAN.
Can subtle changes in gene expression be consistently detected with different microarray platforms?
Background: The comparability of gene expression data generated with different microarray platforms is still a matter of concern. Here we address the performance and the overlap in the detection of differentially expressed genes for five different microarray platforms in a challenging biological context where differences in gene expression are few and subtle. Results: Gene expression profiles in the hippocampus of five wild-type and five transgenic δC-doublecortin-like kinase mice were evaluated with five microarray platforms: Applied Biosystems, Affymetrix, Agilent, Illumina, LGTC home-spotted arrays. Using a fixed false discovery rate of 10% we detected surprising differences between the number of differentially expressed genes per platform. Four genes were selected by ABI, 130 by Affymetrix, 3,051 by Agilent, 54 by Illumina, and 13 by LGTC. Two genes were found significantly differentially expressed by all platforms and the four genes identified by the ABI platform were found by at least three other platforms. Quantitative RT-PCR analysis confirmed 20 out of 28 of the genes detected by two or more platforms and 8 out of 15 of the genes detected by Agilent only. We observed better correlations between platforms when ranking the genes by significance level than when applying a fixed statistical cut-off. We demonstrate significant overlap in the affected gene sets identified by the different platforms, although biological processes were represented by only partially overlapping sets of genes. Aberrances in GABA-ergic signalling in the transgenic mice were consistently found by all platforms. Conclusion: The different microarray platforms give partially complementary views on biological processes affected. Our data indicate that when analyzing samples with only subtle differences in gene expression the use of two different platforms might be more attractive than increasing the number of replicates. Commercial two-color platforms seem to have higher power for finding differentially expressed genes between groups with small differences in expression.
Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms
The hippocampal expression profiles of wild-type mice and mice transgenic for δC-doublecortin-like kinase were compared with Solexa/Illumina deep sequencing technology and five different microarray platforms. With Illumina's digital gene expression assay, we obtained ~2.4 million sequence tags per sample, their abundance spanning four orders of magnitude. Results were highly reproducible, even across laboratories. With a dedicated Bayesian model, we found differential expression of 3179 transcripts with an estimated false-discovery rate of 8.5%. This is a much higher figure than found for microarrays. The overlap in differentially expressed transcripts found with deep sequencing and microarrays was most significant for Affymetrix. The changes in expression observed by deep sequencing were larger than observed by microarrays or quantitative PCR. Relevant processes such as calmodulin-dependent protein kinase activity and vesicle transport along microtubules were found affected by deep sequencing but not by microarrays. While undetectable by microarrays, antisense transcription was found for 51% of all genes and alternative polyadenylation for 47%. We conclude that deep sequencing provides a major advance in robustness, comparability and richness of expression profiling data and is expected to boost collaborative, comparative and integrative genomics studies.
Integrated analysis of DNA copy number and gene expression microarray data using gene sets
Background: Genes that play an important role in tumorigenesis are expected to show association between DNA copy number and RNA expression. Optimal power to find such associations can only be achieved if analysing copy number and gene expression jointly. Furthermore, some copy number changes extend over larger chromosomal regions affecting the expression levels of multiple resident genes.
Mechanisms that clear mutations drive field cancerization in mammary tissue
Oncogenic mutations are abundant in the tissues of healthy individuals, but rarely form tumours (refs. 1–3). Yet, the underlying protection mechanisms are largely unknown. To resolve these mechanisms in mouse mammary tissue, we use lineage tracing to map the fate of wild-type and Brca1−/−;Trp53−/− cells, and find that both follow a similar pattern of loss and spread within ducts. Clonal analysis reveals that ducts consist of small repetitive units of self-renewing cells that give rise to short-lived descendants. This offers a first layer of protection as any descendants, including oncogenic mutant cells, are constantly lost, thereby limiting the spread of mutations to a single stem cell-descendant unit. Local tissue remodelling during consecutive oestrous cycles leads to the cooperative and stochastic loss and replacement of self-renewing cells. This process provides a second layer of protection, leading to the elimination of most mutant clones while enabling the minority that by chance survive to expand beyond the stem cell-descendant unit. This leads to fields of mutant cells spanning large parts of the epithelial network, predisposing it for transformation. Eventually, clone expansion becomes restrained by the geometry of the ducts, providing a third layer of protection. Together, these mechanisms act to eliminate most cells that acquire somatic mutations at the expense of driving the accelerated expansion of a minority of cells, which can colonize large areas, leading to field cancerization.
Quasi-variances
In statistical models of dependence, the effect of a categorical variable is typically described by contrasts among parameters. For reporting such effects, quasi-variances provide an economical and intuitive method which permits approximate inference on any contrast by subsequent readers. Applications include generalised linear models, generalised additive models and hazard models. The present paper exposes the generality of quasi-variances, emphasises the need to control relative errors of approximation, gives simple methods for obtaining quasi-variances and bounds on the approximation error involved, and explores the domain of accuracy of the method. Conditions are identified under which the quasi-variance approximation is exact, and numerical work indicates high accuracy in a variety of settings.
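A small sketch may help make the idea concrete. It assumes the usual working definition, which the abstract does not spell out: quasi-variances q_i are chosen so that the variance of each simple contrast between levels i and j is approximately q_i + q_j. With exactly three levels there are three contrast variances and three unknowns, so the system can be solved exactly. The function name and the input numbers below are hypothetical, and this is not the fitting method of the paper, which covers the general case.

```python
def quasi_variances_three_levels(v12, v13, v23):
    """Solve q_i + q_j = v_ij exactly for a factor with three levels.

    v12, v13 and v23 are the variances of the three pairwise contrasts.
    The solution always exists, but a quasi-variance can come out negative.
    """
    q1 = (v12 + v13 - v23) / 2
    q2 = (v12 + v23 - v13) / 2
    q3 = (v13 + v23 - v12) / 2
    return q1, q2, q3


# Hypothetical contrast variances for levels (1,2), (1,3) and (2,3).
q1, q2, q3 = quasi_variances_three_levels(0.5, 0.7, 0.6)
# A reader can now recover any contrast variance from three numbers,
# for example var(level 1 - level 3) as q1 + q3.
print(q1, q2, q3, q1 + q3)
```

Reporting one number per level instead of a full covariance matrix is what makes the summary economical; with more than three levels the relation generally holds only approximately.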