370 research outputs found

    A comparative review of estimates of the proportion unchanged genes and the false discovery rate

    BACKGROUND: In the analysis of microarray data one generally produces a vector of p-values that, for each gene, give the probability of obtaining equally strong evidence of change by pure chance. The distribution of these p-values is a mixture of two components, corresponding to the changed genes and the unchanged ones. The focus of this article is how to estimate the proportion of unchanged genes and the false discovery rate (FDR), and how to make inferences based on these concepts. Six published methods for estimating the proportion of unchanged genes are reviewed, two alternatives are presented, and all are tested on both simulated and real data. All estimates but one make do without any parametric assumptions concerning the distributions of the p-values. Furthermore, the estimation and use of the FDR and the closely related q-value are illustrated with examples. Five published estimates of the FDR and one new one are presented and tested. Implementations in R code are available. RESULTS: A simulation model based on the distribution of real microarray data, plus two real data sets, were used to assess the methods. The proposed alternative methods for estimating the proportion of unchanged genes fared very well, showing low bias and very low variance. Different methods perform well depending on whether there are few or many regulated genes. The methods for estimating the FDR showed varying performance and were sometimes misleading; the new method had a very low error. CONCLUSION: The concept of the q-value, or false discovery rate, is useful in practical research despite some theoretical and practical shortcomings. However, it seems possible to challenge the performance of the published methods, and there is likely scope for further developing the estimates of the FDR. The new methods provide the scientist with more options for choosing a suitable method for any particular experiment. The article advocates the use of conjoint information about false positive and false negative rates, as well as the proportion of unchanged genes, when identifying changed genes.
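The two quantities the abstract revolves around can be sketched compactly. Below is a minimal Python illustration of a Storey-type estimator of the proportion of unchanged genes and the step-up q-value computation; the function names and the fixed lambda = 0.5 are illustrative choices, not the paper's own implementations (the authors provide theirs in R).

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Storey-style estimate of the proportion of unchanged genes:
    p-values above lambda are assumed to come mostly from the null
    (uniform) component, so their density estimates pi0."""
    pvals = np.asarray(pvals)
    pi0 = np.mean(pvals > lam) / (1.0 - lam)
    return min(pi0, 1.0)

def q_values(pvals, pi0=1.0):
    """q-value of gene i: the minimum estimated FDR over all
    rejection thresholds that include p_i (step-up over sorted p)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    q = np.empty(m)
    prev = 1.0
    for rank, idx in enumerate(order[::-1]):  # largest p first
        r = m - rank                          # number rejected at p[idx]
        prev = min(prev, pi0 * p[idx] * m / r)
        q[idx] = prev
    return q
```

Plugging the pi0 estimate into `q_values` instead of the default 1.0 gives the less conservative q-values whose behaviour the paper compares across methods.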

    Zipper plot: visualizing transcriptional activity of genomic regions

    Background: Reconstructing transcript models from RNA-sequencing (RNA-seq) data and establishing these as independent transcriptional units can be a challenging task. Current state-of-the-art tools for long non-coding RNA (lncRNA) annotation are mainly based on evolutionary constraints, which may result in false negatives due to the overall limited conservation of lncRNAs. Results: To tackle this problem we have developed the Zipper plot, a novel visualization and analysis method that enables users to simultaneously interrogate thousands of human putative transcription start sites (TSSs) in relation to various features indicative of transcriptional activity, including publicly available CAGE-sequencing, ChIP-sequencing and DNase-sequencing datasets. Our method requires only three tab-separated fields (chromosome, genomic coordinate of the TSS and strand) as input and generates a report that includes a detailed summary table, a Zipper plot and several statistics derived from this plot. Conclusion: Using the Zipper plot, we found evidence of transcription for a set of well-characterized lncRNAs and observed that fewer mono-exonic lncRNAs have CAGE peaks overlapping their TSSs than multi-exonic lncRNAs. Using publicly available RNA-seq data, we found more than one hundred cases where junction reads connected protein-coding gene exons with a downstream mono-exonic lncRNA, revealing the need for careful evaluation of lncRNA 5′ boundaries. Our method is implemented in the statistical programming language R and is freely available as a webtool.
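The core query behind such a report is: does any evidence-of-activity peak sit near a putative TSS? A simplified Python sketch of that lookup is shown below; the data structures and window size are hypothetical, and the actual webtool (implemented in R) matches TSSs against CAGE-, ChIP- and DNase-seq datasets with proper strand handling.

```python
from collections import defaultdict

def tss_peak_overlap(tss_list, peaks, window=50):
    """For each TSS (chrom, coord, strand), report whether any peak
    interval (chrom, start, end) lies within `window` bp of the TSS.
    A minimal sketch of the kind of query the Zipper plot summarizes;
    genome-scale data would use an interval tree instead of a scan."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    hits = []
    for chrom, coord, strand in tss_list:
        hit = any(start - window <= coord <= end + window
                  for start, end in by_chrom[chrom])
        hits.append(hit)
    return hits
```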

    Topics on statistical design and analysis of cDNA microarray experiment

    A microarray is a powerful tool for surveying the expression levels of many thousands of genes simultaneously. It belongs to the new genomics technologies, which have important applications in the biological, agricultural and pharmaceutical sciences. In this thesis, we focus on the dual-channel cDNA microarray, one of the most popular microarray technologies, and discuss three topics: optimal experimental design; estimation of the proportion of true nulls, the local false discovery rate (lFDR) and the positive false discovery rate (pFDR); and dye effect normalization. The first topic consists of four subtopics, each concerning an independent and practical problem of cDNA microarray experimental design. In the first subtopic, we propose an optimization strategy based on simulated annealing to find optimal or near-optimal designs with both biological and technical replicates. In the second subtopic, we discuss how to apply the Q-criterion to the factorial design of microarray experiments. In the third subtopic, we suggest an optimal way of pooling samples, which is in effect a replication scheme that minimizes the variance of the experiment under the constraint of a fixed total cost. In the fourth subtopic, we show that the criterion for distant-pair design is not proper and propose an alternative criterion. The second topic of this thesis is dye effect normalization. In cDNA microarray technology, each array compares two samples, usually labelled with the dyes Cy3 and Cy5. The technology assumes that, for a given gene (spot) on the array, if the Cy3-labelled sample has k times as much of a transcript as the Cy5-labelled sample, then the Cy3 signal should be k times as high as the Cy5 signal, and vice versa. This important assumption requires that the dyes have the same properties. In reality, however, the Cy3 and Cy5 dyes have slightly different properties, and their relative efficiency varies across the intensity range in a "banana-shaped" way. To remove the dye effect, we propose a novel dye effect normalization method based on modelling the dye response functions and the dye effect curve. Real and simulated microarray data sets are used to evaluate the method, and the results show that its performance is satisfactory. The third topic focuses on the estimation of the proportion of true null hypotheses, the lFDR and the pFDR. In a typical microarray experiment, a large number of gene expression values are measured. To find differentially expressed genes, these variables are usually screened simultaneously by a statistical test. Since this is a case of multiple hypothesis testing, some adjustment should be made to the p-values resulting from the test. Many multiple-testing error rates, such as the FDR, lFDR and pFDR, have been proposed to address this issue. A key related problem is the estimation of the proportion of true null hypotheses (i.e. non-differentially expressed genes). To model the distribution of the p-values, we propose three kinds of finite mixtures with an unknown number of components (the first component corresponds to differentially expressed genes and the remaining components to non-differentially expressed ones). We apply a new MCMC method called the allocation sampler to estimate the proportion of true nulls (i.e. the mixture weight of the first component). The method also provides a framework for estimating the lFDR and pFDR. Two real microarray data studies plus a small simulation study are used to assess our method, and we show that its performance is satisfactory.
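As a point of reference for what intensity-dependent dye-bias correction does, here is a generic MA-scale correction using per-bin medians. This is a crude stand-in for the loess fit commonly used in practice; the thesis's own method models the dye response functions and dye effect curve directly, which this sketch does not attempt.

```python
import numpy as np

def intensity_normalize(cy3, cy5, n_bins=10):
    """Intensity-dependent dye-bias correction on the MA scale:
    M = log2(Cy5/Cy3), A = mean log2 intensity.  Subtracting the
    per-intensity-bin median of M centres the corrected log-ratios
    at zero across the intensity range, flattening the 'banana'."""
    cy3, cy5 = np.asarray(cy3, float), np.asarray(cy5, float)
    M = np.log2(cy5) - np.log2(cy3)
    A = 0.5 * (np.log2(cy5) + np.log2(cy3))
    # Equal-count intensity bins via quantiles of A.
    edges = np.quantile(A, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, A, side="right") - 1,
                  0, n_bins - 1)
    bias = np.array([np.median(M[idx == b]) for b in range(n_bins)])
    return M - bias[idx]
```

With a constant multiplicative dye bias, the corrected log-ratios come out at zero, as expected for self-self comparisons.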

    An improved procedure for gene selection from microarray experiments using false discovery rate criterion

    BACKGROUND: A large number of genes usually show differential expression in a microarray experiment comparing two types of tissues, and the p-values of a proper statistical test are often used to quantify the significance of these differences. The genes with small p-values are then picked as the ones responsible for the differences in tissue RNA expression. One key question is what threshold to use for considering p-values small, and there is always a trade-off between this threshold and the rate of false claims. Recent statistical literature shows that the false discovery rate (FDR) criterion is a powerful and reasonable criterion for picking genes with differential expression, and the power of detection can be increased by knowing the number of non-differentially expressed genes. While this number is unknown in practice, there are methods to estimate it from data. The purpose of this paper is to present a new method of estimating this number and to use it in constructing the FDR procedure. RESULTS: A combination of test functions is used to estimate the number of differentially expressed genes. A simulation study shows that the proposed method has higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of truly differentially expressed genes is large. The procedure has also been tested, with good results, on a real dataset. CONCLUSION: For a given expected FDR, the method proposed in this paper has better power to pick differentially expressed genes than two other well-known methods.
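The power gain from plugging an estimate of the number of non-differentially expressed genes into the FDR procedure can be seen in a few lines. The following is the standard adaptive Benjamini-Hochberg step-up, shown as a generic Python sketch; it is not the paper's combination-of-test-functions estimator, only the procedure such an estimate feeds into.

```python
import numpy as np

def adaptive_bh(pvals, m0, alpha=0.05):
    """Benjamini-Hochberg step-up with a plug-in estimate m0 of the
    number of non-differentially expressed genes.  Using m0 < m in
    place of m raises the per-rank thresholds and hence the power,
    which is the gain a better m0 estimator buys.  Returns a boolean
    rejection mask."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m0
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, bool)
    reject[order[:k]] = True
    return reject
```

On the same p-values, shrinking m0 from 5 to 3 turns two rejections into four while the nominal FDR stays at alpha.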

    Differential Abundance and Clustering Analysis with Empirical Bayes Shrinkage Estimation of Variance (DASEV) for Proteomics and Metabolomics Data

    Mass spectrometry (MS) is widely used for proteomic and metabolomic profiling of biological samples. Data obtained by MS are often zero-inflated; those zero values are called point mass values (PMVs). PMVs can be further grouped into biological PMVs, caused by the absence of a component, and technical PMVs, caused by the detection limit, and there is no simple way to separate the two types. Mixture models were developed to separate the two types of zeros and to perform differential abundance analysis. However, the mixture model can be unstable when the number of non-zero values is small. In this dissertation, we propose a new differential abundance (DA) analysis method, DASEV, which applies empirical Bayes shrinkage estimation to the variance. We hypothesized that this would make the variance estimation more robust and thus enhance the accuracy of the differential abundance analysis. Despite its stability issue, the mixture-model framework offers a promising strategy for separating the two types of PMVs. We adapted the mixture distribution proposed in the original mixture model design and assumed that the variances of all components follow a certain distribution. We calculate the estimated variances by borrowing information across components via the assumed variance distribution, and then re-estimate the other parameters using these estimated variances. We obtained better and more stable estimates of the variances, mean abundances and proportions of biological PMVs, especially when the proportion of zeros is large. The proposed method therefore achieves clear improvements in DA analysis. We also propose to extend the method to clustering analysis. To our knowledge, the clustering methods commonly used for MS omics data are K-means and hierarchical clustering, both of which have limitations when applied to zero-inflated data. Model-based clustering methods are widely used for various data types, including zero-inflated data, and we propose the extension DASEV.C as a model-based clustering method. We compared the clustering performance of DASEV.C with K-means and hierarchical clustering; under certain scenarios, the proposed method returned more accurate clusters than the standard methods. We also developed an R package, dasev, for the methods presented in this dissertation. Its major functions, DASEV.DA and DASEV.C, implement the empirical Bayes shrinkage estimation of variance and then conduct the differential abundance and cluster analyses; the functions are designed to give researchers the flexibility to specify certain input options.
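The variance-shrinkage idea at the heart of DASEV can be illustrated with a simple precision-weighted compromise between each feature's sample variance and a pooled prior. This is a limma-style moderation sketch with a fixed prior degrees-of-freedom `d0`; DASEV's actual model also handles the zero inflation and estimates these quantities from the data.

```python
import numpy as np

def shrink_variances(data, d0=4.0):
    """Empirical Bayes shrinkage of per-feature variances toward a
    pooled prior, in the spirit of DASEV's information borrowing:
    s2_shrunk = (d0 * s0^2 + d * s2) / (d0 + d), where s0^2 is the
    pooled variance acting as the prior mean and d0 its prior degrees
    of freedom (fixed here; a full empirical Bayes fit estimates it).
    `data` is a features-by-samples array."""
    data = np.asarray(data, float)
    d = data.shape[1] - 1                 # residual df per feature
    s2 = data.var(axis=1, ddof=1)         # per-feature sample variance
    s0_sq = s2.mean()                     # prior mean: pooled variance
    return (d0 * s0_sq + d * s2) / (d0 + d)
```

Extreme per-feature variances (including the zeros that destabilize a mixture fit with few non-zero values) are pulled toward the pooled value, which is exactly the stabilization the dissertation argues for.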

    Empirical Bayes methods corrected for small numbers of tests

    Histogram-based empirical Bayes methods developed for analyzing data on large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features, such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null-hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. An application of the new estimators and previous estimators to protein abundance data illustrates how different estimators lead to very different conclusions about which proteins are affected by cancer. The estimators are compared using simulated data with two different numbers of features, two different detectability levels, and all possible numbers of affected features. The simulations show that some of the corrected MLEs substantially reduce a negative bias of the MLE. (The best-performing corrected MLE was derived from the minimum description length principle.) However, even the corrected MLEs have strong negative biases when the proportion of unaffected features is greater than 90%. Therefore, since the number of affected features is unknown in the case of real data, we recommend an optimally weighted combination of the best of the corrected MLEs with a conservative estimator that has weaker parametric assumptions. Comment: this version adds new methods and a simulation study.
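The uncorrected histogram-based estimator that such corrections start from can be sketched as follows. This is an illustrative Python rendering of the generic LFDR idea, not the paper's corrected MLEs or its MDL-derived variant, which are considerably more involved.

```python
import numpy as np

def lfdr_histogram(pvals, pi0, n_bins=20):
    """Histogram-based local false discovery rate: LFDR(p) is the
    posterior probability that the null holds given p, estimated as
    pi0 * f0(p) / f(p), with f0 the uniform null density (= 1) and
    f the histogram density of all p-values.  With few features the
    histogram is noisy, which is the bias problem the paper's
    corrections address."""
    p = np.asarray(pvals)
    counts, edges = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    f = counts / (len(p) * (1.0 / n_bins))  # density in each bin
    idx = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    return np.minimum(pi0 / f[idx], 1.0)
```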

    Combinatorial CRISPR-Cas9 screens for de novo mapping of genetic interactions.

    We developed a systematic approach to map human genetic networks by combinatorial CRISPR-Cas9 perturbations coupled to robust analysis of growth kinetics. We targeted all pairs of 73 cancer genes with dual guide RNAs in three cell lines, comprising 141,912 tests of interaction. Numerous therapeutically relevant interactions were identified, and these patterns replicated with combinatorial drugs at 75% precision. From these results, we anticipate that cellular context will be critical to synthetic-lethal therapies.
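A minimal version of pairwise interaction scoring under a multiplicative null looks like this. It is a generic sketch with hypothetical fitness values; the study's actual pipeline additionally models growth kinetics and replicate structure.

```python
def interaction_scores(single_fitness, double_fitness):
    """Genetic interaction (pi) score under a multiplicative null:
    the expected double-knockout fitness is the product of the two
    single-knockout fitnesses, and the score is observed minus
    expected.  Strongly negative scores flag candidate
    synthetic-lethal pairs."""
    scores = {}
    for (a, b), f_ab in double_fitness.items():
        expected = single_fitness[a] * single_fitness[b]
        scores[(a, b)] = f_ab - expected
    return scores
```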