116 research outputs found

    A Markov chain representation of the multiple testing problem

    The problem of multiple hypothesis testing can be represented as a Markov process in which a new alternative hypothesis is accepted according to its evidence relative to the currently accepted one. This virtual, not formally observed process provides the most probable set of non-null hypotheses given the data; it plays the same role as Markov Chain Monte Carlo in approximating a posterior distribution. To apply this representation and obtain the posterior probabilities over all alternative hypotheses, it is enough to have, for each test, partially specified Bayes Factors, e.g. Bayes Factors known only up to an unknown constant. Such Bayes Factors may arise either from using default and improper priors or from calibrating p-values with respect to their corresponding Bayes Factor lower bound. Both sources of evidence are used to form a Markov transition kernel on the space of hypotheses. The approach leads to easily interpretable results and involves very simple formulas, making it suitable for analyzing large datasets such as those arising from gene expression data (microarray or RNA-seq experiments).
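    As a hedged illustration of the idea (not the authors' exact construction), a Metropolis-type chain over the hypothesis indices with target proportional to the Bayes Factors shows why unknown constants do no harm: only ratios of Bayes Factors enter the acceptance step, and visit frequencies approximate the posterior weight of each alternative. The function and variable names below are my own.

    ```python
    import numpy as np

    def hypothesis_chain(log_bf, n_steps=100_000, rng=None):
        """Random walk over hypothesis indices: a proposed hypothesis j replaces
        the current one i with probability min(1, BF_j / BF_i).  Any unknown
        multiplicative constant shared by the Bayes Factors cancels in the ratio."""
        rng = np.random.default_rng() if rng is None else rng
        m = len(log_bf)
        visits = np.zeros(m)
        current = rng.integers(m)
        for _ in range(n_steps):
            proposal = rng.integers(m)
            if np.log(rng.random()) < log_bf[proposal] - log_bf[current]:
                current = proposal
            visits[current] += 1
        return visits / n_steps  # time spent on each alternative ~ its posterior weight

    # toy run: five tests, the first two with strong evidence against the null
    weights = hypothesis_chain(np.array([4.0, 3.5, 0.1, -0.2, 0.0]))
    ```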

    A new multitest correction (SGoF) that increases its statistical power when increasing the number of tests

    Background: The detection of true significant cases under multiple testing is becoming a fundamental issue when analyzing high-dimensional biological data. Unfortunately, known multitest adjustments lose statistical power as the number of tests increases. We propose a new multitest adjustment, based on a sequential goodness-of-fit metatest (SGoF), whose statistical power increases with the number of tests. The method is compared with Bonferroni and FDR-based alternatives by simulating a multitest context via two different kinds of tests: 1) the one-sample t-test, and 2) the homogeneity G-test. Results: SGoF behaves especially well with small sample sizes when 1) the alternative hypothesis deviates weakly to moderately from the null model, 2) effects are widespread across the family of tests, and 3) the number of tests is large. Conclusion: SGoF should therefore become an important tool for multitest adjustment when working with high-dimensional biological data.
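    To make the metatest idea concrete, here is a minimal sketch under my own simplifying assumptions (a one-sided binomial test at the same level as the p-value threshold, rather than the exact rules and G-test variants studied in the paper): the observed count of p-values below the threshold is compared with its Binomial(n, gamma) expectation under the complete null, and one effect is declared for each unit of excess that remains significant.

    ```python
    import numpy as np
    from scipy.stats import binom

    def sgof_sketch(pvalues, gamma=0.05):
        """Sequential goodness-of-fit sketch: count p-values <= gamma and keep
        declaring effects while the remaining count is still improbably large
        under the uniform (all-null) model."""
        p = np.sort(np.asarray(pvalues))
        n = len(p)
        observed = int(np.sum(p <= gamma))
        rejections = 0
        while observed - rejections > 0:
            # P(X >= observed - rejections) for X ~ Binomial(n, gamma)
            if binom.sf(observed - rejections - 1, n, gamma) > gamma:
                break
            rejections += 1
        return p[:rejections]  # the smallest p-values are declared significant

    significant = sgof_sketch(np.random.default_rng(1).uniform(size=1000) ** 3)
    ```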

    Nonparametric Estimation in a Compound Mixture Model and False Discovery Rate Control with Auxiliary Information

    In this thesis, we focus on two important statistical problems. The first is nonparametric estimation in a compound mixture model with application to a malaria study. The second is control of the false discovery rate in multiple hypothesis testing applications with auxiliary information. Malaria can be diagnosed by the presence of parasites and of symptoms (usually fever) due to parasites. In endemic areas, however, an individual may have fever attributable either to malaria or to other causes. Thus, the parasite level of an individual with fever follows a two-component mixture distribution, with the two components corresponding to malaria and nonmalaria individuals. Furthermore, the parasite levels of nonmalaria individuals can be characterized as a mixture of a zero component and a positive distribution, while the parasite levels of malaria individuals can only be positive. Therefore, the parasite level of an individual with fever follows a compound mixture model. In Chapter 2, we propose a maximum multinomial likelihood approach for estimating the unknown parameters/functions using parasite-level data from two groups of individuals: the first group contains only malaria individuals, while the second group is a mixture of malaria and nonmalaria individuals. We develop an EM algorithm to numerically calculate the maximum multinomial likelihood estimates and further establish their convergence rates. Simulation results show that the proposed maximum multinomial likelihood estimators are more efficient than existing nonparametric estimators. The proposed method is used to analyze a malaria survey dataset. In many multiple hypothesis testing applications, thousands of null hypotheses are tested simultaneously. For each null hypothesis, a test statistic and the corresponding p-value are usually calculated. Traditional rejection rules work on p-values and hence ignore the signs of the test statistics in two-sided tests. However, the signs may carry useful directional information in two-group comparison settings. In Chapter 3, we introduce a novel procedure, the signed-knockoff procedure, to utilize this directional information and control the false discovery rate in finite samples. We demonstrate the power advantage of our procedure through simulation studies and two real applications. In Chapter 4, we further extend the signed-knockoff procedure to incorporate additional information from covariates, which may be subject to missingness. We propose a new procedure, the covariate- and direction-adaptive knockoff procedure, and show that it controls the false discovery rate in finite samples. Simulation studies and real data analysis show that our procedure is competitive with existing covariate-adaptive methods. In Chapter 5, we summarize our contributions and outline several interesting topics worthy of further exploration.
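    The signed-knockoff idea in Chapter 3 lends itself to a short sketch. Under assumptions of my own choosing (signed statistics whose sign is symmetric under the null), the count of large negative statistics estimates the number of false positives among the large positive ones, which yields a knockoff-style data-dependent threshold; this illustrates the general mechanism, not the thesis's exact procedure.

    ```python
    import numpy as np

    def sign_filter(w, q=0.1):
        """Knockoff-style selection from signed statistics: pick the smallest
        threshold t at which the estimated false discovery proportion
        (1 + #{w <= -t}) / #{w >= t} drops below q."""
        w = np.asarray(w, dtype=float)
        for t in np.sort(np.abs(w[w != 0])):
            fdp_hat = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
            if fdp_hat <= q:
                return np.flatnonzero(w >= t)  # indices selected at estimated FDR <= q
        return np.array([], dtype=int)

    rng = np.random.default_rng(0)
    w = np.concatenate([rng.normal(3, 1, 50), rng.normal(0, 1, 950)])
    selected = sign_filter(w, q=0.1)
    ```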

    Assessing Significance in High-Throughput Experiments by Sequential Goodness of Fit and q-Value Estimation

    We developed a new multiple hypothesis testing adjustment, SGoF+, implemented as a sequential goodness-of-fit metatest. It modifies the earlier SGoF algorithm by taking advantage of the distribution of the p-values to fix the rejection region. The new method uses a discriminant rule, based on the maximum distance between the uniform distribution of p-values and the observed one, to set the null for a binomial test. This new approach shows a better power/pFDR ratio than SGoF; in fact, SGoF+ automatically sets the threshold leading to the maximum power and the minimum false non-discovery rate within the SGoF family of algorithms. Additionally, we suggest combining the information provided by SGoF+ with an estimate of the FDR incurred when rejecting a given set of nulls. We study different positive false discovery rate (pFDR) estimation methods to combine q-value estimates with the information provided by the SGoF+ method. Simulations suggest that the combination of the SGoF+ metatest with the q-value information is an interesting strategy for dealing with multiple testing issues. These techniques are provided in the latest version of the SGoF+ software, freely available at http://webs.uvigo.es/acraaj/SGoF.htm.
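    The discriminant rule can be sketched in a few lines: the rejection-region boundary is placed where the empirical distribution of the p-values departs most from the uniform, and the count of p-values below that boundary then feeds the binomial metatest as in SGoF. The code below is my own simplification, not the SGoF+ software.

    ```python
    import numpy as np

    def sgof_plus_threshold(pvalues):
        """Pick the threshold where the empirical CDF of the p-values exceeds
        the uniform CDF by the largest margin."""
        p = np.sort(np.asarray(pvalues))
        n = len(p)
        ecdf = np.arange(1, n + 1) / n
        gap = ecdf - p               # excess of small p-values over uniformity
        gamma = p[np.argmax(gap)]
        return gamma, int(np.sum(p <= gamma))  # boundary and observed count for the metatest
    ```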

    Genome-wide power calculation and experimental design in RNA-Seq experiment

    Next Generation Sequencing (NGS) technology is emerging as an appealing tool for characterizing the genomic profiles of a target population. However, the high sequencing expense and bioinformatic complexity will continue to be obstacles for many biomedical projects in the foreseeable future. Modelling of NGS data involves not only sample size and genome-wide power inference but also consideration of sequencing depth and count data properties. Given a total budget and pre-specified cost parameters, such as unit sequencing and sample collection costs, researchers usually seek a two-dimensional optimal decision. In this dissertation, I introduce a novel method, SeqDEsign, developed to predict the genome-wide power (EDR) of detecting differentially expressed (DE) genes in an RNA-Seq experiment at a target sample size (N') and read depth (R'), given pilot data (N, R). We aim to provide advice for researchers designing RNA-Seq experiments under a limited budget. The first part of this dissertation concerns predicting genome-wide power at N' with R fixed. The pipeline starts with hypothesis testing for DE gene detection based on the Wald test and a negative binomial assumption. We propose ways to model the p-value distribution directly with both parametric and semi-parametric mixture models. To predict the genome-wide power of DE gene detection at N', posterior approaches based on either a parametric or a non-parametric model are implemented. In the second part, we discuss ways to extend the power prediction to N' and R' simultaneously. Both a nested down-sampling (NDS) scheme and a model-based (MB) method are proposed and compared. The three-dimensional EDR surface Pow(N', R') is constructed with a two-way inverse power law model. Finally, we discuss the cost-benefit analysis of RNA-Seq experiments with the specification of a cost function, and explore answers to other practical questions of experimental design. The framework is illustrated both in simulations and in a real data application to rat RNA-Seq data. The public health relevance of this work lies in the development of a novel methodology for genome-wide power calculation in RNA-Seq experiments. By accurately predicting genome-wide power, researchers can detect more biologically meaningful biomarkers, promoting a better understanding of human disease.
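    One common parametric choice for modelling a p-value distribution is a uniform plus Beta(a, 1) mixture; the sketch below fits such a mixture to pilot p-values and reads off the expected discovery rate at a fixed cutoff. It is meant only to illustrate the kind of modelling the dissertation builds on: SeqDEsign's actual pipeline uses Wald tests under a negative binomial model, FDR-adjusted cutoffs, and extrapolation to (N', R'), none of which is reproduced here.

    ```python
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import beta

    def fit_uniform_beta_mixture(pvalues):
        """Fit pi0 * Uniform(0,1) + (1 - pi0) * Beta(a, 1) to p-values by
        maximum likelihood (unconstrained reparametrisation, Nelder-Mead)."""
        p = np.clip(np.asarray(pvalues, dtype=float), 1e-12, 1 - 1e-12)

        def nll(theta):
            pi0 = 1 / (1 + np.exp(-theta[0]))   # keep pi0 in (0, 1)
            a = np.exp(theta[1])                # keep a > 0
            return -np.sum(np.log(pi0 + (1 - pi0) * beta.pdf(p, a, 1)))

        res = minimize(nll, x0=[0.0, -1.0], method="Nelder-Mead")
        return 1 / (1 + np.exp(-res.x[0])), np.exp(res.x[1])

    def expected_discovery_rate(a, cutoff=0.05):
        # fraction of true alternatives whose p-value falls below the cutoff
        return beta.cdf(cutoff, a, 1)
    ```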

    Small data: practical modeling issues in human-model -omic data

    This thesis is based on the following articles:
    Chapter 2: Holsbø, E., Perduca, V., Bongo, L.A., Lund, E. & Birmelé, E. (Manuscript). Stratified time-course gene preselection shows a pre-diagnostic transcriptomic signal for metastasis in blood cells: a proof of concept from the NOWAC study. Available at https://doi.org/10.1101/141325.
    Chapter 3: Bøvelstad, H.M., Holsbø, E., Bongo, L.A. & Lund, E. (Manuscript). A Standard Operating Procedure For Outlier Removal In Large-Sample Epidemiological Transcriptomics Datasets. Available at https://doi.org/10.1101/144519.
    Chapter 4: Holsbø, E. & Perduca, V. (2018). Shrinkage estimation of rate statistics. Case Studies in Business, Industry and Government Statistics 7(1), 14-25. Also available at http://hdl.handle.net/10037/14678.

    Human-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. As measurements are comparatively cheap, the tendency is to measure as many things as possible for the few, valuable participants in a study. With -omics technologies it is cheap and simple to make hundreds of thousands of measurements simultaneously. This few observations–many measurements setting is, in technical language, a high-dimensional problem. Most gene expression experiments measure the expression levels of 10 000–15 000 genes for fewer than 100 subjects. I refer to this as the small data setting.
    This dissertation is an exercise in practical data analysis as it happens in a large epidemiological cohort study. It comprises three main projects: (i) predictive modeling of breast cancer metastasis from whole-blood transcriptomics measurements; (ii) standardizing microarray data quality assessment in the Norwegian Women and Cancer (NOWAC) postgenome cohort; and (iii) shrinkage estimation of rates. All three are small data analyses, for various reasons.
    Predictive modeling in the small data setting is very challenging. There are several modern methods built to tackle high-dimensional data, but there is a need to evaluate these methods against one another when analyzing data in practice. Through the metastasis prediction work we learned first-hand that common practices in machine learning can be inefficient or even harmful, especially for small data. I outline some of the more important issues.
    In a large project such as NOWAC there is a need to centralize and disseminate knowledge and procedures. The standardization of NOWAC quality assessment was a project born of necessity. The standard operating procedure for outlier removal was developed so that preprocessing of the NOWAC microarray material would happen in the same way every time. We take this procedure from an archaic R script that resided in people's email inboxes to a well-documented, open-source R package, and present the NOWAC guidelines for microarray quality control. The procedure is built around the inherently high value of a single observation.
    Small data are plagued by high variance. When working with small data it is usually profitable to bias models by shrinkage or by borrowing information from elsewhere. We present a pseudo-Bayesian estimator of rates in an informal crime rate study. We exhibit the value of such procedures in a small data setting and demonstrate some novel considerations about the coverage properties of such a procedure.
    In short, I gather some common practices in predictive modeling as applied to small data and assess their practical implications. I argue that, as biomedicine focuses increasingly on human-based datasets, these data need particular consideration within a small data paradigm to allow for reliable analysis. I present what I believe to be sensible guidelines.
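    The rate-shrinkage estimator of Chapter 4 has a simple empirical-Bayes flavour that can be sketched directly; the prior-strength default below is my own illustration, not the paper's exact choice. Each raw rate events/exposure is pulled toward the pooled rate, and units with little exposure are pulled the hardest.

    ```python
    import numpy as np

    def shrink_rates(events, exposure, prior_strength=None):
        """Pseudo-Bayesian rate shrinkage: blend each unit's raw rate with the
        pooled rate, weighting the pooled rate as `prior_strength` units of
        extra exposure."""
        events = np.asarray(events, dtype=float)
        exposure = np.asarray(exposure, dtype=float)
        pooled = events.sum() / exposure.sum()
        if prior_strength is None:
            # crude default: treat the prior as worth the median exposure
            prior_strength = np.median(exposure)
        return (events + prior_strength * pooled) / (exposure + prior_strength)

    # a unit with little exposure is shrunk strongly toward the pooled rate
    rates = shrink_rates(events=[1, 40, 300], exposure=[10, 1000, 10000])
    ```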

    Statistical Methods for Monte-Carlo based Multiple Hypothesis Testing

    Statistical hypothesis testing is a key technique for performing statistical inference. The main focus of this work is to investigate multiple testing under the assumption that the analytical p-values underlying the tests of all hypotheses are unknown. Instead, we assume that they can be approximated by drawing Monte Carlo samples under the null. The first part of this thesis focuses on the computation of test results with a guarantee on their correctness, that is, decisions on multiple hypotheses which are identical to the ones obtained with the unknown p-values. We present MMCTest, an algorithm implementing a multiple testing procedure which yields correct decisions on all hypotheses (up to a pre-specified error probability) based solely on Monte Carlo simulation. MMCTest offers novel ways to evaluate multiple hypotheses, as it allows one to obtain the (previously unknown) correct decisions on hypotheses (for instance, genes) in real data studies (again up to an error probability pre-specified by the user). The ideas behind MMCTest are generalised in a framework for Monte Carlo based multiple testing, demonstrating that existing methods giving no guarantees on their test results can be modified to yield certain theoretical guarantees on the correctness of their outputs. The second part deals with multiple testing from a practical perspective. In practice, one might prefer to forgo the additional computational effort needed to obtain guaranteed decisions and to invest it instead in the computation of a more accurate ad hoc test result. This is the aim of QuickMMCTest, an algorithm which adaptively allocates more samples to hypotheses whose decisions are more prone to random fluctuations, thereby achieving improved accuracy. This work also derives the optimal allocation of a finite number of samples to finitely many hypotheses under a normal approximation, where the optimal allocation is understood as the one minimising the expected number of erroneously classified hypotheses (with respect to the classification based on the analytical p-values). An empirical comparison of the optimal allocation of samples to the one computed by QuickMMCTest indicates that the behaviour of QuickMMCTest might not be far from optimal.
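    The core mechanism behind decisions that are correct up to a pre-specified error probability can be sketched with Clopper-Pearson intervals for the Monte Carlo p-value estimates: a hypothesis is settled only once its whole interval lies on one side of the rejection threshold, and everything else remains undecided (and, in an adaptive scheme in the spirit of QuickMMCTest, would receive further samples). The sketch uses a fixed threshold for simplicity; handling the data-dependent step-up threshold of a multiple testing procedure is precisely what MMCTest adds on top, and the names below are my own.

    ```python
    import numpy as np
    from scipy.stats import beta

    def mc_pvalue_intervals(exceed, draws, eps=1e-3):
        """Clopper-Pearson intervals for Monte Carlo p-values: exceed[i] counts
        null-simulated statistics at least as extreme as the observed one for
        hypothesis i, out of draws[i] simulations."""
        exceed = np.asarray(exceed, dtype=float)
        draws = np.asarray(draws, dtype=float)
        lower = np.nan_to_num(beta.ppf(eps / 2, exceed, draws - exceed + 1), nan=0.0)
        upper = np.nan_to_num(beta.ppf(1 - eps / 2, exceed + 1, draws - exceed), nan=1.0)
        return lower, upper

    def classify(exceed, draws, threshold=0.05, eps=1e-3):
        """Reject or accept only when the whole interval is on one side of the
        threshold; 'undecided' hypotheses need more Monte Carlo samples."""
        lo, hi = mc_pvalue_intervals(exceed, draws, eps)
        return np.where(hi < threshold, "reject",
                        np.where(lo > threshold, "accept", "undecided"))
    ```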