    Application of a Variable Importance Measure Method to HIV-1 Sequence Data

    van der Laan (2005) proposed a method to construct variable importance measures and provided the corresponding statistical inference. The technique quantifies the importance of a variable in predicting an outcome. The method can be applied as an inverse probability of treatment weighted (IPTW) or double robust inverse probability of treatment weighted (DR-IPTW) estimator. The significance of the estimator is determined by estimating its influence curve, and hence the corresponding variance and p-value. This article applies the van der Laan (2005) variable importance measures and corresponding inference to HIV-1 sequence data. In this data application, protease and reverse transcriptase codon positions on the HIV-1 strand are assessed to determine their respective variable importance with respect to an outcome of viral replication capacity. We estimate the W-adjusted variable importance measure for a specified set of potential effect modifiers W. Both the IPTW and DR-IPTW methods were implemented on this dataset.
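    A minimal Python sketch of the IPTW idea described above, assuming a binary codon indicator A (mutant/non-mutant), a continuous outcome Y (replication capacity), and adjustment covariates W; the logistic propensity model, the clipping of the weights, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

        import numpy as np
        from scipy import stats
        from sklearn.linear_model import LogisticRegression

        def iptw_variable_importance(A, Y, W):
            """IPTW estimate of E[Y(1)] - E[Y(0)] with influence-curve-based inference.

            A : (n,) binary mutation indicator for one codon position
            Y : (n,) continuous outcome (e.g., replication capacity)
            W : (n, p) adjustment covariates
            Simplification: the influence curve below treats the fitted
            propensity g(A|W) as known, which gives conservative inference.
            """
            n = len(Y)
            g1 = LogisticRegression(max_iter=1000).fit(W, A).predict_proba(W)[:, 1]
            g1 = np.clip(g1, 0.025, 0.975)            # guard against extreme weights
            weights = A / g1 - (1 - A) / (1 - g1)     # (2A - 1) / g(A|W)
            psi = np.mean(weights * Y)                # IPTW point estimate
            ic = weights * Y - psi                    # estimated influence curve
            se = ic.std(ddof=1) / np.sqrt(n)
            pval = 2 * stats.norm.sf(abs(psi / se))
            return psi, se, pval

        # toy example with simulated data
        rng = np.random.default_rng(0)
        W = rng.normal(size=(300, 3))
        A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))
        Y = 0.5 * A + W[:, 0] + rng.normal(size=300)
        print(iptw_variable_importance(A, Y, W))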

    Multiple Testing Procedures for Controlling Tail Probability Error Rates

    The present article discusses and compares multiple testing procedures (MTP) for controlling Type I error rates defined as tail probabilities for the number (gFWER) and proportion (TPPFP) of false positives among the rejected hypotheses. Specifically, we consider the gFWER- and TPPFP-controlling MTPs proposed recently by Lehmann & Romano (2004) and in a series of four articles by Dudoit et al. (2004), van der Laan et al. (2004b,a), and Pollard & van der Laan (2004). The Lehmann & Romano (2004) procedures are marginal, in the sense that they are based solely on the marginal distributions of the test statistics, i.e., on cut-off rules for the corresponding unadjusted p-values. In contrast, the procedures discussed in our previous articles take into account the joint distribution of the test statistics and apply to general data generating distributions, i.e., arbitrary dependence structures among the test statistics. The gFWER-controlling common-cut-off and common-quantile procedures of Dudoit et al. (2004) and Pollard & van der Laan (2004) are based on the distributions of maxima of test statistics and minima of unadjusted p-values, respectively. For a suitably chosen initial FWER-controlling procedure, the gFWER- and TPPFP-controlling augmentation multiple testing procedures (AMTP) of van der Laan et al. (2004a) can also take into account the joint distribution of the test statistics. Given a gFWER-controlling procedure, we also propose AMTPs for controlling tail probability error rates, Pr(g(V_n, R_n) > q), for arbitrary functions g(V_n, R_n) of the numbers of false positives V_n and rejected hypotheses R_n. The different gFWER- and TPPFP-controlling procedures are compared in a simulation study, where the tests concern the components of the mean vector of a multivariate Gaussian data generating distribution. Among the notable findings are the substantial power gains achieved by joint procedures compared to marginal procedures.
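    A short sketch of the augmentation idea described above, assuming one already has FWER-adjusted p-values (for instance from a maxT-type procedure); the function names and the adjusted-p-value interface are assumptions made for illustration.

        import numpy as np

        def gfwer_augmentation(adj_p, alpha, k):
            """Reject the initial FWER-controlling set plus the k next most
            significant hypotheses; controls gFWER(k) = Pr(V_n > k) at alpha."""
            adj_p = np.asarray(adj_p)
            order = np.argsort(adj_p)
            r0 = int(np.sum(adj_p <= alpha))          # initial FWER rejections
            return np.sort(order[:r0 + k])            # add k more rejections

        def tppfp_augmentation(adj_p, alpha, q):
            """Reject the initial FWER-controlling set plus the largest number m
            of extra hypotheses with m / (m + r0) <= q; controls Pr(V_n/R_n > q)."""
            adj_p = np.asarray(adj_p)
            order = np.argsort(adj_p)
            r0 = int(np.sum(adj_p <= alpha))
            m = int(np.floor(q * r0 / (1 - q))) if r0 > 0 else 0
            return np.sort(order[:r0 + m])

        # toy example: ten FWER-adjusted p-values
        adj_p = [0.001, 0.004, 0.02, 0.03, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
        print(gfwer_augmentation(adj_p, alpha=0.05, k=2))
        print(tppfp_augmentation(adj_p, alpha=0.05, q=0.1))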

    Multiple Testing and Data Adaptive Regression: An Application to HIV-1 Sequence Data

    Analysis of viral strand sequence data and viral replication capacity could potentially lead to biological insights regarding the replication ability of HIV-1. Determining specific target codons on the viral strand will facilitate the manufacturing of target-specific antiretrovirals. Various algorithmic and analysis techniques can be applied to this problem. We propose using multiple testing to find codons which have significant univariate associations with the replication capacity of the virus. We also propose using a data adaptive multiple regression algorithm to obtain multiple predictions of viral replication capacity based on an entire mutant/non-mutant sequence profile. The data set to which these techniques were applied consists of 317 patients, each with 282 sequenced protease and reverse transcriptase codons. Initially, the multiple testing procedure of Pollard and van der Laan (2003) was applied to the individual-specific viral sequence data. A single-step multiple testing procedure was used to control the family-wise error rate (FWER) at the five percent alpha level. Additional augmentation multiple testing procedures were applied to control the generalized family-wise error rate (gFWER) or the tail probability of the proportion of false positives (TPPFP). Finally, the loss-based, cross-validated Deletion/Substitution/Addition regression algorithm (Sinisi and van der Laan, 2004) was applied to the dataset separately. This algorithm builds candidate estimators in the prediction of a univariate outcome by minimizing an empirical risk, and it uses cross-validation to select fine-tuning parameters such as the size of the regression model, the maximum allowed order of interaction among terms, and the dimension of the vector of covariates. The algorithm is also used to measure the variable importance of the codons. Findings from these multiple analyses are consistent with biological findings and could possibly lead to further biological knowledge regarding HIV-1 viral data.
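    The Deletion/Substitution/Addition algorithm itself is involved; the sketch below only illustrates the cross-validation ingredient mentioned above, i.e., choosing a fine-tuning parameter (here, model size for nested linear models) by minimizing cross-validated empirical risk. The correlation-based variable ranking and all names are assumptions for illustration, not the published algorithm.

        import numpy as np

        def cv_risk_for_size(X, y, size, n_folds=5, seed=0):
            """Cross-validated mean squared error of an OLS fit that uses the
            `size` covariates most correlated with the outcome in each training
            fold (a crude stand-in for one step of a data adaptive algorithm)."""
            folds = np.random.default_rng(seed).permutation(len(y)) % n_folds
            risks = []
            for f in range(n_folds):
                tr, te = folds != f, folds == f
                corr = np.corrcoef(X[tr].T, y[tr])[:-1, -1]
                keep = np.argsort(-np.abs(corr))[:size]
                Xtr = np.column_stack([np.ones(tr.sum()), X[tr][:, keep]])
                Xte = np.column_stack([np.ones(te.sum()), X[te][:, keep]])
                beta, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
                risks.append(np.mean((y[te] - Xte @ beta) ** 2))
            return np.mean(risks)

        # select the model size (number of covariates) with smallest CV risk
        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 20))
        y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)
        best_size = min(range(1, 11), key=lambda s: cv_risk_for_size(X, y, s))
        print("cross-validated model size:", best_size)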

    Resampling Based Multiple Testing Procedure Controlling Tail Probability of the Proportion of False Positives

    Simultaneously testing a collection of null hypotheses about a data generating distribution, based on a sample of independent and identically distributed observations, is a fundamental and important statistical problem with many applications. In this article we propose a new resampling-based multiple testing procedure asymptotically controlling the probability that the proportion of false positives among the set of rejections exceeds q at level alpha, where q and alpha are user-supplied numbers. The procedure involves 1) specifying a conditional distribution for a guessed set of true null hypotheses, given the data, which asymptotically is degenerate at the true set of null hypotheses, and 2) specifying a generally valid null distribution for the vector of test statistics, proposed in Pollard and van der Laan (2003) and generalized in our subsequent articles Dudoit et al. (2004), van der Laan et al. (2004a), and van der Laan et al. (2004b). We establish the finite sample rationale behind our proposal, and prove that this new multiple testing procedure asymptotically controls the desired tail probability for the proportion of false positives under general data generating distributions. In addition, we provide simulation studies establishing that this method is generally more powerful in finite samples than our previously proposed augmentation multiple testing procedure (van der Laan et al. (2004b)) and competing procedures from the literature. Finally, we illustrate our methodology with a data analysis.
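    A heavily simplified sketch of the two ingredients mentioned above: random draws of a guessed set of true nulls (from hypothesis-specific probabilities) combined with a resampled null distribution of the test statistics, used to pick a common rejection cutoff whose estimated tail probability for the proportion of false positives stays below alpha. All inputs, names, and the cutoff search are illustrative assumptions and do not reproduce the exact construction of the article.

        import numpy as np

        def tppfp_common_cutoff(t_obs, z_null, p_true_null, q, alpha):
            """Pick the smallest common cutoff c with estimated
            Pr( V(c) / max(R(c), 1) > q ) <= alpha.

            t_obs       : (m,) observed test statistics
            z_null      : (B, m) resampled draws from a null distribution
            p_true_null : (m,) guessed probabilities that each null is true
            """
            rng = np.random.default_rng(0)
            B, m = z_null.shape
            H0 = rng.random((B, m)) < p_true_null             # guessed true nulls
            for c in np.sort(np.abs(t_obs)):                  # candidate cutoffs
                R = np.sum(np.abs(t_obs) > c)                 # observed rejections
                V = np.sum(H0 & (np.abs(z_null) > c), axis=1) # simulated false pos.
                if np.mean(V / max(R, 1) > q) <= alpha:
                    return c, np.flatnonzero(np.abs(t_obs) > c)
            return np.inf, np.array([], dtype=int)

        # toy example: 50 hypotheses, the first 5 carry a real signal
        rng = np.random.default_rng(2)
        t_obs = np.concatenate([rng.normal(4, 1, 5), rng.normal(0, 1, 45)])
        z_null = rng.normal(size=(2000, 50))
        p_true_null = np.full(50, 0.9)
        print(tppfp_common_cutoff(t_obs, z_null, p_true_null, q=0.1, alpha=0.05))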

    Application of a Multiple Testing Procedure Controlling the Proportion of False Positives to Protein and Bacterial Data

    Simultaneously testing multiple hypotheses is important in high-dimensional biological studies. In these situations, one is often interested in controlling a Type I error rate, such as the tail probability of the proportion of false positives among the total rejections (TPPFP), at a specific level alpha. This article presents an application of the E-Bayes/Bootstrap TPPFP procedure of van der Laan et al. (2005), which controls the tail probability of the proportion of false positives, to two biological datasets. The first application is to a mass-spectrometry dataset of two leukemia subtypes, AML and ALL. The protein data measurements include intensity and mass-to-charge (m/z) ratios of bone marrow samples, with two replicates per sample. We apply techniques to preprocess the data, i.e., correct for baseline shift and appropriately smooth the intensity profiles over the m/z values. After preprocessing, we apply the TPPFP multiple testing technique (van der Laan et al. (2005)) to test the difference between the two groups of patients (AML/ALL) with respect to their intensity values over various m/z ratios, which corresponds to testing proteins of different sizes. Second, we illustrate the E-Bayes/Bootstrap TPPFP procedure on a bacterial dataset. In this application we are interested in finding bacteria whose mean difference over time points differs between two U.S. cities. For both data applications, we also compare against the van der Laan et al. (2004b) TPPFP augmentation method, and find that the E-Bayes/Bootstrap TPPFP method is less conservative, rejecting more tests at a specific alpha level.
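    A small sketch of the kind of preprocessing described above (baseline shift correction followed by smoothing of an intensity profile over m/z), using a running-minimum baseline and a moving-average smoother as stand-ins for the actual methods; the window sizes, function names, and simulated spectrum are assumptions.

        import numpy as np

        def correct_baseline(intensity, window=101):
            """Subtract a running-minimum baseline to correct for baseline shift."""
            half = window // 2
            baseline = np.array([intensity[max(0, i - half):i + half + 1].min()
                                 for i in range(len(intensity))])
            return intensity - baseline

        def smooth(intensity, window=11):
            """Moving-average smoothing of the intensity profile over m/z."""
            kernel = np.ones(window) / window
            return np.convolve(intensity, kernel, mode="same")

        # toy spectrum: slow baseline drift plus two peaks plus noise
        rng = np.random.default_rng(3)
        mz = np.linspace(1000, 20000, 5000)
        spectrum = (0.0005 * mz
                    + 50 * np.exp(-0.5 * ((mz - 5000) / 30) ** 2)
                    + 30 * np.exp(-0.5 * ((mz - 11000) / 40) ** 2)
                    + rng.normal(0, 2, mz.size))
        processed = smooth(correct_baseline(spectrum), window=15)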

    Data Adaptive Pathway Testing

    A majority of diseases are caused by a combination of factors; for example, composite genetic mutation profiles have been found in many cases to predict a deleterious outcome. Several statistical techniques have been used to analyze these types of biological data. This article implements a general strategy which uses data adaptive regression methods to build a specific pathway model, thus predicting a disease outcome from a combination of biological factors, and assesses the significance of this model, or pathway, using a permutation-based null distribution. We also provide several simulation comparisons with other techniques. In addition, the method is applied in several different ways to an HIV-1 dataset in order to assess the potential biological pathways in the data.
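    A compact sketch of the permutation step described above: fit a placeholder data adaptive regression for the pathway, score it by the correlation between fitted values and the outcome, and compare that score with its permutation null distribution. The choice of a regression tree as the adaptive fitter, the score, and all names are assumptions for illustration.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def pathway_score(X, y, max_depth=3):
            """Score a fitted pathway model by the correlation between its
            fitted values and the outcome (a simple association measure)."""
            fit = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
            return np.corrcoef(fit.predict(X), y)[0, 1]

        def permutation_pvalue(X, y, n_perm=500, seed=4):
            """Permutation p-value: refit the model on outcome-permuted data
            to build the null distribution of the pathway score."""
            rng = np.random.default_rng(seed)
            observed = pathway_score(X, y)
            null = np.array([pathway_score(X, rng.permutation(y))
                             for _ in range(n_perm)])
            return (1 + np.sum(null >= observed)) / (n_perm + 1)

        # toy example: outcome driven by an interaction of two mutation indicators
        rng = np.random.default_rng(5)
        X = rng.binomial(1, 0.3, size=(150, 10))
        y = 1.5 * X[:, 0] * X[:, 1] + rng.normal(size=150)
        print(permutation_pvalue(X, y))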

    Multiple Testing Procedures and Applications to Genomics

    This chapter proposes widely applicable resampling-based single-step and stepwise multiple testing procedures (MTP) for controlling a broad class of Type I error rates, in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics (Dudoit and van der Laan, 2005; Dudoit et al., 2004a,b; van der Laan et al., 2004a,b; Pollard and van der Laan, 2004; Pollard et al., 2005). Procedures are provided to control Type I error rates defined as tail probabilities for arbitrary functions of the numbers of Type I errors, V_n, and rejected hypotheses, R_n. These error rates include the generalized family-wise error rate, gFWER(k) = Pr(V_n > k), i.e., the chance of at least (k+1) false positives (the special case k=0 corresponds to the usual family-wise error rate, FWER), and tail probabilities for the proportion of false positives among the rejected hypotheses, TPPFP(q) = Pr(V_n/R_n > q). Single-step and step-down common-cut-off (maxT) and common-quantile (minP) procedures, which take into account the joint distribution of the test statistics, are proposed to control the FWER. In addition, augmentation multiple testing procedures are provided to control the gFWER and TPPFP, based on any initial FWER-controlling procedure. The results of a multiple testing procedure can be summarized using rejection regions for the test statistics, confidence regions for the parameters of interest, or adjusted p-values. A key ingredient of our proposed MTPs is the test statistics null distribution (and a consistent bootstrap estimator thereof) used to derive rejection regions and corresponding confidence regions and adjusted p-values. This chapter illustrates an implementation in SAS (Version 9) of the bootstrap-based single-step maxT procedure and of the gFWER- and TPPFP-controlling augmentation procedures. These multiple testing procedures are applied to an HIV-1 sequence dataset to identify codon positions associated with viral replication capacity.
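    A bare-bones sketch of a bootstrap-based single-step maxT procedure for FWER control (not the SAS implementation discussed in the chapter): t-type statistics for each variable, a bootstrap distribution centered at the observed statistics as a simple stand-in for the null distribution, and adjusted p-values from the maxima over variables. Names, the centering choice, and the simulated data are assumptions.

        import numpy as np

        def single_step_maxt(X, y, n_boot=2000, seed=6):
            """Adjusted p-values for two-sample t-type statistics, one per column
            of X, based on the bootstrap distribution of the max absolute statistic.

            X : (n, m) data matrix, y : (n,) binary group labels
            """
            rng = np.random.default_rng(seed)
            g0, g1 = X[y == 0], X[y == 1]

            def tstats(a, b):
                se = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                             b.var(axis=0, ddof=1) / len(b)) + 1e-12
                return (a.mean(axis=0) - b.mean(axis=0)) / se

            t_obs = tstats(g0, g1)
            max_null = np.empty(n_boot)
            for b in range(n_boot):
                # resample within groups; centering at the observed statistics is a
                # simple stand-in for a null-transformed bootstrap distribution
                b0 = g0[rng.integers(0, len(g0), len(g0))]
                b1 = g1[rng.integers(0, len(g1), len(g1))]
                max_null[b] = np.max(np.abs(tstats(b0, b1) - t_obs))
            # single-step maxT adjusted p-value for each hypothesis
            return np.mean(max_null[:, None] >= np.abs(t_obs)[None, :], axis=0)

        # toy example: 100 variables, the first 3 differ between the two groups
        rng = np.random.default_rng(7)
        y = np.repeat([0, 1], 60)
        X = rng.normal(size=(120, 100))
        X[y == 1, :3] += 1.0
        adj_p = single_step_maxt(X, y)
        print(np.flatnonzero(adj_p <= 0.05))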

    Issues of Processing and Multiple Testing of SELDI-TOF MS Proteomic Data

    A new data filtering method for SELDI-TOF MS proteomic spectra is described. We examined technical repeats (two per subject) of intensity versus m/z (mass/charge) of bone marrow cell lysate for two groups of childhood leukemia patients: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). As others have noted, the type of data processing as well as experimental variability can have a disproportionate impact on the list of interesting proteins (see Baggerly et al. (2004)). We propose a sequence of processing and multiple testing techniques: 1) correction for background drift; 2) filtering using smooth regression with cross-validated bandwidth selection; 3) peak finding; and 4) methods to correct for multiple testing (van der Laan et al. (2005)). The result is a list of proteins (indexed by m/z) whose average expression is significantly different among disease (or treatment, etc.) groups. The procedures are intended to provide a sensible and statistically driven algorithm, which we argue yields a list of proteins that have a significant difference in expression. Given no sources of unmeasured bias (such as confounding of experimental conditions with disease status), proteins found to be statistically significant using this technique have a low probability of being false positives.
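    A sketch of the cross-validated bandwidth selection step mentioned above, using a Nadaraya-Watson (Gaussian kernel) smoother of intensity on m/z and K-fold cross-validation of squared error; the kernel choice, the candidate bandwidth grid, and the simulated spectrum segment are assumptions.

        import numpy as np

        def kernel_smooth(x_train, y_train, x_eval, bandwidth):
            """Nadaraya-Watson regression with a Gaussian kernel."""
            w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / bandwidth) ** 2)
            return (w @ y_train) / w.sum(axis=1)

        def cv_bandwidth(x, y, bandwidths, n_folds=5, seed=8):
            """Pick the bandwidth minimizing K-fold cross-validated squared error."""
            folds = np.random.default_rng(seed).permutation(len(x)) % n_folds
            def risk(h):
                errs = []
                for f in range(n_folds):
                    tr, te = folds != f, folds == f
                    pred = kernel_smooth(x[tr], y[tr], x[te], h)
                    errs.append(np.mean((y[te] - pred) ** 2))
                return np.mean(errs)
            return min(bandwidths, key=risk)

        # toy spectrum segment: one noisy peak over a short m/z range
        rng = np.random.default_rng(9)
        mz = np.linspace(4000, 6000, 400)
        intensity = 40 * np.exp(-0.5 * ((mz - 5000) / 50) ** 2) + rng.normal(0, 3, 400)
        h = cv_bandwidth(mz, intensity, bandwidths=[5, 10, 25, 50, 100])
        smoothed = kernel_smooth(mz, intensity, mz, h)
        print("selected bandwidth:", h)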

    Application of a Variable Importance Measure Method

    Van der Laan (2005) proposed a targeted method for constructing variable importance measures, coupled with the respective statistical inference. The technique quantifies the importance of a variable in predicting an outcome. The method can be applied as an inverse probability of treatment weighted (IPTW) or double robust inverse probability of treatment weighted (DR-IPTW) estimator. The variance and p-value of the estimate are calculated by estimating the influence curve. This article applies the van der Laan (2005) variable importance measures and corresponding inference to HIV-1 sequence data. In this application, the method is targeted at every codon position: protease and reverse transcriptase codon positions on the HIV-1 strand are assessed to determine their respective variable importance with respect to an outcome of viral replication capacity. We estimate the DR-IPTW W-adjusted variable importance measure for a specified set of potential effect modifiers W. In addition, simulations were performed on two separate datasets to examine the DR-IPTW estimator.
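    To complement the IPTW sketch given after the first abstract, here is a minimal sketch of the double robust (DR-IPTW) form: an outcome regression is combined with the inverse-probability weights so that the estimate is consistent if either working model is correct. The linear and logistic working models and all names are illustrative assumptions, not the authors' implementation.

        import numpy as np
        from scipy import stats
        from sklearn.linear_model import LinearRegression, LogisticRegression

        def dr_iptw_variable_importance(A, Y, W):
            """DR-IPTW estimate of E[Y(1)] - E[Y(0)] with influence-curve inference."""
            n = len(Y)
            g1 = LogisticRegression(max_iter=1000).fit(W, A).predict_proba(W)[:, 1]
            g1 = np.clip(g1, 0.025, 0.975)
            # outcome working model Qbar(A, W), fit with A as an extra covariate
            q_fit = LinearRegression().fit(np.column_stack([A, W]), Y)
            Q1 = q_fit.predict(np.column_stack([np.ones(n), W]))
            Q0 = q_fit.predict(np.column_stack([np.zeros(n), W]))
            QA = np.where(A == 1, Q1, Q0)
            weights = A / g1 - (1 - A) / (1 - g1)       # (2A - 1) / g(A|W)
            ic = weights * (Y - QA) + Q1 - Q0           # double robust term
            psi = np.mean(ic)                           # DR-IPTW point estimate
            se = (ic - psi).std(ddof=1) / np.sqrt(n)
            pval = 2 * stats.norm.sf(abs(psi / se))
            return psi, se, pval

        # toy example with simulated data
        rng = np.random.default_rng(10)
        W = rng.normal(size=(300, 3))
        A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))
        Y = 0.7 * A + W[:, 0] + rng.normal(size=300)
        print(dr_iptw_variable_importance(A, Y, W))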