
    A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

    The widespread availability of high-dimensional biological data has made the simultaneous screening of numerous biological characteristics a central statistical problem in computational biology. While the dimensionality of such datasets continues to increase, the problem of teasing out the effects of biomarkers in studies that measure baseline confounders, while avoiding model misspecification, remains only partially addressed. Efficient estimators constructed from data adaptive estimates of the data-generating distribution provide an avenue for avoiding model misspecification; however, in high-dimensional problems requiring simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type-I error control under standard multiple testing corrections. We formulate a general approach for applying empirical Bayes shrinkage to asymptotically linear estimators of parameters defined in the nonparametric model. The proposal applies existing shrinkage estimators to the estimated variance of the influence function, allowing for increased inferential stability in high-dimensional settings. We introduce a methodology for nonparametric variable importance analysis suited to high-dimensional biological datasets with modest sample sizes, and demonstrate that the proposed technique remains robust in small samples even when relying on data adaptive estimators that eschew parametric forms. Use of the proposed variance moderation strategy in constructing stabilized variable importance measures of biomarkers is demonstrated by application to an observational study of occupational exposure. The result is a data adaptive approach for robustly uncovering stable associations in high-dimensional data with limited sample sizes.
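
    A minimal sketch of the core idea, shrinking per-feature variances of estimated influence functions toward a common prior (limma-style moderation). The prior degrees of freedom d0 and the plug-in prior variance below are illustrative assumptions, not the paper's moment-based estimates, and the influence functions are synthetic.

```python
import numpy as np

def moderated_variances(s2, df, d0=4.0, s02=None):
    """Shrink raw variances s2 (one per feature, each with `df` degrees
    of freedom) toward a common prior variance s02 (empirical Bayes)."""
    s2 = np.asarray(s2, dtype=float)
    if s02 is None:
        s02 = s2.mean()              # crude plug-in prior; an assumption
    return (d0 * s02 + df * s2) / (d0 + df)

rng = np.random.default_rng(0)
n, p = 40, 1000                       # modest sample size, many biomarkers
ic = rng.normal(size=(p, n))          # stand-in for estimated influence functions
raw_var = ic.var(axis=1, ddof=1)      # per-feature influence-function variance
mod_var = moderated_variances(raw_var, df=n - 1)
print(raw_var[:3], mod_var[:3])       # moderated values pulled toward the prior
```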

    Empirical Bayes Approach to Controlling Familywise Error: An Application to HIV Resistance Data

    Statistical challenges arise in identifying meaningful patterns and structures in high-dimensional genomic data sets. Relating HIV genotype (sequence of amino acids) to phenotypic resistance presents a typical problem. When the HIV virus is under antiretroviral drug pressure, unfavorable mutations of the target genes often lead to greatly increased resistance of the virus to drugs, including drugs the virus has not been exposed to. Identification of mutation combinations and their correlation with drug resistance is critical in guiding efficient prescription of HIV drugs. The identification of a subset of codons associated with drug resistance from a set of several hundred codons presents a multiple testing problem. Statistical issues arising in multiple testing procedures for genomic data include the choice of the null test-statistic distribution used to define cut-offs. Controlling the familywise error rate implies controlling the number of false positives among true nulls. Given the large number of hypotheses to be tested, the number of true nulls is unknown. We apply two multiple testing procedures (MTPs) controlling the familywise error rate: an ad hoc augmented-Bonferroni method and an Empirical Bayes procedure originally proposed in van der Laan, Birkner and Hubbard (2005). Using simulations, we demonstrate that the proposed MTPs are less conservative than traditional methods such as the Bonferroni and Holm procedures. We apply the methods to HIV resistance data, where we wish to identify mutations in the protease gene associated with Amprenavir resistance.
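
    For context, a short sketch of the baseline FWER procedures the paper benchmarks against (Bonferroni and Holm step-down), via statsmodels. The p-values are synthetic; the paper's augmented-Bonferroni and Empirical Bayes procedures are not reproduced here.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
m = 300                                   # e.g., several hundred codons
pvals = np.concatenate([rng.uniform(size=m - 10),
                        rng.uniform(0, 1e-4, size=10)])  # 10 true signals

for method in ("bonferroni", "holm"):
    reject, _, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, "rejections:", reject.sum())  # Holm is uniformly less conservative
```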

    Finding the signal in the noise: Could social media be utilized for early hospital notification of multiple casualty events?

    Introduction: Delayed notification and lack of early information hinder timely hospital-based activations in large-scale multiple casualty events. We hypothesized that Twitter real-time data would produce a unique and reproducible signal within minutes of multiple casualty events, and we investigated the timing of the signal compared with other hospital disaster notification mechanisms. Methods: Using disaster-specific search terms, all relevant tweets from the event to 7 days post-event were analyzed for 5 recent US-based multiple casualty events (Boston Bombing [BB], SF Plane Crash [SF], Napa Earthquake [NE], Sandy Hook [SH], and Marysville Shooting [MV]). Quantitative and qualitative analyses of tweet utilization were compared across events. Results: Over 3.8 million tweets were analyzed (SH 1.8m, BB 1.1m, SF 430k, MV 250k, NE 205k). Peak tweets per minute ranged from 209 to 3326. The mean followers per tweeter ranged from 3382 to 9992 across events. Retweets were tweeted a mean of 82 to 564 times per event. Tweets occurred very rapidly for all events (<2 min) and reached 1% of the total event-specific tweets within a median of 13 minutes of the first 911 calls. A 200 tweets/min threshold was reached fastest with NE (2 min), BB (7 min), and SF (18 min). If this threshold were used as a signaling mechanism to place local hospitals on standby for possible large-scale events, in all case studies this signal would have preceded patient arrival. Importantly, this signaling threshold would also have preceded traditional disaster notification mechanisms in SF and NE, and was simultaneous with them in BB and MV. Conclusions: Social media data have demonstrated that this mechanism is a powerful, predictable, and potentially important resource for optimizing disaster response. Further investigation is warranted to assess the utility of prospective signaling thresholds for hospital-based activation.
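
    An illustrative sketch of the 200 tweets/min signaling rule: bin tweet timestamps per minute and report the first minute the rate crosses the threshold. The tweet stream below is synthetic; the study's actual search and filtering pipeline is not reproduced.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stream: background chatter, then a burst starting at minute 5.
secs = np.concatenate([
    rng.uniform(0, 600, size=500),        # ~50 tweets/min background
    rng.uniform(300, 600, size=2000),     # burst of +400 tweets/min after min 5
])
ts = pd.to_datetime("2024-01-01 12:00:00") + pd.to_timedelta(secs, unit="s")

per_min = pd.Series(1, index=ts).sort_index().resample("1min").sum()
THRESHOLD = 200                           # tweets/min, as in the study
crossed = per_min[per_min >= THRESHOLD]
if not crossed.empty:
    print("standby signal at", crossed.index[0])
```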

    Population Intervention Models in Causal Inference

    Marginal structural models (MSM) provide a powerful tool for estimating the causal effect of a treatment variable or risk variable on the distribution of a disease in a population. These models, as originally introduced by Robins (e.g., Robins (2000a), Robins (2000b), van der Laan and Robins (2002)), model the marginal distributions of treatment-specific counterfactual outcomes, possibly conditional on a subset of the baseline covariates, and their dependence on treatment. Marginal structural models are particularly useful in the context of longitudinal data structures, in which each subject's treatment and covariate history are measured over time and an outcome is recorded at a final time point. In addition to the simpler, weighted regression approaches (inverse probability of treatment weighted estimators), more general (and robust) estimators have been developed and studied in detail for standard MSM (Robins (2000b), Neugebauer and van der Laan (2004), Yu and van der Laan (2003), van der Laan and Robins (2002)). In this paper we argue that in many applications one is interested in modeling the difference between a treatment-specific counterfactual population distribution and the actual population distribution of the target population of interest. The relevant parameters describe the effect of a hypothetical intervention on such a population, and we therefore refer to these models as intervention models. We focus on intervention models estimating the effect of an intervention in terms of a difference in means, a ratio in means (e.g., relative risk if the outcome is binary), a so-called switch relative risk for binary outcomes, and a difference in entire distributions as measured by the quantile-quantile function. In addition, we provide a class of inverse probability of treatment weighted estimators, and double robust estimators, of the causal parameters in these models. We illustrate the finite sample performance of these new estimators in a simulation study.
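
    A minimal sketch of the difference-in-means intervention parameter, psi = E[Y_1] - E[Y] (counterfactual mean under treatment minus the actual population mean), estimated by inverse probability of treatment weighting. The data-generating process and logistic treatment model below are assumptions for illustration; the paper's double robust estimator is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
W = rng.normal(size=(n, 2))                       # baseline covariates
pA = 1 / (1 + np.exp(-(W[:, 0] - 0.5)))           # true treatment mechanism
A = rng.binomial(1, pA)
Y = W[:, 0] + 2 * A + rng.normal(size=n)          # outcome; effect of A is 2

g = LogisticRegression().fit(W, A)                # estimate g(1 | W)
g1 = g.predict_proba(W)[:, 1]

psi_hat = np.mean(A * Y / g1) - Y.mean()          # IPTW estimate of E[Y_1] - E[Y]
print(round(psi_hat, 3))
```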

    An Application Of Machine Learning Methods To The Derivation Of Exposure-Response Curves For Respiratory Outcomes

    Analyses of epidemiological studies of the association between short-term changes in air pollution and health outcomes have not sufficiently discussed the degree to which the statistical models chosen for these analyses reflect what is actually known about the true data-generating distribution. We present a method to estimate population-level ambient air pollution (NO2) exposure-health (wheeze in children with asthma) response functions that does not depend on assumptions about the data-generating function underlying the observed data and that focuses on a specific scientific parameter of interest (the marginal adjusted association of exposure with the probability of wheeze, over a grid of possible exposure values). We show that this approach provides a more nuanced summary of the data than the statistical methods typically used in air pollution epidemiology and epidemiological studies in general.
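
    A sketch of the grid-based target parameter: the marginal adjusted probability of wheeze at each exposure value in a grid, computed by plugging that value into a flexible outcome regression and averaging over the observed confounders (a G-computation-style plug-in). The data, learner, and grid size are stand-in assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
n = 2000
conf = rng.normal(size=(n, 3))                       # confounders
no2 = rng.gamma(4, 5, size=n)                        # synthetic NO2 exposure
logit = 0.04 * no2 + conf[:, 0] - 2
wheeze = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([no2, conf])
fit = GradientBoostingClassifier().fit(X, wheeze)    # flexible outcome model

grid = np.linspace(no2.min(), no2.max(), 10)
curve = [fit.predict_proba(
             np.column_stack([np.full(n, a), conf]))[:, 1].mean()
         for a in grid]                              # marginal adjusted P(wheeze)
print(np.round(curve, 3))
```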

    Nonparametric population average models: deriving the form of approximate population average models estimated using generalized estimating equations

    For estimating regressions for repeated measures outcome data, a popular choice is the population average model estimated by generalized estimating equations (GEE). In this report we review the derivation of the robust inference (the sandwich-type estimator of the standard error). In addition, we present formally how the approximation given by a misspecified working population average model relates to the true model, and in turn how to interpret the results of such a misspecified model.
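
    A brief GEE example with the robust (sandwich) standard errors the report reviews, via statsmodels; the repeated measures data below are synthetic and the exchangeable working correlation is one illustrative choice.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_subj, n_rep = 100, 4
groups = np.repeat(np.arange(n_subj), n_rep)            # subject identifiers
x = rng.normal(size=n_subj * n_rep)
b = np.repeat(rng.normal(scale=0.5, size=n_subj), n_rep)  # within-subject correlation
y = 1 + 0.5 * x + b + rng.normal(size=n_subj * n_rep)

X = sm.add_constant(x)
res = sm.GEE(y, X, groups=groups,
             cov_struct=sm.cov_struct.Exchangeable()).fit()
print(res.params, res.bse)   # bse are sandwich-type (robust) by default
```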

    Efficacy Studies of Malaria Treatments in Africa: Efficient Estimation with Missing Indicators of Failure

    Efficacy studies of malaria treatments can be plagued by indeterminate outcomes for some patients. The study motivating this paper defines the outcome of interest (treatment failure) as recrudescence, and for some subjects it is unclear whether a recurrence of malaria is due to recrudescence or to new infection. This results in a specific kind of missing data. The effect of missing data in causal inference problems is widely recognized. Methods that adjust for possible bias from missing data include a variety of imputation procedures (extreme case analysis, hot-deck, single and multiple imputation), inverse weighting methods, and likelihood-based methods (data augmentation, EM procedures and their extensions). In this article, we focus on multiple imputation, two inverse weighting procedures (the inverse probability of censoring weighted (IPCW) and the doubly robust (DR) estimators), and a likelihood-based methodology (G-computation), comparing the methods' applicability to the efficient estimation of malaria treatment effects. We present results from a simulation study as well as results from a data analysis of malaria efficacy studies from Uganda.
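
    A sketch of the IPCW idea for indeterminate outcomes: re-weight patients whose failure status was resolved by the inverse of their estimated probability of being resolved, given covariates. The data and logistic missingness model are assumptions; the DR and G-computation estimators from the paper are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 3000
X = rng.normal(size=(n, 2))                              # baseline covariates
fail = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))       # true recrudescence
obs = rng.binomial(1, 1 / (1 + np.exp(-1 + X[:, 0])))    # outcome determinate?

pi = LogisticRegression().fit(X, obs).predict_proba(X)[:, 1]
ipcw = np.sum(obs * fail / pi) / n        # IPCW estimate of P(treatment failure)
naive = fail[obs == 1].mean()             # complete-case estimate (biased here)
print(round(ipcw, 3), round(naive, 3))
```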

    Transfer Learning With Efficient Estimators to Optimally Leverage Historical Data in Analysis of Randomized Trials

    Randomized controlled trials (RCTs) are a cornerstone of comparative effectiveness because they remove the confounding bias present in observational studies. However, RCTs are typically much smaller than observational studies because of financial and ethical considerations. It is therefore of great interest to be able to incorporate plentiful observational data into the analysis of smaller RCTs. Previous estimators developed for this purpose rely on unrealistic additional assumptions, without which the added data can bias the effect estimate. Recent work proposed an alternative method (prognostic adjustment) that imposes no additional assumptions and increases efficiency in the analysis of RCTs. The idea is to use the observational data to learn a prognostic model: a regression of the outcome onto the covariates. The predictions from this model, generated from the RCT subjects' baseline variables, are used as a covariate in a linear model. In this work, we extend this framework to inference with nonparametric efficient estimators in trial analysis. Using simulations, we find that this approach provides greater power (i.e., smaller standard errors) than analysis without prognostic adjustment, especially when the trial is small. We also find that the method is robust to observed or unobserved shifts between the observational and trial populations and does not introduce bias. Lastly, we showcase this estimator by leveraging real-world historical data in a randomized blood transfusion study of trauma patients.
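
    A sketch of prognostic adjustment: learn a prognostic score on large historical (observational) data, then use its predictions as a covariate when estimating the treatment effect in a small trial. All data below are synthetic, and a simple ANCOVA stands in for the paper's nonparametric efficient estimator.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Large historical cohort: outcome depends nonlinearly on covariates.
Wh = rng.normal(size=(20000, 3))
Yh = np.sin(Wh[:, 0]) + Wh[:, 1] ** 2 + rng.normal(size=20000)
prog = GradientBoostingRegressor().fit(Wh, Yh)           # prognostic model

# Small trial: randomized treatment with true effect 1.0.
n = 150
W = rng.normal(size=(n, 3))
A = rng.binomial(1, 0.5, size=n)
Y = np.sin(W[:, 0]) + W[:, 1] ** 2 + 1.0 * A + rng.normal(size=n)

# Prognostic score enters as a covariate alongside treatment.
X = sm.add_constant(np.column_stack([A, prog.predict(W)]))
print(sm.OLS(Y, X).fit().summary().tables[1])
```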

    The Impact Of Coarsening The Explanatory Variable Of Interest In Making Causal Inferences: Implicit Assumptions Behind Dichotomizing Variables

    It is common, in analyses designed to estimate the causal effect of a continuous exposure/treatment, to dichotomize the variable of interest. By dichotomizing the variable and assessing the causal effect of the newly fabricated variable, practitioners are implicitly making assumptions. However, in most analyses these assumptions are ignored. In this article we formally address what assumptions are made when dichotomizing variables to assess causal effects. We introduce two assumptions, either of which must be met for the estimates of the causal effects to be unbiased estimates of the parameters of interest. We term these the Mechanism Equivalence and Effect Equivalence assumptions. Furthermore, we quantify the bias induced when these assumptions are violated. Lastly, we present an analysis of a malaria study that exemplifies the danger of naively dichotomizing a continuous variable to assess a causal effect.
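
    A toy simulation of the underlying issue: dichotomizing a continuous exposure with a fixed smooth dose-response yields a "high vs. low" effect that depends on how exposure is distributed within each half, not only on the causal mechanism. The formal Mechanism and Effect Equivalence conditions from the paper are not encoded here.

```python
import numpy as np

rng = np.random.default_rng(8)
for scale in (1.0, 3.0):                      # two different exposure distributions
    x = rng.exponential(scale, size=100_000)  # continuous exposure
    y = 0.5 * x + rng.normal(size=x.size)     # identical linear mechanism
    z = (x > np.median(x)).astype(int)        # dichotomize at the median
    # The "effect" of the fabricated binary variable changes with the
    # exposure distribution even though the mechanism is unchanged.
    print(scale, round(y[z == 1].mean() - y[z == 0].mean(), 3))
```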