
    Targeted Learning on Variable Importance Measure for Heterogeneous Treatment Effect

    Quantifying the heterogeneity of treatment effect is important for understanding how a commercial product or medical treatment affects different subgroups in a population. Beyond the overall impact reflected by parameters like the average treatment effect, the analysis of treatment effect heterogeneity further reveals details on the importance of different covariates and how they lead to different treatment impacts. One relevant parameter that addresses such heterogeneity is the variance of the treatment effect across different covariate groups, however the treatment effect is defined. One can also derive variable importance parameters that measure (and rank) how much of the treatment effect heterogeneity is explained by a targeted subset of covariates. In this article, we propose a new targeted maximum likelihood estimator for a treatment effect variable importance measure. This estimator is a pure plug-in estimator that consists of two steps: 1) the initial estimation of relevant components to plug in, and 2) an iterative updating step to optimize the bias-variance tradeoff. Simulation results show that this TMLE has competitive performance, with lower bias and better confidence interval coverage than the simple substitution estimator and the estimating equation estimator. The application of this method also demonstrates the advantage of a substitution estimator, which always respects the global constraints on the data distribution and the fact that the estimand is a particular function of that distribution.
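    One plausible formalization of the variance parameter described above, assuming the treatment effect is defined as the conditional average treatment effect (CATE) given covariates V (the article's exact definitions may differ):

        \tau(v) = \mathbb{E}\big[Y(1) - Y(0) \mid V = v\big],
        \qquad
        \Psi(P) = \operatorname{Var}_P\big(\tau(V)\big),

    and, for a targeted covariate subset S, a variable importance measure

        \Psi_S(P) = \frac{\operatorname{Var}\big(\mathbb{E}[\tau(V) \mid V_S]\big)}{\operatorname{Var}\big(\tau(V)\big)},

    i.e., the proportion of treatment effect heterogeneity explained by V_S.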

    Practical targeted learning from large data sets by survey sampling

    We address the practical construction of asymptotic confidence intervals for smooth (i.e., path-wise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all the data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is then carried out on the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE) which enables the construction of the confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results. Keywords: semiparametric inference; survey sampling; targeted minimum loss estimation (TMLE).
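    A minimal sketch of the sub-sampling step, assuming inclusion probabilities proportional to a scalar summary measure (the paper instead optimizes them to reduce the asymptotic variance of the TMLE); the function and its interface are illustrative, not the authors' code:

        import numpy as np

        def poisson_rejective_sample(summaries, n, rng):
            """Fixed-size sub-sample by Poisson rejective sampling.

            Each unit i is included independently with probability pi_i
            (Poisson sampling); draws are rejected until the realized
            sample size equals n (the rejective part). A dedicated
            algorithm would be used at scale; the retry loop is a sketch.
            """
            pi = n * summaries / summaries.sum()   # unequal inclusion probabilities
            pi = np.clip(pi, 1e-6, 1.0)
            while True:
                keep = rng.random(len(summaries)) < pi
                if keep.sum() == n:                # accept only exact-size samples
                    return np.flatnonzero(keep), pi[keep]

        rng = np.random.default_rng(0)
        summaries = rng.gamma(2.0, 1.0, size=100_000)   # stand-in summary measures
        idx, pi_kept = poisson_rejective_sample(summaries, n=1_000, rng=rng)
        # The TMLE is then computed on the sub-sample, typically weighting
        # each retained observation by 1 / pi to account for the design.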

    Nonparametric variable importance assessment using machine learning techniques

    In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often resort to only a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often sub-optimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have different interpretations, comparisons across techniques can be difficult. In this work, we study a novel variable importance measure that can be used with any regression technique and whose interpretation is agnostic to the technique used. Specifically, we propose a generalization of the ANOVA variable importance measure and discuss how it facilitates the use of possibly complex machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. Using the tools of targeted learning, we also describe how to construct an efficient estimator of this measure, as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of median house prices in the Boston area and a study of risk factors for cardiovascular disease in South Africa.
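    One plausible form of the ANOVA-type importance measure described above for a feature group s, assuming squared-error loss (the paper's exact definition may differ):

        \psi_s = \frac{\mathbb{E}\big[\{\mu(X) - \mu_{-s}(X_{-s})\}^2\big]}{\operatorname{Var}(Y)},
        \qquad \mu(x) = \mathbb{E}(Y \mid X = x),

    where \mu_{-s} is the regression of Y on all features except those in s; \psi_s is then the additional proportion of outcome variance explained by including the group s, and both regression functions can be estimated with arbitrary machine learning techniques.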

    Variable Importance Analysis with the multiPIM R Package

    We describe the R package multiPIM, including its statistical background, functionality, and user options. The package is for variable importance analysis and is meant primarily for analyzing data from exploratory epidemiological studies, though it could certainly be applied in other areas as well. The approach taken to variable importance comes from the causal inference field and is different from approaches taken in other R packages. By default, multiPIM uses a double robust targeted maximum likelihood estimator (TMLE) of a parameter akin to the attributable risk. Several regression methods/machine learning algorithms are available for estimating the nuisance parameters of the models, including super learner, a meta-learner which combines several different algorithms into one. We describe a simulation in which the double robust TMLE is compared to the G-computation estimator. We also provide example analyses using two data sets which are included with the package.
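    As an illustrative reading of "a parameter akin to the attributable risk" (not necessarily the package's exact estimand), for an outcome Y, binary exposure A, and adjustment covariates W:

        \psi(P) = \mathbb{E}(Y) - \mathbb{E}_W\big[\mathbb{E}(Y \mid A = 0, W)\big],

    the difference between the observed mean outcome and the counterfactual mean outcome had everyone been unexposed, standardized over W.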

    A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

    The widespread availability of high-dimensional biological data has made the simultaneous screening of numerous biological characteristics a central statistical problem in computational biology. While the dimensionality of such datasets continues to increase, the problem of teasing out the effects of biomarkers in studies that measure baseline confounders, while avoiding model misspecification, remains only partially addressed. Efficient estimators constructed from data adaptive estimates of the data-generating distribution provide an avenue for avoiding model misspecification; however, in the context of high-dimensional problems requiring simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type I error control under standard multiple testing corrections. We present a general approach for applying empirical Bayes shrinkage to asymptotically linear estimators of parameters defined in the nonparametric model. The proposal applies existing shrinkage estimators to the estimated variance of the influence function, allowing for increased inferential stability in high-dimensional settings. A methodology for nonparametric variable importance analysis for use with high-dimensional biological datasets with modest sample sizes is introduced, and the proposed technique is demonstrated to be robust in small samples even when relying on data adaptive estimators that eschew parametric forms. The use of the proposed variance moderation strategy in constructing stabilized variable importance measures of biomarkers is demonstrated by application to an observational study of occupational exposure. The result is a data adaptive approach for robustly uncovering stable associations in high-dimensional data with limited sample sizes.
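    A concrete instance of such a shrinkage estimator is the limma-style moderated variance (Smyth-type empirical Bayes), here applied to the per-parameter influence function variance estimates s_j^2 (a representative choice; the general formulation above admits other shrinkage estimators):

        \tilde{s}_j^2 = \frac{d_0 s_0^2 + d_j s_j^2}{d_0 + d_j},

    where (d_0, s_0^2) are the prior degrees of freedom and prior variance estimated by empirical Bayes across all parameters j, and d_j is the residual degrees of freedom for parameter j; test statistics built from \tilde{s}_j^2 are more stable than those using s_j^2 alone when sample sizes are modest.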

    ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models

    Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern about their robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability to generate barely noticeable (to both humans and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable, so a target model can be attacked simply by training and attacking a substitute model built upon it, known as a black-box attack on DNNs. Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, rather than leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack, and importance sampling techniques to efficiently attack black-box models. By exploiting zeroth order optimization, improved attacks on the targeted DNN can be accomplished, sparing the need for training substitute models and avoiding the loss in attack transferability. Experimental results on MNIST, CIFAR-10, and ImageNet show that the proposed ZOO attack is as effective as the state-of-the-art white-box attack and significantly outperforms existing black-box attacks via substitute models. Comment: Accepted by the 10th ACM Workshop on Artificial Intelligence and Security (AISec), held with the 24th ACM Conference on Computer and Communications Security (CCS).
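    A minimal sketch of the coordinate-wise zeroth-order gradient estimation at the core of ZOO, omitting the paper's dimension reduction, hierarchical attack, and importance sampling; the objective f below is a toy stand-in for the attack loss built from the target DNN's confidence scores:

        import numpy as np

        def zoo_coordinate_step(f, x, lr=0.01, h=1e-4, rng=None):
            """One step of zeroth-order stochastic coordinate descent.

            Picks a random coordinate of x, estimates the partial derivative
            of the black-box loss f by a symmetric finite difference, and
            takes a gradient step along that coordinate only.
            """
            if rng is None:
                rng = np.random.default_rng()
            i = rng.integers(x.size)
            e = np.zeros_like(x)
            e.flat[i] = h
            g_i = (f(x + e) - f(x - e)) / (2 * h)   # two queries per step
            x_new = x.copy()
            x_new.flat[i] -= lr * g_i
            return x_new

        # Toy black-box objective standing in for the attack loss.
        target = np.full((8, 8), 0.5)
        f = lambda x: float(((x - target) ** 2).sum())

        x = np.zeros((8, 8))
        rng = np.random.default_rng(0)
        for _ in range(5_000):
            x = zoo_coordinate_step(f, x, rng=rng)

    Each step costs only two queries to the black-box model, which is why the paper's importance sampling and dimension reduction matter for making the attack query-efficient on large images.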

    Artificial Neural Networks applied to improve low-cost air quality monitoring precision

    It is a fact that air pollution is a major environmental health problem that affects everyone, especially in urban areas. Furthermore, the cost of high-end air pollution monitoring sensors is considerably high, so public administrations cannot afford to deploy a large number of measuring stations, leading to the loss of information that could be very helpful. Over the last few years, a large number of low-cost sensors have been released, but their use is often problematic due to selectivity and precision problems. A calibration process is needed, which amounts to a problem with many parameters and no clear relationship among them, a natural field of application for machine learning. The objectives of this project are, first, to integrate three low-cost air quality sensors with a Raspberry Pi and, second, to train an artificial neural network model that improves the precision of the readings made by the sensors.
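    A minimal sketch of the calibration step, assuming scikit-learn and synthetic data in place of the project's real sensor and reference-station readings (the sensor count, drift model, and network architecture below are illustrative assumptions, not the project's actual setup):

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Synthetic stand-in: raw readings from three low-cost sensors plus
        # temperature and humidity, against a reference-station target.
        rng = np.random.default_rng(42)
        n = 5_000
        temp = rng.normal(20, 8, n)
        hum = rng.uniform(20, 90, n)
        true_no2 = rng.gamma(4.0, 10.0, n)            # reference measurement
        raw = np.column_stack([
            true_no2 * (1 + 0.02 * (temp - 20)) + rng.normal(0, 5, n),  # sensor 1
            true_no2 * (1 - 0.01 * (hum - 50)) + rng.normal(0, 8, n),   # sensor 2
            0.8 * true_no2 + rng.normal(0, 6, n),                       # sensor 3
        ])
        X = np.column_stack([raw, temp, hum])

        X_tr, X_te, y_tr, y_te = train_test_split(X, true_no2, random_state=0)
        model = make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
        )
        model.fit(X_tr, y_tr)
        print(f"calibration R^2 on held-out data: {model.score(X_te, y_te):.3f}")

    The network learns the joint correction for cross-sensitivity to temperature and humidity that no single per-sensor formula captures, which is the motivation for using a neural network over a linear calibration.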