Targeted Learning on Variable Importance Measure for Heterogeneous Treatment Effect
Quantifying the heterogeneity of treatment effect is important for
understanding how a commercial product or medical treatment affects different
subgroups in a population. Beyond the overall impact reflected by parameters such as the average treatment effect, the analysis of treatment effect heterogeneity further reveals how important different covariates are and how they lead to different treatment impacts. One relevant parameter that addresses such heterogeneity is the variance of the treatment effect across covariate groups, however the treatment effect is defined. One can also derive variable
importance parameters that measure (and rank) how much of treatment effect
heterogeneity is explained by a targeted subset of covariates. In this article,
we propose a new targeted maximum likelihood estimator for a treatment effect
variable importance measure. This estimator is a pure plug-in estimator that
consists of two steps: 1) the initial estimation of relevant components to plug
in and 2) an iterative updating step to optimize the bias-variance tradeoff.
Simulation results show that this TMLE has competitive performance, with lower bias and better confidence interval coverage than both the simple substitution estimator and the estimating equation estimator. The application of this method also demonstrates the advantage of a substitution estimator: it always respects the global constraints on the data distribution and the fact that the estimand is a particular function of that distribution.
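The variance-explained flavor of this variable importance parameter can be illustrated with a simple plug-in computation. The sketch below is illustrative only: the data are synthetic, the CATE is taken as known rather than estimated, there is no TMLE targeting step, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical covariates: x1 drives most of the heterogeneity, x2 adds a little.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Suppose the conditional average treatment effect (CATE) were known exactly;
# in practice it would be estimated, e.g. via separate outcome regressions.
tau = 2.0 * x1 + 0.5 * x2

# Plug-in variable importance of x1: share of CATE variance explained by
# E[tau | x1], approximated here by quantile-binning x1 (no targeting step).
edges = np.quantile(x1, np.linspace(0, 1, 51)[1:-1])
bins = np.digitize(x1, edges)
cond_mean = np.zeros(n)
for b in np.unique(bins):
    mask = bins == b
    cond_mean[mask] = tau[mask].mean()

# Theoretical value here: Var(2*x1) / Var(tau) = 4 / 4.25; binning loses a little.
psi = cond_mean.var() / tau.var()
print(round(psi, 2))
```

A TMLE for this parameter would add the iterative updating step described above on top of such an initial plug-in.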
Practical targeted learning from large data sets by survey sampling
We address the practical construction of asymptotic confidence intervals for
smooth (i.e., path-wise differentiable), real-valued statistical parameters by
targeted learning from independent and identically distributed data in contexts
where sample size is so large that it poses computational challenges. We
observe some summary measure of all data and select a sub-sample from the
complete data set by Poisson rejective sampling with unequal inclusion
probabilities based on the summary measures. Targeted learning is carried out
from the easier-to-handle sub-sample. We derive a central limit theorem for the
targeted minimum loss estimator (TMLE) which enables the construction of the
confidence intervals. The inclusion probabilities can be optimized to reduce
the asymptotic variance of the TMLE. We illustrate the procedure with two
examples where the parameters of interest are variable importance measures of
an exposure (binary or continuous) on an outcome. We also conduct a simulation
study and comment on its results.
Keywords: semiparametric inference; survey sampling; targeted minimum loss estimation (TMLE).
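The sub-sampling step can be sketched as follows: Poisson rejective sampling with unequal inclusion probabilities, followed by Horvitz-Thompson weighting of the sub-sample. This is a minimal illustration on synthetic data that assumes inclusion probabilities proportional to the observed summary measure; it is not the paper's optimized procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000

# Hypothetical summary measure observed on the full data set.
summary = rng.exponential(size=N)
y = 3.0 * summary + rng.normal(size=N)  # outcome of interest

# Unequal inclusion probabilities proportional to the summary measure,
# scaled to an expected sub-sample size of m.
m = 2_000
pi = np.clip(m * summary / summary.sum(), 1e-6, 1.0)

# Poisson rejective sampling: redraw independent Bernoulli inclusions until
# the realized size matches the target (here: within a small tolerance).
while True:
    selected = rng.random(N) < pi
    if abs(selected.sum() - m) <= 50:
        break

# Horvitz-Thompson estimate of the population mean of y from the sub-sample.
ht_mean = np.sum(y[selected] / pi[selected]) / N
print(ht_mean, y.mean())
```

Because the inclusion probabilities here track the outcome's main driver, the weighted sub-sample estimate stays close to the full-data mean at a fraction of the computational cost.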
Nonparametric variable importance assessment using machine learning techniques
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often resort to only one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these techniques are often sub-optimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have different interpretations, comparisons across techniques can be difficult. In this work, we study a novel variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. Specifically, we propose a generalization of the ANOVA variable importance measure, and discuss how it facilitates the use of possibly complex machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. Using the tools of targeted learning, we also describe how to construct an efficient estimator of this measure, as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of median house prices in the Boston area and a study of risk factors for cardiovascular disease in South Africa.
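The ANOVA-style idea can be sketched with a plug-in estimate: fit a regression with and without the feature of interest, and scale the mean squared difference of the fits by the outcome variance. The sketch below uses ordinary least squares purely for brevity (the measure itself is learner-agnostic) and omits the targeted-learning efficiency corrections; all data and names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 3))
# Hypothetical signal: x0 matters a lot, x1 a little, x2 not at all.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def fit_predict(Xa):
    # Stand-in regression: ordinary least squares with an intercept; any
    # learner (e.g. gradient boosting) could be substituted here.
    Xa1 = np.column_stack([np.ones(len(Xa)), Xa])
    beta, *_ = np.linalg.lstsq(Xa1, y, rcond=None)
    return Xa1 @ beta

mu_full = fit_predict(X)

def importance(drop):
    keep = [j for j in range(X.shape[1]) if j != drop]
    mu_reduced = fit_predict(X[:, keep])
    # ANOVA-style importance: variation in the full fit that is lost
    # once the feature is removed, scaled by Var(Y).
    return np.mean((mu_full - mu_reduced) ** 2) / y.var()

print([round(importance(j), 2) for j in range(3)])
```

The importance of x0 should come out near 4/5.25 and that of the irrelevant x2 near zero, matching the interpretation as a proportion of outcome variance explained.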
Variable Importance Analysis with the multiPIM R Package
We describe the R package multiPIM, including its statistical background, functionality, and user options. The package is for variable importance analysis and is meant primarily for analyzing data from exploratory epidemiological studies, though it could certainly be applied in other areas as well. The approach taken to variable importance comes from the causal inference field and differs from the approaches taken in other R packages. By default, multiPIM uses a double robust targeted maximum likelihood estimator (TMLE) of a parameter akin to the attributable risk. Several regression methods/machine learning algorithms are available for estimating the nuisance parameters of the models, including super learner, a meta-learner which combines several different algorithms into one. We describe a simulation in which the double robust TMLE is compared to the G-computation estimator, and we provide example analyses using two data sets which are included with the package.
A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology
The widespread availability of high-dimensional biological data has made the
simultaneous screening of numerous biological characteristics a central
statistical problem in computational biology. While the dimensionality of such
datasets continues to increase, the problem of teasing out the effects of
biomarkers in studies measuring baseline confounders while avoiding model
misspecification remains only partially addressed. Efficient estimators
constructed from data adaptive estimates of the data-generating distribution
provide an avenue for avoiding model misspecification; however, in the context
of high-dimensional problems requiring simultaneous estimation of numerous
parameters, standard variance estimators have proven unstable, resulting in
unreliable Type-I error control under standard multiple testing corrections. We
present the formulation of a general approach for applying empirical Bayes
shrinkage approaches to asymptotically linear estimators of parameters defined
in the nonparametric model. The proposal applies existing shrinkage estimators
to the estimated variance of the influence function, allowing for increased
inferential stability in high-dimensional settings. A methodology for
nonparametric variable importance analysis for use with high-dimensional
biological datasets with modest sample sizes is introduced and the proposed
technique is demonstrated to be robust in small samples even when relying on
data adaptive estimators that eschew parametric forms. Use of the proposed
variance moderation strategy in constructing stabilized variable importance
measures of biomarkers is demonstrated by application to an observational study
of occupational exposure. The result is a data adaptive approach for robustly
uncovering stable associations in high-dimensional data with limited sample
sizes.
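The variance-moderation step can be sketched as a limma-style shrinkage of per-parameter influence-function variance estimates toward a shared prior scale. The snippet below is an illustrative simplification: the prior degrees of freedom and scale are crude plug-ins chosen for the example, not the estimators the method would actually use.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 1_000  # modest sample size, many biomarkers

# Hypothetical per-parameter sample variances of the estimated influence
# functions (one asymptotically linear estimator per biomarker); simulated
# here as scaled chi-square draws around a common true variance of 1.
s2 = rng.chisquare(n - 1, size=p) / (n - 1)

# Scaled-inverse-chi-square prior (d0, s0^2) in the spirit of moderated
# statistics. Illustrative plug-ins: an assumed d0 and the mean as scale.
d0 = 4.0
s0_sq = s2.mean()

# Posterior (moderated) variance: a convex combination of each
# per-parameter estimate and the shared prior scale.
d = n - 1
s2_moderated = (d0 * s0_sq + d * s2) / (d0 + d)

# Shrinkage pulls extreme variance estimates toward the common value,
# stabilizing Wald-type tests under multiple testing corrections.
print(s2.std(), s2_moderated.std())
```

The moderated variances keep the same center but a visibly smaller spread, which is the source of the improved Type-I error control in small samples.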
ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models
Deep neural networks (DNNs) are one of the most prominent technologies of our
time, as they achieve state-of-the-art performance in many machine learning
tasks, including but not limited to image classification, text mining, and
speech processing. However, recent research has indicated ever-increasing concern about the robustness of DNNs to adversarial examples, especially
for security-critical tasks such as traffic sign identification for autonomous
driving. Studies have unveiled the vulnerability of a well-trained DNN by
demonstrating the ability to generate adversarial images, barely noticeable to both humans and machines, that lead to misclassification. Furthermore,
researchers have shown that these adversarial images are highly transferable by
simply training and attacking a substitute model built upon the target model,
known as a black-box attack to DNNs.
Similar to the setting of training substitute models, in this paper we
propose an effective black-box attack that also only has access to the input
(images) and the output (confidence scores) of a targeted DNN. However,
different from leveraging attack transferability from substitute models, we
propose zeroth order optimization (ZOO) based attacks to directly estimate the
gradients of the targeted DNN for generating adversarial examples. We use
zeroth order stochastic coordinate descent along with dimension reduction,
hierarchical attack and importance sampling techniques to efficiently attack
black-box models. By exploiting zeroth order optimization, improved attacks to
the targeted DNN can be accomplished, sparing the need for training substitute
models and avoiding the loss in attack transferability. Experimental results on
MNIST, CIFAR10 and ImageNet show that the proposed ZOO attack is as effective
as the state-of-the-art white-box attack and significantly outperforms existing
black-box attacks via substitute models.
Comment: Accepted by the 10th ACM Workshop on Artificial Intelligence and Security (AISEC), held with the 24th ACM Conference on Computer and Communications Security (CCS).
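The core ZOO idea, estimating directional derivatives of a black-box loss via symmetric finite differences and descending one coordinate at a time, can be sketched on a toy objective. The quadratic "model" below is a stand-in for the attack loss built from a DNN's confidence scores; the paper's Adam-style updates, hierarchical attack, dimension reduction, and importance sampling are all omitted.

```python
import numpy as np

# Black-box "model": we can only query the loss, not its gradients.
# (A stand-in for an attack loss built from a targeted DNN's scores.)
def loss(x):
    return float(np.sum((x - 1.0) ** 2))

def zoo_coordinate_step(x, i, h=1e-4, eta=0.1):
    # Symmetric finite difference along coordinate i: the zeroth-order
    # gradient estimate used by ZOO-style attacks.
    e = np.zeros_like(x)
    e[i] = h
    g = (loss(x + e) - loss(x - e)) / (2 * h)
    x = x.copy()
    x[i] -= eta * g  # plain coordinate descent (no Adam, for brevity)
    return x

rng = np.random.default_rng(4)
x = rng.normal(size=8)
for _ in range(200):
    i = int(rng.integers(len(x)))  # uniform; importance sampling would bias this
    x = zoo_coordinate_step(x, i)
print(round(loss(x), 3))
```

Each step costs only two loss queries, which is why coordinate-wise zeroth-order estimation scales to query-limited black-box settings.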
Artificial Neural Networks applied to improve low-cost air quality monitoring precision
It is a fact that air pollution is a major environmental health problem that affects everyone, especially in urban areas. Moreover, high-end air pollution monitoring sensors are considerably expensive, so public administrations cannot afford to deploy a large number of measuring stations, losing information that could be very helpful. Over the last few years, a large number of low-cost sensors have been released, but their use is often problematic due to selectivity and precision issues. A calibration process is needed to solve a problem involving many parameters with no clear relationship among them, which is a natural field of application for machine learning. The objectives of this project are, first, to integrate three low-cost air quality sensors into a Raspberry Pi and, second, to train an artificial neural network model that improves the precision of the readings made by the sensors.
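The calibration step can be sketched end to end on synthetic data: simulate noisy, cross-sensitive low-cost sensor readings, then train a small neural network to recover the reference value. The hand-rolled one-hidden-layer network below is a stand-in for the project's ANN (real code would use a deep learning framework), and all signal shapes are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4_000

# Synthetic stand-in: three low-cost sensor readings that are noisy,
# nonlinear, cross-sensitive functions of the true concentration.
truth = rng.uniform(0, 1, size=n)
raw = np.column_stack([
    truth + 0.3 * np.sin(6 * truth) + 0.05 * rng.normal(size=n),
    0.8 * truth ** 2 + 0.05 * rng.normal(size=n),
    truth + 0.1 * rng.normal(size=n),
])

# Minimal one-hidden-layer network trained by full-batch gradient descent.
h = 16
W1 = rng.normal(scale=0.5, size=(3, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.5, size=(h, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(2000):
    z = np.tanh(raw @ W1 + b1)
    pred = (z @ W2 + b2).ravel()
    err = pred - truth
    # Backpropagation of the mean squared error.
    g_pred = (2 / n) * err[:, None]
    gW2 = z.T @ g_pred; gb2 = g_pred.sum(0)
    g_z = (g_pred @ W2.T) * (1 - z ** 2)
    gW1 = raw.T @ g_z; gb1 = g_z.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Compare a raw sensor's error against the calibrated prediction.
mse_raw = np.mean((raw[:, 0] - truth) ** 2)
mse_ann = np.mean((pred - truth) ** 2)
print(mse_raw, mse_ann)
```

Because the network can exploit all three cross-sensitive channels at once, the calibrated output tracks the reference value more precisely than any single raw reading.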