Testing the order of a model
This paper deals with order identification for nested models in the i.i.d.
framework. We study the asymptotic efficiency of two generalized likelihood
ratio tests of the order. They are based on two estimators which are proved to
be strongly consistent. A version of Stein's lemma yields an optimal
underestimation error exponent. The lemma also implies that the overestimation
error exponent is necessarily trivial. Our tests admit nontrivial
underestimation error exponents. The optimal underestimation error exponent is
achieved in some situations. The overestimation error can decay exponentially
with respect to a positive power of the number of observations. These results
are proved under mild assumptions by relating the underestimation (resp.
overestimation) error to large (resp. moderate) deviations of the
log-likelihood process. In particular, it is not necessary that the classical
Cramér condition be satisfied; namely, the log-densities are not required to
admit every exponential moment. Three benchmark examples with
specific difficulties (location mixture of normal distributions, abrupt changes
and various regressions) are detailed so as to illustrate the generality of our
results. (Published at http://dx.doi.org/10.1214/009053606000000344 in the
Annals of Statistics, http://www.imstat.org/aos/, by the Institute of
Mathematical Statistics, http://www.imstat.org.)
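As a quick illustration of the kind of procedure studied here, the sketch below runs a penalized generalized likelihood ratio comparison of nested mixture orders on simulated data from a two-component location mixture of normals (one of the paper's benchmark examples). The BIC-like penalty and the use of scikit-learn's EM fitting are our own illustrative choices, not the paper's calibration.

```python
# Penalized GLR order selection for a 1-D Gaussian mixture: a minimal sketch.
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_order(x, k_max=6, seed=0):
    """Smallest order whose penalized log-likelihood beats every larger order."""
    x = np.asarray(x).reshape(-1, 1)
    n = len(x)
    # Maximized log-likelihood of a k-component Gaussian mixture, k = 1..k_max.
    loglik = [n * GaussianMixture(k, n_init=5, random_state=seed).fit(x).score(x)
              for k in range(1, k_max + 1)]
    # BIC-like penalty: 3k - 1 free parameters in a 1-D location-scale mixture.
    pen = lambda k: 0.5 * (3 * k - 1) * np.log(n)
    for k in range(1, k_max):
        if all(loglik[j] - loglik[k - 1] <= pen(j + 1) - pen(k)
               for j in range(k, k_max)):
            return k
    return k_max

rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])
print(estimate_order(sample))  # expected output: 2
```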
Faster Rates for Policy Learning
This article improves the existing proven rates of regret decay in optimal
policy estimation. We give a margin-free result showing that the regret decay
for estimating a within-class optimal policy is second-order for empirical risk
minimizers over Donsker classes, with regret decaying at a faster rate than the
standard error of an efficient estimator of the value of an optimal policy. We
also give a result from the classification literature that shows that faster
regret decay is possible via plug-in estimation provided a margin condition
holds. Four examples are considered. In these examples, the regret is expressed
in terms of either the mean value or the median value; the number of possible
actions is either two or finitely many; and the sampling scheme is either
independent and identically distributed or sequential, where the latter
represents a contextual bandit sampling scheme.
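To make the plug-in route concrete, here is a minimal sketch (with simulated data and an arbitrarily chosen regression model, both our own assumptions) of estimating a within-class optimal policy in the two-action iid setting: regress the outcome on (context, action), act greedily, and approximate the regret by Monte Carlo, which is possible only because the truth is known in the simulation.

```python
# Plug-in policy estimation and Monte Carlo regret: a minimal sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
W = rng.uniform(-1, 1, size=(n, 1))          # contexts
A = rng.integers(0, 2, size=n)               # randomized binary actions
mu = lambda w, a: (2 * a - 1) * w[:, 0]      # true mean outcome E[Y | W, A]
Y = mu(W, A) + rng.normal(0, 1, size=n)

# Plug-in step: estimate E[Y | W, A], then d_hat(w) = argmax_a muhat(w, a).
muhat = GradientBoostingRegressor().fit(np.column_stack([W, A]), Y)

def d_hat(w):
    m0 = muhat.predict(np.column_stack([w, np.zeros(len(w))]))
    m1 = muhat.predict(np.column_stack([w, np.ones(len(w))]))
    return (m1 > m0).astype(int)

# Regret relative to the optimal policy d*(w) = 1{w > 0}, on fresh contexts.
W_test = rng.uniform(-1, 1, size=(10**5, 1))
a_hat, a_star = d_hat(W_test), (W_test[:, 0] > 0).astype(int)
print(f"estimated regret: {np.mean(mu(W_test, a_star) - mu(W_test, a_hat)):.4f}")
```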
Classification in postural style
This article contributes to the search for a notion of postural style,
focusing on the issue of classifying subjects in terms of how they maintain
posture. Longer term, the hope is to make it possible to determine on a
case-by-case basis which sensory information is prevalent in postural control, and to
improve/adapt protocols for functional rehabilitation among those who show
deficits in maintaining posture, typically seniors. Here, we specifically
tackle the statistical problem of classifying subjects sampled from a two-class
population. Each subject (enrolled in a cohort of 54 participants) undergoes
four experimental protocols which are designed to evaluate potential deficits
in maintaining posture. These protocols result in four complex trajectories,
from which we can extract four low-dimensional summary measures. Because
undergoing several protocols can be unpleasant, and sometimes painful, we try
to limit the number of protocols needed for the classification. Therefore, we
first rank the protocols by decreasing order of relevance, then we derive four
plug-in classifiers which involve the best (i.e., most informative), the two
best, the three best and all four protocols. This two-step procedure relies on
the cutting-edge methodologies of targeted maximum likelihood learning (a
methodology for robust and efficient inference) and super-learning (a machine
learning procedure for aggregating various estimation procedures into a single
better estimation procedure). A simulation study is carried out. The
performance of the procedure applied to the real data set (evaluated by the
leave-one-out rule) reaches an 87% rate of correct classification (47 out of 54
subjects correctly classified), using only the best protocol. (Published at
http://dx.doi.org/10.1214/12-AOAS542 in the Annals of Applied Statistics,
http://www.imstat.org/aoas/, by the Institute of Mathematical Statistics,
http://www.imstat.org.)
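For readers unfamiliar with super-learning, the sketch below illustrates its simplest (discrete) form: candidate classifiers are ranked by cross-validated risk and the best one is retained; the full super learner instead minimizes the cross-validated risk over convex combinations of the candidates. The candidate library, features and sample size are placeholders, not the study's actual setup.

```python
# Discrete super learner: pick the candidate with the best cross-validated risk.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def discrete_super_learner(X, y, candidates, cv=10):
    # Cross-validated negative log-likelihood risk of each candidate algorithm.
    risks = [-cross_val_score(c, X, y, cv=cv, scoring="neg_log_loss").mean()
             for c in candidates]
    best = candidates[int(np.argmin(risks))]
    return best.fit(X, y), risks

rng = np.random.default_rng(0)
X = rng.normal(size=(54, 4))                 # 54 subjects, 4 summary measures
y = (X[:, 0] + rng.normal(size=54) > 0).astype(int)
learner, risks = discrete_super_learner(
    X, y, [LogisticRegression(), RandomForestClassifier(), KNeighborsClassifier()])
print(risks)
```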
Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits
We study a generalization of the multi-armed bandit problem with multiple
plays where there is a cost associated with pulling each arm and the agent has
a budget at each time that dictates how much she can expect to spend. We derive
an asymptotic regret lower bound for any uniformly efficient algorithm in our
setting. We then study a variant of Thompson sampling for Bernoulli rewards and
a variant of KL-UCB for both single-parameter exponential families and bounded,
finitely supported rewards. We show that these algorithms are asymptotically
optimal, both in rate and in leading problem-dependent constants, including in
the thick margin setting where multiple arms fall on the decision boundary.
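For concreteness, here is a minimal sketch of the KL-UCB index for Bernoulli rewards, the building block of the KL-UCB variant mentioned above; the budget and cost bookkeeping specific to this paper's setting is omitted, and the exploration rate log(t) is one standard choice.

```python
# KL-UCB index for a Bernoulli arm, computed by bisection.
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clipping."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t):
    """Largest q >= p_hat with n_pulls * kl(p_hat, q) <= log(t)."""
    target = np.log(max(t, 2)) / n_pulls
    lo, hi = p_hat, 1.0
    for _ in range(50):                       # bisection to high precision
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(p_hat=0.4, n_pulls=25, t=1000))  # upper confidence index > 0.4
```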
Practical targeted learning from large data sets by survey sampling
We address the practical construction of asymptotic confidence intervals for
smooth (i.e., path-wise differentiable), real-valued statistical parameters by
targeted learning from independent and identically distributed data in contexts
where sample size is so large that it poses computational challenges. We
observe some summary measure of all data and select a sub-sample from the
complete data set by Poisson rejective sampling with unequal inclusion
probabilities based on the summary measures. Targeted learning is carried out
from the easier-to-handle sub-sample. We derive a central limit theorem for the
targeted minimum loss estimator (TMLE) which enables the construction of the
confidence intervals. The inclusion probabilities can be optimized to reduce
the asymptotic variance of the TMLE. We illustrate the procedure with two
examples where the parameters of interest are variable importance measures of
an exposure (binary or continuous) on an outcome. We also conduct a simulation
study and comment on its results. Keywords: semiparametric inference; survey
sampling; targeted minimum loss estimation (TMLE).
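The sketch below illustrates the sampling step under simplifying assumptions: plain Poisson sampling (the paper's rejective variant additionally conditions on the realized sample size) with inclusion probabilities driven by a scalar summary measure, followed by the Horvitz-Thompson reweighting that downstream estimation must apply. All data and names are illustrative.

```python
# Poisson sampling with unequal inclusion probabilities + Horvitz-Thompson mean.
import numpy as np

rng = np.random.default_rng(0)
N = 10**6
summary = rng.exponential(size=N)            # summary measure observed for all units
y = 2.0 * summary + rng.normal(size=N)       # full-data variable of interest

target_size = 10_000
# Inclusion probabilities proportional to the summary measure, capped at 1.
pi = np.minimum(1.0, target_size * summary / summary.sum())
keep = rng.uniform(size=N) < pi              # independent Bernoulli inclusion draws

# Horvitz-Thompson estimate of the full-data mean from the sub-sample alone.
ht_mean = np.sum(y[keep] / pi[keep]) / N
print(f"sampled {keep.sum()} of {N}; HT mean {ht_mean:.4f} vs truth {y.mean():.4f}")
```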
Quantile Super Learning for independent and online settings with application to solar power forecasting
Estimating quantiles of an outcome conditional on covariates is of
fundamental interest in statistics, with broad applications in probabilistic
prediction and forecasting. We propose an ensemble method for conditional
quantile estimation, Quantile Super Learning, that combines predictions from
multiple candidate algorithms based on their empirical performance measured
with respect to a cross-validated empirical risk of the quantile loss function.
We present theoretical guarantees for both iid and online data scenarios. The
performance of our approach for quantile estimation and in forming prediction
intervals is tested in simulation studies. Two case studies related to solar
energy are used to illustrate Quantile Super Learning: in an iid setting, we
predict the physical properties of perovskite materials for photovoltaic cells,
and in an online setting we forecast ground solar irradiance based on output
from dynamic weather ensemble models.
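As a minimal illustration of the ingredients named above, the sketch below defines the quantile (pinball) loss and compares two candidate conditional-quantile learners by cross-validated risk; the data, the candidates and the fold scheme are our own illustrative choices, and neither the paper's aggregation weights nor its online variant are reproduced.

```python
# Pinball loss and cross-validated comparison of quantile learners: a sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def pinball_loss(y, q_pred, tau):
    """Quantile loss at level tau; minimized by the true conditional quantile."""
    u = y - q_pred
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))
y = X[:, 0] + rng.normal(0, 0.2 + 0.5 * X[:, 0], size=1000)  # heteroscedastic noise

tau = 0.9
candidates = {
    "gbm_quantile": GradientBoostingRegressor(loss="quantile", alpha=tau),
    "gbm_mean": GradientBoostingRegressor(),   # misspecified: targets the mean
}
risks = {}
for name, model in candidates.items():
    fold_losses = []
    for tr, te in KFold(5, shuffle=True, random_state=0).split(X):
        pred = model.fit(X[tr], y[tr]).predict(X[te])
        fold_losses.append(pinball_loss(y[te], pred, tau))
    risks[name] = np.mean(fold_losses)
print(risks)   # the genuine quantile learner should show the smaller CV risk
```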
AdaptiveConformal: An R Package for Adaptive Conformal Inference
Conformal Inference (CI) is a popular approach for generating finite sample
prediction intervals based on the output of any point prediction method when
data are exchangeable. Adaptive Conformal Inference (ACI) algorithms extend CI
to the case of sequentially observed data, such as time series, and exhibit
strong theoretical guarantees without having to assume exchangeability of the
observed data. The common thread that unites algorithms in the ACI family is
that they adaptively adjust the width of the generated prediction intervals in
response to the observed data. We provide a detailed description of five ACI
algorithms and their theoretical guarantees, and test their performance in
simulation studies. We then present a case study of producing prediction
intervals for influenza incidence in the United States based on black-box point
forecasts. Implementations of all the algorithms are released as an open-source
R package, AdaptiveConformal, which also includes tools for visualizing and
summarizing conformal prediction intervals.
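The sketch below shows the simplest member of the ACI family, the update of Gibbs and Candès (2021), in which the working miscoverage level is nudged after each observation and the interval half-width is an empirical quantile of past absolute residuals; it is written in Python for brevity, whereas the AdaptiveConformal package itself is an R implementation.

```python
# Adaptive Conformal Inference with absolute-residual scores: a minimal sketch.
import numpy as np

def aci_intervals(y, point_forecasts, alpha=0.1, gamma=0.01, warmup=30):
    scores, intervals = [], []
    alpha_t = alpha
    for t, (yt, ft) in enumerate(zip(y, point_forecasts)):
        if t >= warmup:
            # Half-width: empirical (1 - alpha_t) quantile of past scores.
            q = np.quantile(scores, min(max(1 - alpha_t, 0.0), 1.0))
            lo, hi = ft - q, ft + q
            intervals.append((lo, hi))
            err = 0.0 if lo <= yt <= hi else 1.0
            alpha_t += gamma * (alpha - err)   # the ACI update
        scores.append(abs(yt - ft))            # conformity score of observation t
    return intervals

rng = np.random.default_rng(0)
y = np.sin(np.arange(500) / 20) + rng.normal(0, 0.3, 500)
forecasts = np.sin(np.arange(500) / 20)        # stand-in black-box point forecasts
ivals = aci_intervals(y, forecasts)
cover = np.mean([lo <= yt <= hi for (lo, hi), yt in zip(ivals, y[30:])])
print(f"empirical coverage: {cover:.3f}")      # should be close to 0.9
```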
Estimation and Testing in Targeted Group Sequential Covariate-adjusted Randomized Clinical Trials
This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials analyzed through the prism of the semiparametric methodology of targeted maximum likelihood estimation (TMLE). We show how to build, as the data accrue group-sequentially, a sampling design which targets a user-supplied optimal design. We also show how to carry out sound TMLE statistical inference based on such an adaptive sampling scheme (thereby extending some results known so far in the i.i.d. setting only), and how group-sequential testing applies on top of it. The procedure is robust (i.e., consistent even if the working model is misspecified). A simulation study confirms the theoretical results, and validates the conjecture that the procedure may also be efficient.
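As a side illustration of the "group-sequential testing on top" ingredient, the sketch below calibrates a constant (Pocock-type) two-sided boundary by Monte Carlo under an idealized Gaussian random-walk model of the interim test statistics; this shortcut is for intuition only and is not the paper's construction.

```python
# Monte Carlo calibration of a Pocock-type group-sequential boundary.
import numpy as np

def pocock_boundary(K=4, alpha=0.05, n_sim=200_000, seed=0):
    rng = np.random.default_rng(seed)
    # Under H0 with equally spaced information, the statistics at the K looks
    # behave like a standardized Gaussian random walk: Z_k = (W_1+...+W_k)/sqrt(k).
    increments = rng.normal(size=(n_sim, K))
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, K + 1))
    # Constant c with P(max_k |Z_k| > c) = alpha under the null.
    return np.quantile(np.abs(z).max(axis=1), 1 - alpha)

print(pocock_boundary())  # about 2.36 for K=4, alpha=0.05 (vs. 1.96 for one look)
```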
Targeting The Optimal Design In Randomized Clinical Trials With Binary Outcomes And No Covariate
This article is devoted to the asymptotic study of adaptive group sequential designs in the case of randomized clinical trials with binary treatment, binary outcome and no covariate. By adaptive design, we mean in this setting a clinical trial design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. By adaptive group sequential design, we refer to the fact that group sequential testing methods can be equally well applied on top of adaptive designs. Prior to collection of the data, the trial protocol specifies the parameter of scientific interest. In the estimation framework, the trial protocol also specifies a priori the confidence level to be used in constructing frequentist confidence intervals for the latter parameter and the related inferential method, which will be based on the maximum likelihood principle. In the testing framework, the trial protocol also specifies a priori the null and alternative hypotheses regarding the latter parameter, the desired type I and type II errors, the rule for determining the maximal statistical information to be accrued, and the frequentist testing procedure, including conditions for early stopping. Furthermore, we assume that the protocol specifies a user-supplied criterion of optimality for the (unknown) randomization scheme; we focus on the randomization scheme that minimizes the asymptotic variance of the maximum likelihood estimator of the parameter of interest.
We show that, theoretically, the adaptive design converges almost surely to the targeted unknown randomization scheme. In the estimation framework, we show that our maximum likelihood estimator of the parameter of interest is strongly consistent and satisfies a central limit theorem. We can estimate its asymptotic variance, which is the same as the one it would feature had we known the targeted randomization scheme in advance and sampled independently from it. Consequently, inference can be carried out as if we had resorted to independent and identically distributed (iid) sampling. In the testing framework, we show that the multidimensional t-statistic that we would use under iid sampling still converges to the same canonical distribution under adaptive sampling. Consequently, the same group sequential testing can be carried out as if we had resorted to iid sampling. Furthermore, a comprehensive simulation study validates the theory. It notably shows, in the estimation framework, that the confidence intervals we obtain achieve the desired coverage even for moderate sample sizes. In addition, it shows, in the testing framework, that type I error control at the prescribed level is guaranteed, and that all sampling procedures suffer only a very slight increase of the type II error.
A three-sentence take-home message is: adaptive designs do learn the targeted optimal design, and inference and testing can be carried out under adaptive sampling just as they would under iid sampling from the targeted optimal randomization probability. In particular, adaptive designs achieve the same efficiency as the fixed oracle design. This is confirmed by a simulation study, at least for moderate and large sample sizes, across a large collection of targeted randomization probabilities.
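To fix ideas, the sketch below implements one concrete instance of the adaptive design described here: the randomization probability is updated group-sequentially toward the Neyman allocation, which minimizes the asymptotic variance of the risk-difference estimator by allocating in proportion to the per-arm standard deviations. The batch size and clipping bounds are our own illustrative safeguards.

```python
# Adaptive randomization targeting the Neyman allocation: a minimal sketch.
import numpy as np

def neyman_allocation(p1_hat, p0_hat, lo=0.1, hi=0.9):
    """Treatment probability proportional to the per-arm standard deviations."""
    s1 = np.sqrt(p1_hat * (1 - p1_hat))
    s0 = np.sqrt(p0_hat * (1 - p0_hat))
    g = 0.5 if s1 + s0 == 0 else s1 / (s1 + s0)
    return min(max(g, lo), hi)               # clipped away from 0 and 1

rng = np.random.default_rng(0)
p1_true, p0_true, g = 0.5, 0.95, 0.5         # unknown arm success probabilities
arms, outcomes = [], []
for patient in range(2000):
    a = rng.uniform() < g                    # randomize the next patient
    y = rng.uniform() < (p1_true if a else p0_true)
    arms.append(a); outcomes.append(y)
    if (patient + 1) % 100 == 0:             # update group-sequentially
        a_arr, y_arr = np.array(arms), np.array(outcomes)
        p1_hat = y_arr[a_arr].mean() if a_arr.any() else 0.5
        p0_hat = y_arr[~a_arr].mean() if (~a_arr).any() else 0.5
        g = neyman_allocation(p1_hat, p0_hat)
print(f"final randomization probability: {g:.3f}")  # Neyman target is about 0.70
```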
Estimation of a non-parametric variable importance measure of a continuous exposure
We define a new measure of variable importance of an exposure on a continuous outcome, accounting for potential confounders. The exposure features a reference level x0 with positive mass and a continuum of other levels. For the purpose of estimating it, we fully develop the semiparametric estimation methodology called targeted minimum loss estimation (TMLE) [van der Laan & Rubin 2006, van der Laan & Rose 2011]. We cover the whole spectrum of its theoretical study (convergence of the iterative procedure which is at the core of the TMLE methodology; consistency and asymptotic normality of the estimator), practical implementation, simulation study and application to a genomic example that originally motivated this article. In the latter, the exposure X and response Y are, respectively, the DNA copy number and expression level of a given gene in a cancer cell. Here, the reference level is x0 = 2, that is, the expected DNA copy number in a normal cell. The confounder is a measure of the methylation of the gene. The fact that there is no clear biological indication that X and Y can be interpreted as an exposure and a response, respectively, is not problematic.
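Since the clever covariate for this continuous-exposure parameter is more involved, the sketch below illustrates the generic TMLE loop the abstract refers to (initial fit, clever covariate, fluctuation, targeted plug-in) on the textbook case of a binary exposure and the average treatment effect; it is a stand-in for, not a reproduction of, the paper's estimator, and all data and models are illustrative.

```python
# Generic TMLE loop, shown for the average treatment effect with binary exposure.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 1))                   # confounder (e.g., methylation)
pX = 1 / (1 + np.exp(-W[:, 0]))               # true exposure mechanism P(X=1 | W)
X = (rng.uniform(size=n) < pX).astype(float)
Y = 0.5 * X + W[:, 0] + rng.normal(size=n)    # true effect is 0.5

# Step 1: initial estimates of Qbar(X, W) = E[Y | X, W] and g(W) = P(X=1 | W).
Q = LinearRegression().fit(np.column_stack([X, W]), Y)
g_hat = LogisticRegression().fit(W, X).predict_proba(W)[:, 1]
Q1 = Q.predict(np.column_stack([np.ones(n), W]))
Q0 = Q.predict(np.column_stack([np.zeros(n), W]))
QX = np.where(X == 1, Q1, Q0)

# Step 2: one linear fluctuation along the clever covariate H = X/g - (1-X)/(1-g).
H = X / g_hat - (1 - X) / (1 - g_hat)
eps = np.dot(H, Y - QX) / np.dot(H, H)        # no-intercept least squares

# Step 3: targeted plug-in estimate of E[Y(1)] - E[Y(0)].
ate = np.mean((Q1 + eps / g_hat) - (Q0 - eps / (1 - g_hat)))
print(f"TMLE estimate: {ate:.3f} (truth: 0.5)")
```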