Robust estimation for heavy-tailed distributions
In this thesis, we are interested in estimating the mean of heavy-tailed random variables. We focus on robust mean estimation as an alternative to the classical empirical mean. The goal is to develop sub-Gaussian concentration inequalities for the estimation error; in other words, we seek the strong concentration results usually obtained for bounded random variables, in a setting where the boundedness condition is replaced by a finite-variance condition. Two existing estimators of the mean of a real-valued random variable are invoked and their concentration results are recalled. Several new higher-dimensional adaptations are discussed. Using these estimators, we introduce a new version of empirical risk minimization for heavy-tailed random variables, and some applications are developed. These results are illustrated by simulations on artificial data samples. Lastly, we study the multivariate case in the U-statistics setting, where the previous estimators once again offer a natural generalization of estimators from the literature.
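The abstract does not name the two real-valued mean estimators it builds on, but the median-of-means estimator is the classical example of a mean estimator with sub-Gaussian deviations under only a finite-variance assumption. A minimal sketch (the block count k=20 and the Pareto tail index are illustrative choices, not taken from the thesis) may help fix ideas:

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of the mean of a 1-D sample.

    Split the sample into k blocks, average each block, and return the
    median of the block means.  Under a finite-variance assumption this
    estimator enjoys sub-Gaussian deviation bounds, unlike the plain
    empirical mean on heavy-tailed data.
    """
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x, k)             # k blocks of (almost) equal size
    block_means = [b.mean() for b in blocks]
    return float(np.median(block_means))

rng = np.random.default_rng(0)
# Heavy-tailed sample: Pareto with tail index 2.5 (finite variance, heavy tail)
sample = rng.pareto(2.5, size=10_000)
est = median_of_means(sample, k=20)          # true mean is 1/(2.5-1) = 0.667
```

The number of blocks k trades robustness against efficiency: larger k tolerates more outliers but averages over smaller blocks.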
Practical targeted learning from large data sets by survey sampling
We address the practical construction of asymptotic confidence intervals for smooth (i.e., path-wise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data, in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is then carried out on the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE), which enables the construction of the confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples in which the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results. Keywords: semiparametric inference; survey sampling; targeted minimum loss estimation (TMLE).
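The paper uses Poisson rejective sampling, which conditions on a fixed sub-sample size; the simpler unconditioned Poisson scheme below, where each record is kept independently with its own inclusion probability and re-weighted by the inverse probability (Horvitz-Thompson weights), sketches the sub-sampling step only. The function name, the proportional-to-summary choice of probabilities, and the target size are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def poisson_subsample(summary, n_target, rng):
    """Select a sub-sample by Poisson sampling with unequal inclusion
    probabilities.

    Each unit i is kept independently with probability pi_i proportional
    to its summary measure, scaled so that the expected sub-sample size
    is n_target.  The returned weights 1/pi_i are the Horvitz-Thompson
    weights to use when estimating from the sub-sample.
    """
    summary = np.asarray(summary, dtype=float)
    pi = n_target * summary / summary.sum()
    pi = np.clip(pi, 0.0, 1.0)               # probabilities must stay in [0, 1]
    keep = rng.random(summary.size) < pi     # independent inclusion decisions
    idx = np.flatnonzero(keep)
    return idx, 1.0 / pi[idx]

rng = np.random.default_rng(0)
summary = rng.uniform(0.5, 2.0, size=1_000_000)   # cheap summary of each record
idx, w = poisson_subsample(summary, n_target=5_000, rng=rng)
```

The sum of the weights is an unbiased estimate of the full-data size, which is the sanity check one would run before carrying out targeted learning on the sub-sample.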
On the consistency of a random forest algorithm in the presence of missing entries
This paper tackles the problem of constructing a non-parametric predictor when the latent variables are observed with incomplete information. A convenient predictor for this task is the random forest algorithm in conjunction with the so-called CART criterion. The proposed technique enables a partial imputation of the missing values in the data set in a way that suits both a consistent estimation of the regression function and a probabilistic recovery of the missing values. A proof of the consistency of the random forest estimator, which also simplifies previous proofs of the classical consistency, is given in the case where each latent variable is missing completely at random (MCAR).
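The MCAR setting the consistency proof assumes can be simulated directly. The sketch below uses generic per-column mean imputation and scikit-learn's off-the-shelf forest, not the paper's partial-imputation scheme or its specific CART-based forest; the missingness rate and toy regression function are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy regression problem: y = x0 + 2*x1 + noise
n = 2_000
X = rng.normal(size=(n, 3))
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# MCAR mechanism: each entry of X is hidden independently with probability 0.2,
# regardless of its value -- this is what "missing completely at random" means
mask = rng.random(X.shape) < 0.2
X_obs = np.where(mask, np.nan, X)

# Simple imputation: replace each missing entry by the observed column mean,
# then fit the forest on the imputed design matrix
col_means = np.nanmean(X_obs, axis=0)
X_imp = np.where(np.isnan(X_obs), col_means, X_obs)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_imp, y)
score = forest.score(X_imp, y)               # in-sample R^2
```

Mean imputation is known to bias the regression estimate in general; the point of the paper is precisely to design an imputation that preserves consistency, which this generic sketch does not claim to do.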
GROS: A General Robust Aggregation Strategy
A new, very general, robust procedure for combining estimators in metric spaces, GROS, is introduced. The method is reminiscent of the well-known median of means, as described in \cite{devroye2016sub}. Initially, the sample is divided into groups. Subsequently, an estimator is computed for each group. Finally, these estimators are combined using a robust procedure. We prove that the resulting estimator is sub-Gaussian and derive its breakdown point, in the sense of Donoho. The robust procedure involves a minimization problem over a general metric space, but we show that the same sub-Gaussianity (up to a constant) is obtained if the minimization is taken over the sample, making GROS feasible in practice. The performance of GROS is evaluated through five simulation studies: the first focuses on classification using k-means, the second on the multi-armed bandit problem, and the third on the regression problem. The fourth is the set estimation problem under a noisy model. Lastly, we apply GROS to obtain a robust persistence diagram.
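The split / estimate-per-group / robust-combine pipeline can be sketched in a few lines. As a tractable stand-in for the abstract's minimization, the sketch below restricts the minimization to the computed group estimates themselves (a medoid), which is a further simplification of the paper's restriction to the sample; the function names, the group count, and the contamination scheme are illustrative assumptions:

```python
import numpy as np

def gros(sample, estimator, k, metric):
    """GROS-style robust aggregation (sketch).

    Split the sample into k groups, compute `estimator` on each group,
    then return the group estimate minimizing the sum of distances to
    the other group estimates.  Restricting the minimization to the
    computed estimates keeps the procedure tractable.
    """
    groups = np.array_split(sample, k)
    ests = [estimator(g) for g in groups]
    costs = [sum(metric(e, f) for f in ests) for e in ests]
    return ests[int(np.argmin(costs))]

rng = np.random.default_rng(0)
# 2-D data centered at the origin, with a few grossly corrupted points
data = rng.normal(0.0, 1.0, size=(1_000, 2))
data[:10] = 100.0                            # contaminated points
robust_mean = gros(data,
                   estimator=lambda g: g.mean(axis=0),
                   k=25,
                   metric=lambda a, b: float(np.linalg.norm(a - b)))
```

With 25 groups, the 10 corrupted points can spoil at most a minority of the group estimates, and the medoid step discards those outlying estimates, echoing the breakdown-point analysis in the abstract.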
Invasiveness of an introduced species: the role of hybridization and ecological constraints
Introduced species are confronted with new environments to which they need to adapt. However, the ecological success of an introduced species is generally difficult to predict, especially when hybridizations may be involved in the invasion success. In western Europe, the lake frog Pelophylax ridibundus appears to be particularly successful. A reason for this species' success might be the presence of the invader's genetic material prior to the introduction, in the form of a hybrid between P. ridibundus and a second indigenous water frog species. These hybrids reproduce by hybridogenesis, only transmitting the ridibundus genome to gametes and backcrossing with the indigenous species (i.e. P. lessonae). This reproductive system allows the hybrid to be independent from P. ridibundus, and allows the ridibundus genome to be more widely spread than the species itself. Matings among hybrids produce newly formed P. ridibundus offspring (N), if the genomes are compatible. Therefore, we hypothesize that hybridogenesis increases the invasiveness of P. ridibundus (1) by enhancing propagule pressure through N individuals, and/or (2) by increasing adaptation of invaders to the native water frogs' habitat through hybrid-derived ridibundus genomes that are locally adapted. We find support for the first hypothesis because a notable fraction of N tadpoles is viable. However, in our semi-natural experiments they did not outperform ridibundus tadpoles in the native water frogs' habitat, nor did they differ physiologically. This does not support the second hypothesis and highlights ecological constraints on the invasion. However, we cannot rule out that these constraints may fall with ongoing selection, making a replacement of indigenous species highly probable in the future.
A tractable non-adaptive group testing method for non-binary measurements
The original problem of group testing consists in identifying defective items in a collection by applying tests to groups of items, each test detecting the presence of at least one defective item in the group. The aim is then to identify all defective items of the collection with as few tests as possible. This problem is relevant in several fields, among which biology and computer science. It recently gained attention as a potential tool to address a shortage of COVID-19 test kits, in particular for RT-qPCR. However, the original group testing problem is not an exact match to this application. Indeed, contrary to the original problem, the PCR testing employed for the detection of COVID-19 returns more than a simple binary contaminated/non-contaminated value when applied to a group of samples collected from different individuals: it gives a real value representing the viral load in the sample. We study here how this extra piece of information can be used to construct one-stage pool testing algorithms on an idealized version of this model. We show that, under the right conditions, the total number of tests needed to detect contaminated samples diminishes drastically.
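In the idealized model, each pooled measurement is the total viral load of the pooled samples, so the measurements are a linear map of the (sparse, nonnegative) load vector. The sketch below illustrates this information gain with a generic one-stage recovery by nonnegative least squares on a random pooling design; this is not the paper's algorithm, and the design, sizes, and detection threshold are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_items, n_tests, n_defective = 100, 30, 3   # far fewer tests than items

# Random pooling design: test j includes sample i with probability 1/2
A = (rng.random((n_tests, n_items)) < 0.5).astype(float)

# True viral loads: zero for healthy samples, positive for defective ones
x_true = np.zeros(n_items)
defective = rng.choice(n_items, size=n_defective, replace=False)
x_true[defective] = rng.uniform(1.0, 5.0, size=n_defective)

# Each pooled measurement is the total load of the samples in the pool,
# a real value -- not just the binary outcome of classical group testing
y = A @ x_true

# One-stage recovery: nonnegative least squares on the pooled measurements
x_hat, _ = nnls(A, y)
detected = np.flatnonzero(x_hat > 0.5)
```

With binary outcomes, 30 tests could not in general separate 3 defectives among 100 items in one stage; the real-valued loads carry the extra information that makes the drastic reduction in tests plausible.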