
    Robust estimation of heavy-tailed distributions

    In this thesis, we are interested in estimating the mean of heavy-tailed random variables. We focus on robust mean estimation as an alternative to the classical empirical mean. The goal is to develop sub-Gaussian concentration inequalities for the estimation error; in other words, we seek the strong concentration results usually obtained for bounded random variables, in a setting where the boundedness condition is replaced by a finite-variance condition. Two existing estimators of the mean of a real-valued random variable are invoked and their concentration results are recalled. Several new higher-dimensional adaptations are discussed. Using these estimators, we introduce a new version of empirical risk minimization for heavy-tailed random variables, and some applications are developed. These results are illustrated by simulations on artificial data samples. Lastly, we study the multivariate case in the context of U-statistics, where the previous estimators again yield a natural generalization of estimators from the literature.
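    For reference, the canonical example of a sub-Gaussian mean estimator under a finite-variance assumption only is the median of means. The sketch below illustrates that construction in general; the block count, the random block assignment, and the heavy-tailed test distribution are illustrative choices, not the thesis's exact estimators.

```python
import numpy as np

def median_of_means(sample, n_blocks, rng=None):
    """Median-of-means estimator of the mean.

    Splits the sample into n_blocks disjoint blocks, averages each block,
    and returns the median of the block means. Under a finite-variance
    assumption this achieves sub-Gaussian deviation bounds at confidence
    level roughly 1 - exp(-c * n_blocks).
    """
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample, dtype=float)
    shuffled = rng.permutation(sample)           # random block assignment
    blocks = np.array_split(shuffled, n_blocks)  # disjoint blocks
    return float(np.median([block.mean() for block in blocks]))

# Heavy-tailed example: Pareto sample (finite variance since alpha > 2).
rng = np.random.default_rng(1)
x = rng.pareto(a=2.5, size=10_000) + 1.0
print(median_of_means(x, n_blocks=30, rng=rng), x.mean())
```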

    Practical targeted learning from large data sets by survey sampling

    We address the practical construction of asymptotic confidence intervals for smooth (i.e., path-wise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data, in contexts where the sample size is so large that it poses computational challenges. We observe a summary measure of all the data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is then carried out on the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE), which enables the construction of the confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results. Keywords: semiparametric inference; survey sampling; targeted minimum loss estimation (TMLE).
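    The sub-sampling step can be pictured with plain Poisson sampling, where each observation is kept independently with an inclusion probability proportional to its summary measure. The sketch below is a simplification: the paper uses the rejective variant, which additionally conditions on the realized sample size, and the function and variable names here are illustrative.

```python
import numpy as np

def poisson_sample(summaries, expected_size, rng):
    """Poisson sampling with unequal inclusion probabilities.

    Unit i is kept independently with probability pi_i, where pi_i is
    proportional to summaries[i] and scaled so that sum(pi_i) equals the
    expected sub-sample size. Returns the selected indices and their
    inclusion probabilities (needed to reweight the downstream estimator).
    """
    summaries = np.asarray(summaries, dtype=float)
    pi = expected_size * summaries / summaries.sum()
    pi = np.clip(pi, 0.0, 1.0)                   # probabilities cannot exceed 1
    keep = rng.random(summaries.shape[0]) < pi   # independent Bernoulli draws
    idx = np.flatnonzero(keep)
    return idx, pi[idx]

rng = np.random.default_rng(0)
summaries = rng.uniform(0.1, 1.0, size=1_000_000)  # one summary measure per observation
idx, pi = poisson_sample(summaries, expected_size=5_000, rng=rng)
print(len(idx))  # random, but close to 5000 on average
```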

    On the consistency of a random forest algorithm in the presence of missing entries

    This paper tackles the problem of constructing a non-parametric predictor when the latent variables are observed with incomplete information. A convenient predictor for this task is the random forest algorithm in conjunction with the so-called CART criterion. The proposed technique enables a partial imputation of the missing values in the data set in a way that supports both consistent estimation of the regression function and probabilistic recovery of the missing values. A proof of consistency of the random forest estimator, which also simplifies previous proofs of the classical consistency result, is given in the case where each latent variable is missing completely at random (MCAR).
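    To fix ideas, the MCAR setting can be simulated directly: each covariate entry is hidden independently of the covariates and of the response. The sketch below pairs that mechanism with a generic mean-imputation plus random forest baseline; it is not the paper's CART-based partial-imputation procedure, and all parameter choices are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
n, d = 2_000, 5
X = rng.normal(size=(n, d))                                    # latent covariates
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=n)  # regression response

# MCAR mechanism: each entry is masked independently with probability 0.2,
# regardless of the covariate values or of the response.
mask = rng.random(X.shape) < 0.2
X_obs = np.where(mask, np.nan, X)

# Generic baseline: impute the missing entries, then fit a random forest.
X_imp = SimpleImputer(strategy="mean").fit_transform(X_obs)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_imp, y)
print(forest.score(X_imp, y))
```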

    GROS: A General Robust Aggregation Strategy

    A new, very general, robust procedure for combining estimators in metric spaces, GROS, is introduced. The method is reminiscent of the well-known median of means, as described by Devroye et al. (2016). First, the sample is divided into K groups. An estimator is then computed for each group, and these K estimators are combined using a robust procedure. We prove that the resulting estimator is sub-Gaussian and we derive its breakdown point, in the sense of Donoho. The robust procedure involves a minimization problem over a general metric space, but we show that the same sub-Gaussianity (up to a constant) is obtained if the minimization is taken over the sample, making GROS feasible in practice. The performance of GROS is evaluated through five simulation studies: the first focuses on classification using k-means, the second on the multi-armed bandit problem, the third on the regression problem, and the fourth on set estimation under a noisy model. Lastly, we apply GROS to obtain a robust persistence diagram.
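    The general pattern (split into K groups, estimate per group, combine the K estimates robustly) can be sketched as follows. The combination rule used here, keeping the group estimate with the smallest median distance to the others, is one classical way of minimizing over the sample of estimates rather than over the whole metric space; it is not claimed to be GROS's exact criterion.

```python
import numpy as np

def robust_aggregate(sample, K, estimator, metric):
    """Split the sample into K groups, apply `estimator` to each group, and
    return the group estimate whose median distance to the other group
    estimates is smallest (an illustrative robust combination rule)."""
    groups = np.array_split(np.asarray(sample), K)
    estimates = [estimator(g) for g in groups]
    med_dists = [np.median([metric(e, f) for f in estimates]) for e in estimates]
    return estimates[int(np.argmin(med_dists))]

# Example: robust mean of a heavy-tailed bivariate sample.
rng = np.random.default_rng(0)
x = rng.standard_t(df=2.5, size=(5_000, 2))  # heavy-tailed coordinates
est = robust_aggregate(
    x, K=25,
    estimator=lambda g: g.mean(axis=0),
    metric=lambda a, b: float(np.linalg.norm(a - b)),
)
print(est)
```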

    Invasiveness of an introduced species: the role of hybridization and ecological constraints

    Introduced species are confronted with new environments to which they need to adapt. However, the ecological success of an introduced species is generally difficult to predict, especially when hybridization may be involved in the invasion success. In western Europe, the lake frog Pelophylax ridibundus appears to be particularly successful. One reason for this species' success might be the presence of the invader's genetic material prior to the introduction, in the form of a hybrid between P. ridibundus and a second, indigenous water frog species. These hybrids reproduce by hybridogenesis, transmitting only the ridibundus genome to their gametes and backcrossing with the indigenous species (i.e., P. lessonae). This reproductive system allows the hybrid to be independent of P. ridibundus, and allows the ridibundus genome to spread more widely than the species itself. Matings among hybrids produce newly formed P. ridibundus offspring (N) if the genomes are compatible. We therefore hypothesize that hybridogenesis increases the invasiveness of P. ridibundus (1) by enhancing propagule pressure through N individuals, and/or (2) by increasing the adaptation of invaders to the native water frogs' habitat through hybrid-derived ridibundus genomes that are locally adapted. We find support for the first hypothesis, because a notable fraction of N tadpoles is viable. However, in our semi-natural experiments they did not outperform ridibundus tadpoles in the native water frogs' habitat, nor did they differ physiologically. This does not support the second hypothesis and highlights ecological constraints on the invasion. We cannot rule out, however, that these constraints may weaken with ongoing selection, making a replacement of the indigenous species highly probable in the future.

    A tractable non-adaptive group testing method for non-binary measurements

    The original problem of group testing consists in identifying defective items in a collection by applying tests to groups of items, where each test detects the presence of at least one defective item in the group. The aim is then to identify all defective items of the collection with as few tests as possible. This problem is relevant in several fields, among which are biology and computer science. It recently gained attention as a potential tool to alleviate a shortage of COVID-19 test kits, in particular for RT-qPCR. However, the original group testing problem does not exactly match this application. Indeed, contrary to the original problem, PCR testing employed for the detection of COVID-19, when applied to a group of samples collected from different individuals, returns more than a simple binary contaminated/non-contaminated value: it gives a real value representing the viral load in the sample. We study here how this extra piece of information can be used to construct one-stage pool testing algorithms on an idealized version of this model. We show that, under the right conditions, the total number of tests needed to detect contaminated samples decreases drastically.
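    An idealized version of the model can be simulated by letting each pool measurement equal the sum of the viral loads it contains. The sketch below uses a random non-adaptive pooling design and a non-negative least-squares decoder purely for illustration; the design density, the decoding threshold, and the decoder itself are assumptions, not the paper's method.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_items, n_tests, n_defective = 200, 80, 5

# Idealized model: a pool's measurement is the sum of the viral loads it contains.
loads = np.zeros(n_items)
defective = rng.choice(n_items, size=n_defective, replace=False)
loads[defective] = rng.uniform(1.0, 10.0, size=n_defective)

# Non-adaptive design: each item joins each pool independently with probability 0.1.
pools = (rng.random((n_tests, n_items)) < 0.1).astype(float)
measurements = pools @ loads

# Illustrative decoding: non-negative least squares on the pooled measurements;
# items with a large estimated load are flagged as contaminated.
estimate, _ = nnls(pools, measurements)
print(sorted(defective), np.flatnonzero(estimate > 0.5))
```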