315 research outputs found
Risk bounds for purely uniformly random forests
Random forests, introduced by Leo Breiman in 2001, are a very effective
statistical method. The complex mechanism of the method makes theoretical
analysis difficult. Therefore, a simplified version of random forests, called
purely random forests, which can be theoretically handled more easily, has been
considered. In this paper, we introduce a variant of this kind of random
forest, which we call purely uniformly random forests. In the context of
regression problems with a one-dimensional predictor space, we show that both
random trees and random forests reach the minimax rate of convergence. In addition,
we prove that compared to random trees, random forests improve accuracy by
reducing the estimator variance by a factor of three fourths
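
As an illustrative sketch only (the abstract does not spell out the construction), one plausible reading of a purely uniformly random tree on [0, 1] draws its split points uniformly at random, independently of the data, and predicts with the mean response in each cell; a forest then averages several such trees. The function names, the number of splits `k`, and the fallback used for empty cells are assumptions made here for illustration.

```python
import bisect
import random


def fit_tree(xs, ys, k, rng):
    """Fit one tree: k uniform random split points on [0, 1], cell means as predictions."""
    splits = sorted(rng.random() for _ in range(k))
    sums = [0.0] * (k + 1)
    counts = [0] * (k + 1)
    for x, y in zip(xs, ys):
        cell = bisect.bisect_right(splits, x)  # index of the cell containing x
        sums[cell] += y
        counts[cell] += 1
    overall = sum(ys) / len(ys)  # assumed fallback for cells that received no data
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return splits, means


def predict_tree(tree, x):
    """Predict with the mean response of the cell that contains x."""
    splits, means = tree
    return means[bisect.bisect_right(splits, x)]


def fit_forest(xs, ys, k, n_trees, seed=0):
    """Fit n_trees trees, each with its own independent random partition."""
    rng = random.Random(seed)
    return [fit_tree(xs, ys, k, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the individual tree predictions."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)
```

Because the partitions do not depend on the responses, averaging over trees smooths the piecewise-constant tree estimates, which is the mechanism behind the variance reduction the abstract refers to.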
Analysis of purely random forests bias
Random forests are a very effective and commonly used statistical method, but
their full theoretical analysis is still an open problem. As a first step,
simplified models such as purely random forests have been introduced, in order
to shed light on the good performance of random forests. In this paper, we
study the approximation error (the bias) of some purely random forest models in
a regression framework, focusing in particular on the influence of the number
of trees in the forest. Under some regularity assumptions on the regression
function, we show that the bias of an infinite forest decreases at a faster
rate (with respect to the size of each tree) than that of a single tree. As a
consequence, infinite forests attain a strictly better risk rate (with respect
to the sample size) than single trees. Furthermore, our results allow us to derive
a minimum number of trees sufficient to reach the same rate as an infinite
forest. As a by-product of our analysis, we also show a link between the bias
of purely random forests and the bias of some kernel estimators
Gametocytes infectiousness to mosquitoes: variable selection using random forests, and zero inflated models
Malaria control strategies aiming at reducing disease transmission intensity
may impact both oocyst intensity and infection prevalence in the mosquito
vector. Thus far, mathematical models have failed to identify a clear relationship
between Plasmodium falciparum gametocytes and their infectiousness to
mosquitoes. Natural isolates of gametocytes are genetically diverse and
biologically complex. Infectiousness to mosquitoes relies on multiple
parameters such as density, sex-ratio, maturity, parasite genotypes and host
immune factors. In this article, we investigated how density and genetic
diversity of gametocytes affect the success of transmission in the mosquito
vector. We analyzed data for which the number of covariates plus attendant
interactions is at least of order of the sample size, precluding usage of
classical models such as general linear models. We then considered the variable
importance from random forests to address the problem of selecting the most
influential variables. The selected covariates were assessed in the zero-inflated
negative binomial model, which accommodates both over-dispersion and the sources
of non-infected mosquitoes. We found that the most important covariates related
to infection prevalence and parasite intensity are gametocyte density and
multiplicity of infection
Random Forests: some methodological insights
This paper examines, from an experimental perspective, random forests, the
increasingly used statistical method for classification and regression problems
introduced by Leo Breiman in 2001. It first aims to confirm known but sparse
advice for using random forests and to offer some complementary remarks for
both standard problems and high-dimensional ones, in which the number of
variables hugely exceeds the sample size. The main contribution of this paper
is twofold: to provide some insights into the behavior of the variable
importance index based on random forests and, in addition, to investigate two
classical issues of variable selection. The first is to find important
variables for interpretation; the second is more restrictive and aims to design
a good prediction model. The strategy
involves a ranking of explanatory variables using the random forests score of
importance and a stepwise ascending variable introduction strategy
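
The ranking-then-stepwise strategy described above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the importance scores and the `error_of` callback (which would typically return an out-of-bag or validation error for a forest fitted on the given variables) are supplied by the caller, and the `min_gain` acceptance threshold is an assumption introduced here.

```python
from typing import Callable, Dict, List, Sequence


def rank_by_importance(importance: Dict[str, float]) -> List[str]:
    """Return variable names sorted by decreasing importance score."""
    return sorted(importance, key=lambda name: importance[name], reverse=True)


def stepwise_ascending(
    ranked: Sequence[str],
    error_of: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Introduce ranked variables one at a time.

    A variable is kept only if adding it lowers the error by more than
    min_gain (a hypothetical threshold chosen for this sketch).
    """
    selected: List[str] = []
    best_error = float("inf")
    for name in ranked:
        candidate = selected + [name]
        err = error_of(candidate)
        if best_error - err > min_gain:
            selected, best_error = candidate, err
    return selected
```

Ranking first keeps the search linear in the number of variables, at the cost of never revisiting a variable once it has been rejected.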
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it also often includes online data and data
heterogeneity. Recently, some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles, in a single versatile
framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks for random
forests in the Big Data context. Finally, we experiment with five variants on
two massive datasets (15 and 120 million observations), one simulated and one
from real-world data. One variant relies on subsampling, while three others are
related to parallel implementations of random forests and involve either
various adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relies on online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations
Variable selection using Random Forests
This paper focuses on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, and investigates two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and aims to design a good parsimonious prediction model. The main contribution is twofold: to provide some experimental insights into the behavior of the variable importance index based on random forests, and to propose a strategy involving a ranking of explanatory variables using the random forests score of importance and a stepwise ascending variable introduction strategy
Risk bounds for purely uniformly random forests
Introduced by Leo Breiman in 2001, random forests are a very effective statistical method. From a theoretical point of view, their analysis is difficult because of the complexity of the algorithm. To explain their performance, simplified versions of random forests (which are therefore easier to analyze) have been introduced: purely random forests. In this article, we introduce another simplified version, which we call purely uniformly random forests. In a regression setting with a single explanatory variable, we show that both random trees and random forests attain the minimax rate of convergence. More importantly, we prove that random forests improve on the performance of random trees by reducing the variance of the associated estimators by a factor of three fourths
Fréchet random forests for metric space valued regression with non-Euclidean predictors
Random forests are a statistical learning method widely used in many areas of
scientific research because of their ability to learn complex relationships
between input and output variables and their capacity to handle
high-dimensional data. However, current random forest approaches are not
flexible enough to handle heterogeneous data such as curves, images and shapes.
In this paper, we introduce Fréchet trees and Fréchet random forests, which
can handle data for which input and output variables take values in
general metric spaces (which can be unordered). To this end, a new way of
splitting the nodes of trees is introduced and the prediction procedures of
trees and forests are generalized. Then, the random forest out-of-bag error and
variable importance score are naturally adapted. A consistency theorem for
the Fréchet regressogram predictor using data-driven partitions is given and
applied to Fréchet purely uniformly random trees. The method is studied
through several simulation scenarios on heterogeneous data combining
longitudinal, image and scalar data. Finally, two real datasets from HIV
vaccine trials are analyzed with the proposed method
CART trees and random forests: variable importance and selection
Two algorithms proposed by Leo Breiman are the subject of this article: CART trees (Classification And Regression Trees), introduced in the first half of the 1980s, and random forests, which emerged in the early 2000s. The goal is to provide, for each topic, a presentation, a theoretical guarantee, an example, and some variants and extensions. After a preamble, the introduction recalls the objectives of classification and regression problems before retracing some predecessors of random forests. A section is then devoted to CART trees, after which random forests are presented. Next, a variable selection procedure based on permutation variable importance is proposed. Finally, the adaptation of random forests to the Big Data context is sketched
- …