Random Forests: some methodological insights
This paper examines, from an experimental perspective, random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming known but scattered advice on using random forests, and at proposing complementary remarks both for standard problems and for high-dimensional ones, in which the number of variables greatly exceeds the sample size. The main contribution of this paper is twofold: to provide insights into the behavior of the variable importance index based on random forests and, in addition, to investigate two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and aims to design a good prediction model. The strategy involves a ranking of explanatory variables using the random forests importance score, followed by a stepwise ascending variable introduction procedure.
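A minimal sketch of the ranking step, assuming scikit-learn's RandomForestClassifier and its impurity-based importances as a stand-in for the importance score discussed in the paper; the data are synthetic and purely illustrative.

```python
# Sketch: rank explanatory variables by a random forest importance score.
# scikit-learn's impurity-based importances stand in for the paper's score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Variables ranked by decreasing importance.
ranking = np.argsort(rf.feature_importances_)[::-1]
print("Variables ranked by importance:", ranking[:10])
```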
Optimized clusters for disaggregated electricity load forecasting
To account for the variation of the portfolio of EDF (the French electricity company) following the liberalization of the electricity market, it is essential to disaggregate the global load curve. The idea is to disaggregate the global signal in such a way that the sum of the disaggregated forecasts significantly improves the prediction of the global signal itself. The strategy is to optimize a preliminary clustering of individual load curves with respect to a predictability index. The optimized clustering procedure is driven by forecasting performance via a cross-prediction dissimilarity index, and it can be viewed as a discrete gradient-type algorithm.
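The abstract does not spell out the exact procedure, so the following is only a rough sketch of the idea, assuming scikit-learn, a naive seasonal forecast error as the predictability index, and synthetic load curves; the greedy reassignment loop mimics a discrete gradient-type pass.

```python
# Rough sketch (illustrative, not EDF's procedure): refine an initial clustering
# of load curves so that the summed per-cluster forecasts improve. The
# "predictability index" is the error of a naive seasonal forecast on each
# cluster aggregate; curves, season length and cluster count are arbitrary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_curves, horizon, k, season = 60, 200, 4, 24
curves = rng.normal(size=(n_curves, horizon)).cumsum(axis=1)  # synthetic load curves

def predictability_index(labels):
    """Sum over clusters of the naive seasonal forecast error of the aggregate."""
    err = 0.0
    for c in range(k):
        agg = curves[labels == c].sum(axis=0)
        err += np.abs(agg[season:] - agg[:-season]).mean()
    return err

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(curves)
best = predictability_index(labels)

# Discrete gradient-type pass: move one curve at a time if it lowers the index.
improved = True
while improved:
    improved = False
    for i in range(n_curves):
        if (labels == labels[i]).sum() == 1:    # do not empty a cluster
            continue
        for c in range(k):
            if c == labels[i]:
                continue
            trial = labels.copy()
            trial[i] = c
            e = predictability_index(trial)
            if e < best:
                labels, best = trial, e
                improved = True

print("final predictability index:", round(best, 3))
```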
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals for scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. We then formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one of real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
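As an illustration of the divide-and-conquer flavour reviewed here, a minimal sketch assuming scikit-learn: sub-forests are grown on disjoint chunks and their trees pooled into one predictor. Chunk sizes and data are synthetic, and pooling by concatenating estimators_ assumes every chunk contains all classes.

```python
# Sketch of a divide-and-conquer random forest: grow a sub-forest per chunk,
# then pool the trees into a single forest. Assumes all classes occur in each chunk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
chunks = np.array_split(np.arange(len(y)), 4)   # stand-in for a distributed split

forests = []
for i, idx in enumerate(chunks):
    f = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=i)
    forests.append(f.fit(X[idx], y[idx]))

# Pool the per-chunk trees into a single forest.
pooled = forests[0]
for f in forests[1:]:
    pooled.estimators_ += f.estimators_
pooled.n_estimators = len(pooled.estimators_)

print("pooled forest size:", pooled.n_estimators)
print("training accuracy:", pooled.score(X, y))
```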
Variable selection using Random Forests
This paper focuses on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, and investigates two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and aims to design a good parsimonious prediction model. The main contribution is twofold: to provide experimental insights into the behavior of the variable importance index based on random forests, and to propose a strategy involving a ranking of explanatory variables using the random forests importance score followed by a stepwise ascending variable introduction procedure.
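A minimal sketch of the stepwise ascending introduction, assuming scikit-learn and out-of-bag accuracy as the selection criterion; the exact thresholds and nested-model rules of the paper are not reproduced, and the data are synthetic.

```python
# Sketch: add variables in decreasing order of importance and keep a variable
# only if it improves the out-of-bag accuracy. Criterion and data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=1)

ranker = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)
ranking = np.argsort(ranker.feature_importances_)[::-1]

selected, best_oob = [], 0.0
for v in ranking:
    candidate = selected + [v]
    rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=1)
    oob = rf.fit(X[:, candidate], y).oob_score_
    if oob > best_oob:              # keep the variable only if OOB accuracy improves
        selected, best_oob = candidate, oob

print("selected variables:", selected, "OOB accuracy:", round(best_oob, 3))
```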
Clustering functional data using wavelets
We present two methods for detecting patterns and clusters in high-dimensional, time-dependent functional data. Our methods are based on wavelet-based similarity measures, since wavelets are well suited for identifying highly discriminant local time and scale features. The multiresolution aspect of the wavelet transform provides a time-scale decomposition of the signals that allows the functional data to be visualized and clustered into homogeneous groups. For each input function, the first method uses the distribution of energy across scales of its empirical orthogonal wavelet transform to generate a compact set of features that keeps the signals well distinguishable. Our new similarity measure, combined with an efficient feature selection technique in the wavelet domain, is then used within more or less classical clustering algorithms to effectively differentiate among high-dimensional populations. The second method uses dissimilarity measures between the whole time-scale representations, based on wavelet-coherence tools; the clustering is then performed with a k-centroid algorithm starting from these dissimilarities. The practical performance of these methods, which jointly design the feature selection in the wavelet domain and the classification distance, is demonstrated through simulations as well as on daily profiles of the French electricity power demand.
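A minimal sketch of the first method, assuming PyWavelets and scikit-learn: each curve is summarized by the relative energy of its wavelet coefficients at each scale, and these feature vectors are clustered. The wavelet, the decomposition level and the synthetic curves are illustrative choices.

```python
# Sketch: wavelet energy-by-scale features, then clustering of those features.
import numpy as np
import pywt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
# Two synthetic groups of curves oscillating at different scales.
curves = np.vstack([
    np.sin(2 * np.pi * 4 * t) + 0.3 * rng.normal(size=(30, t.size)),
    np.sin(2 * np.pi * 16 * t) + 0.3 * rng.normal(size=(30, t.size)),
])

def energy_by_scale(x, wavelet="db4", level=5):
    """Relative energy of the wavelet coefficients at each scale."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    e = np.array([np.sum(c ** 2) for c in coeffs])
    return e / e.sum()

features = np.array([energy_by_scale(x) for x in curves])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print("cluster sizes:", np.bincount(labels))
```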
Classification supervisée en grande dimension. Application à l'agrément de conduite automobile
This work is motivated by a real-world industrial problem: objectivization, which consists in explaining subjective drivability using physical criteria derived from signals measured during experiments. We suggest an approach to discriminant variable selection that takes advantage of the functional nature of the data. The problem is ill-posed, since the number of explanatory variables greatly exceeds the sample size. The strategy proceeds in three steps: signal preprocessing, including wavelet denoising and synchronization; dimensionality reduction by compression onto a common wavelet basis; and finally the selection of useful variables using a stepwise strategy involving successive applications of the CART method.
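A minimal sketch of the first two steps and the final CART fit, assuming PyWavelets and scikit-learn; the threshold, the wavelet, the number of retained coefficients and the toy label are illustrative assumptions, and the synchronization step and the full stepwise CART selection are omitted.

```python
# Sketch: wavelet denoising by soft thresholding, compression onto a common set
# of coefficients (largest on average over the sample), then a CART tree.
import numpy as np
import pywt
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, length = 80, 128
signals = rng.normal(size=(n, length)).cumsum(axis=1)          # synthetic signals
labels = (signals[:, -1] > signals[:, -1].mean()).astype(int)  # toy "drivability" label

def denoised_coeffs(x, wavelet="db4"):
    flat, _ = pywt.coeffs_to_array(pywt.wavedec(x, wavelet))
    sigma = np.median(np.abs(flat)) / 0.6745                    # crude noise estimate
    thr = sigma * np.sqrt(2 * np.log(flat.size))                # universal-style threshold
    return pywt.threshold(flat, thr, mode="soft")

C = np.array([denoised_coeffs(s) for s in signals])
keep = np.argsort(np.mean(np.abs(C), axis=0))[-16:]             # common compressed basis
X = C[:, keep]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print("training accuracy of the CART step:", tree.score(X, labels))
```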
Arbres CART et Forêts aléatoires, Importance et sélection de variables
Two algorithms proposed by Leo Breiman are the subject of this article: CART trees (Classification And Regression Trees), introduced in the first half of the 1980s, and random forests, which emerged in the early 2000s. The goal is to provide, for each topic, a presentation, a theoretical guarantee, an example, and some variants and extensions. After a preamble, the introduction recalls the objectives of classification and regression problems before retracing some predecessors of random forests. A section is then devoted to CART trees, after which random forests are presented. Next, a variable selection procedure based on permutation variable importance is proposed. Finally, the adaptation of random forests to the Big Data context is sketched.
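A minimal sketch of permutation variable importance, the quantity the selection procedure is built on, assuming scikit-learn's permutation_importance on synthetic data rather than the original implementation.

```python
# Sketch: permutation importance = drop in accuracy when one variable is permuted.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)

for i in imp.importances_mean.argsort()[::-1][:5]:
    print(f"variable {i}: importance {imp.importances_mean[i]:.3f}")
```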
Forêts aléatoires : remarques méthodologiques
We consider the random forests method from a methodological point of view. Introduced by Leo Breiman in 2001, it is now widely used for both classification and regression with spectacular success. We first aim to confirm the known but scattered experimental results on the choice of the method's parameters, both for so-called "standard" problems and for "high-dimensional" ones (in which the number of variables is very large relative to the number of observations). The main contribution of this article, however, is to study the behavior of the variable importance score based on random forests and to examine two classical variable selection problems. The first is to identify the important variables for interpretation purposes, while the second, more restrictive, aims to restrict attention to a subset sufficient for prediction. The general strategy proceeds in two steps: ranking the variables according to their importance scores, followed by a sequential ascending variable introduction procedure.
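A minimal sketch of the kind of parameter study alluded to, assuming scikit-learn: out-of-bag errors compared over a few values of mtry (max_features) and of the number of trees on a synthetic dataset with many variables; the grid is illustrative.

```python
# Sketch: compare out-of-bag errors over a small grid of mtry and tree counts.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           random_state=0)
p = X.shape[1]

for mtry in (1, int(np.sqrt(p)), p // 2):
    for n_trees in (100, 500):
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=mtry,
                                    oob_score=True, random_state=0).fit(X, y)
        print(f"mtry={mtry:2d}  trees={n_trees:3d}  OOB error={1 - rf.oob_score_:.3f}")
```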
Clustering electricity consumers using high-dimensional regression mixture models
Massive amounts of information about individual (household, small and medium enterprise) consumption are now provided by new metering technologies and the smart grid. Two major uses of these data are load profiling and forecasting at different scales of the grid. Customer segmentation based on load classification is a natural approach for these purposes. We propose a new methodology based on mixtures of high-dimensional regression models. The novelty of our approach is that we focus on uncovering classes or clusters corresponding to different regression models. These classes can then be exploited for profiling, for forecasting within each class, or for bottom-up forecasts in a unified view. To demonstrate the feasibility of our approach, we consider a real dataset of 4,225 Irish individual consumers, each with 48 half-hourly meter readings per day over one year, from 1 January 2010 to 31 December 2010.
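The following is only a simplified stand-in for the high-dimensional regression mixtures of the paper: a clusterwise (k-regressions) procedure, assuming scikit-learn's Ridge and synthetic data, that alternates between fitting one regression per cluster and reassigning each consumer to the model that predicts its load best.

```python
# Sketch of clusterwise regression: alternate per-cluster regression fits and
# reassignment of each consumer to its best-fitting model. Synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, k = 200, 20, 3
X = rng.normal(size=(n, p))                     # covariates (e.g. past loads, calendar)
true_betas = rng.normal(size=(k, p))
z = rng.integers(k, size=n)                     # hidden classes
y = np.einsum("ij,ij->i", X, true_betas[z]) + 0.1 * rng.normal(size=n)

labels = rng.integers(k, size=n)                # random initial assignment
for _ in range(20):
    models = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:                       # guard against an empty cluster
            idx = rng.integers(n, size=10)
        models.append(Ridge(alpha=1.0).fit(X[idx], y[idx]))
    resid = np.column_stack([(y - m.predict(X)) ** 2 for m in models])
    new_labels = resid.argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

print("cluster sizes:", np.bincount(labels, minlength=k))
```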
- …