
    An AUC-based Permutation Variable Importance Measure for Random Forests

    The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal for strongly unbalanced data, i.e. data where the response class sizes differ considerably. Suggestions have been made to obtain better classification performance based either on sampling procedures or on cost-sensitivity analyses. However, to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM in unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust to class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated and real data. The results suggest that the standard permutation VIM loses its ability to discriminate between predictors associated with the response and those that are not as class imbalance increases. It is outperformed by our new AUC-based permutation VIM in unbalanced data settings, while the performance of both VIMs is very similar for balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
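
    The abstract points to the party package for the implementation; a minimal sketch of how the two VIMs could be compared there, assuming the varimpAUC() function exported by recent party versions and a purely illustrative unbalanced two-class data set:

        # Comparing the standard and the AUC-based permutation VIM in party.
        # The simulated data are illustrative only: x1 is associated with the
        # response, x2 is noise, and the class ratio is strongly unbalanced.
        library(party)

        set.seed(1)
        n  <- 500
        x1 <- rnorm(n)
        x2 <- rnorm(n)
        y  <- factor(rbinom(n, 1, plogis(-2 + 1.5 * x1)))
        dat <- data.frame(y, x1, x2)

        rf <- cforest(y ~ ., data = dat,
                      controls = cforest_unbiased(mtry = 1, ntree = 200))

        varimp(rf)     # standard error-rate-based permutation VIM
        varimpAUC(rf)  # AUC-based permutation VIM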

    Ensemble Sales Forecasting Study in Semiconductor Industry

    Sales forecasting plays a prominent role in business planning and business strategy. The value and importance of advance information is a cornerstone of planning activity, and a well-set forecast goal can guide the sales force more efficiently. In this paper, CPU sales forecasting for Intel Corporation, a multinational semiconductor company, was considered. Past sales, future bookings, exchange rates, gross domestic product (GDP) forecasts, seasonality and other indicators were innovatively incorporated into the quantitative modeling. Benefiting from recent advances in computing power and software development, millions of models built upon multiple regression, time series analysis, random forests and boosted trees were executed in parallel. The models with the smaller validation errors were selected to form the ensemble model. To better capture distinct characteristics, forecasting models were implemented at the lead-time and line-of-business levels. A moving-window validation process automatically selected the models that most closely represent current market conditions. The weekly forecasting cadence allowed the model to respond effectively to market fluctuations. A generic variable importance analysis was also developed to increase model interpretability. Rather than assuming a fixed distribution, this non-parametric permutation variable importance analysis provides a general framework across methods for evaluating variable importance. The framework extends to classification problems by replacing the mean absolute percentage error (MAPE) with the misclassification error. Demo code is available at https://github.com/qx0731/ensemble_forecast_methods (14 pages; Industrial Conference on Data Mining, ICDM 2017).
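
    The permutation importance described here is model-agnostic: permute one predictor in the validation data, recompute the MAPE, and take the increase over the baseline as that predictor's importance. A minimal sketch of that idea follows; all function and variable names are illustrative and not taken from the linked repository:

        # Model-agnostic permutation variable importance based on MAPE.
        mape <- function(actual, pred) {
          mean(abs((actual - pred) / actual)) * 100
        }

        permutation_importance_mape <- function(predict_fun, X_valid, y_valid,
                                                n_perm = 10) {
          base_err <- mape(y_valid, predict_fun(X_valid))
          sapply(names(X_valid), function(v) {
            err <- replicate(n_perm, {
              X_perm <- X_valid
              X_perm[[v]] <- sample(X_perm[[v]])  # break the association
              mape(y_valid, predict_fun(X_perm))
            })
            mean(err) - base_err                  # increase in MAPE
          })
        }

        # Example with a linear model on R's built-in cars data:
        fit <- lm(dist ~ speed, data = cars)
        permutation_importance_mape(function(X) predict(fit, X),
                                    cars["speed"], cars$dist)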

    A computationally fast variable importance test for random forests for high-dimensional data

    Random forests are a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called variable importance measures. There are different importance measures for ranking predictor variables; the two most common are the Gini importance and the permutation importance. The latter has been found to be more reliable than the Gini importance. It is computed from the change in prediction accuracy when removing any association between the response and a predictor variable, with large changes indicating that the predictor variable is important. A drawback of these variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches are permutation-based and require the repeated computation of forests. While for low-dimensional settings those permutation-based approaches might be computationally tractable, for high-dimensional settings, which typically include thousands of candidate predictors, the computing time is enormous. A new computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables carry no information. The testing approach is based on a modified version of the permutation variable importance measure, which is inspired by cross-validation procedures. The novel testing approach is evaluated and compared to the permutation-based testing approach of Altmann and colleagues using studies on complex high-dimensional binary classification settings. The new approach controlled the type I error and had at least comparable power at a substantially smaller computation time in our studies. The new variable importance test is implemented in the R package vita.
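
    The key idea is that non-positive importance scores can only arise for uninformative variables, so mirroring them yields an approximate null distribution from a single fit, with no repeated forests. A rough sketch of that heuristic, here applied to ordinary permutation importances from ranger rather than the cross-validated variant the vita package actually implements:

        # Null-distribution heuristic: mirror the non-positive importance
        # scores to approximate the null, then derive p-values without
        # refitting any forests. Simulated data: only V1 is informative.
        library(ranger)

        set.seed(1)
        n <- 200; p <- 50
        X <- as.data.frame(matrix(rnorm(n * p), n, p))
        y <- factor(rbinom(n, 1, plogis(2 * X$V1)))
        dat <- cbind(y = y, X)

        rf  <- ranger(y ~ ., data = dat, importance = "permutation")
        imp <- rf$variable.importance

        neg  <- imp[imp < 0]
        null <- c(neg, -neg, rep(0, sum(imp == 0)))  # mirrored null

        p_values <- sapply(imp, function(v) mean(null >= v))
        head(sort(p_values), 5)  # informative variables get small p-values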

    An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low- and high-dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
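
    As a small taste of those freely available implementations, a single conditional inference tree and a random forest variant can be fitted with the party package; R's built-in iris data merely stands in for the applications discussed in the paper:

        # A single tree is transparent but unstable under small data changes;
        # a forest of such trees trades interpretability for accuracy and
        # provides variable importance measures instead.
        library(party)

        tree <- ctree(Species ~ ., data = iris)
        plot(tree)

        forest <- cforest(Species ~ ., data = iris,
                          controls = cforest_unbiased(ntree = 100))
        varimp(forest)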

    Optimized metabotype definition based on a limited number of standard clinical parameters in the population-based KORA study

    The aim of metabotyping is to categorize individuals into metabolically similar groups. Earlier studies that explored metabotyping used numerous parameters, which makes them difficult to transfer to routine practice. Therefore, this study aimed to identify metabotypes based on a set of standard laboratory parameters that are regularly determined in clinical practice. K-means cluster analysis was used to group 3001 adults from the KORA F4 cohort into three clusters. We identified the clustering parameters through variable importance methods, without including any specific disease endpoint. Several unique combinations of the selected parameters were used to create different metabotype models, which were then described and evaluated based on various metabolic parameters and on the incidence of cardiometabolic diseases. As a result, two optimal models were identified: a model composed of five clustering parameters, namely fasting glucose, HDLc, non-HDLc, uric acid, and BMI (the metabolic disease model); and a model that included four parameters, namely fasting glucose, HDLc, non-HDLc, and triglycerides (the cardiovascular disease model). The identified metabotypes are based on a few common parameters that are measured in everyday clinical practice. They are cost-effective, and can easily be applied on a large scale to identify specific risk groups that can benefit most from measures to prevent cardiometabolic diseases, such as dietary recommendations and lifestyle interventions.
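
    A minimal sketch of the clustering step for the five-parameter metabolic disease model; the kora data frame and its column names are hypothetical stand-ins for the KORA F4 variables:

        # K-means metabotyping on five standard clinical parameters.
        # `kora` and the column names below are illustrative placeholders.
        vars <- c("fasting_glucose", "hdl_c", "non_hdl_c", "uric_acid", "bmi")

        X <- scale(kora[, vars])   # standardize so no parameter dominates
        set.seed(1)
        km <- kmeans(X, centers = 3, nstart = 25)

        kora$metabotype <- factor(km$cluster)
        table(kora$metabotype)     # sizes of the three metabotype clusters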

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In this article, we present a collection of fifteen novel contributions on machine learning methods with low-quality or imperfect datasets, which were accepted for publication in the special issue “Machine Learning Methods with Noisy, Incomplete or Small Datasets”, Applied Sciences (ISSN 2076-3417). These papers provide a variety of novel approaches to real-world machine learning problems where available datasets suffer from imperfections such as missing values, noise or artefacts. Contributions in applied sciences include medical applications, epidemic management tools, methodological work, and industrial applications, among others. We believe that this special issue will bring new ideas for solving this challenging problem, and will provide clear examples of application in real-world scenarios.

    ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R

    We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high-dimensional data. Ensembles of classification, regression and survival trees are supported. We describe the implementation, provide examples, validate the package with a reference implementation, and compare runtime and memory usage with other implementations. The new software proves to scale best with the number of features, samples, trees, and features tried for splitting. Finally, we show that ranger is the fastest and most memory-efficient implementation of random forests to analyze data on the scale of a genome-wide association study.
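
    A minimal usage example, with iris as a toy stand-in for the genome-scale data the paper benchmarks on:

        # Fit a classification forest and extract permutation importances.
        library(ranger)

        rf <- ranger(Species ~ ., data = iris,
                     num.trees = 500, importance = "permutation")

        rf$prediction.error                              # out-of-bag error rate
        sort(rf$variable.importance, decreasing = TRUE)  # ranked predictors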