
    Error rate estimates for the <i>binary null case study (balanced)</i>.

    <p>Shown are different error rate estimates for the setting with two response classes of equal size and without any predictors with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 500 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    Overview of high-dimensional genomic data sets.

    <p>Overview of high-dimensional genomic data sets.</p>

    Error rate estimates for simulation studies with many predictors with effect and <i>n</i> = 20.

    <p>Shown are different error rate estimates for an additional simulation study with two response classes of equal size and many predictor variables with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with sample size <i>n</i> = 20 and different numbers of predictors, <i>p</i>. The mean error rate over 500 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    On the overestimation of random forest’s out-of-bag error

    <div><p>The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as <i>mtry</i>. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error, depending on the choice of the random forest parameters. Based on simulated and real data, this paper aims to identify settings in which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used for selecting tuning parameters like <i>mtry</i> in classification tasks, because the overestimation is seen to depend on the parameter <i>mtry</i> itself. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors, and weak effects. The overestimation had hardly any impact on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when the out-of-bag error was used for tuning parameter selection in the present studies, one cannot be sure that this holds for all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling, with sampling fractions proportional to the class sizes, for both tuning parameter selection and error estimation in random forests; this yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.</p></div>
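    The contrast between the plain OOB estimate and a stratified resampling estimate can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's setup: it uses scikit-learn's <i>RandomForestClassifier</i> and <i>StratifiedKFold</i> as stand-ins for the R implementation the overestimation was originally reported for, and since scikit-learn offers no stratified bootstrap inside the forest itself, stratified cross-validation plays the role of the stratified estimator; the sample sizes and seeds are arbitrary choices for the sketch.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(1)
    n, p = 40, 100                          # few observations, many predictors
    X = rng.normal(size=(n, p))             # null case: no predictor carries signal
    y = np.repeat([0, 1], n // 2)           # two balanced classes; true error is 0.5

    # OOB error from ordinary (unstratified) bootstrap sampling
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    oob_error = 1.0 - rf.oob_score_

    # Stratified CV error: every fold keeps the 50/50 class proportions
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    fold_errors = []
    for train_idx, test_idx in cv.split(X, y):
        m = RandomForestClassifier(n_estimators=200, random_state=0)
        m.fit(X[train_idx], y[train_idx])
        fold_errors.append(1.0 - m.score(X[test_idx], y[test_idx]))
    cv_error = float(np.mean(fold_errors))

    print(f"OOB error:           {oob_error:.3f}")
    print(f"stratified CV error: {cv_error:.3f}")
    ```

    On null data like this the true error is 0.5; in the regimes described above (few observations, many noise predictors, balanced classes), the plain OOB estimate tends to land above it.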

    Error rate estimates for the <i>binary null case study (unbalanced)</i>.

    <p>Shown are different error rate estimates for the setting with two response classes of unequal size (smaller class containing 30% of the observations) and without any predictors with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 500 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    Class imbalance in subsamples drawn from a balanced original sample.

    <p>Distribution of the frequency of class 1 observations in subsamples of size ⌊0.632<i>n</i>⌋, randomly drawn from a balanced sample with a total of (a) <i>n</i> = 1000, (b) <i>n</i> = 100, and (c) <i>n</i> = 20 observations from classes 1 and 2.</p>
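    The mechanism behind this figure, namely that without-replacement subsamples of size ⌊0.632<i>n</i>⌋ become increasingly imbalanced as <i>n</i> shrinks, can be reproduced with standard-library Python alone; the replication count of 2000 is an arbitrary choice for the sketch.

    ```python
    import random

    def class1_fractions(n, n_rep=2000, seed=0):
        """Fraction of class-1 observations in subsamples of size floor(0.632*n)
        drawn without replacement from a balanced sample of classes 1 and 2."""
        rng = random.Random(seed)
        labels = [1] * (n // 2) + [2] * (n // 2)
        size = int(0.632 * n)
        return [rng.sample(labels, size).count(1) / size for _ in range(n_rep)]

    # The smaller the original sample, the more the class-1 share scatters
    # around 0.5, i.e. the more often a tree is grown on an imbalanced subsample.
    for n in (1000, 100, 20):
        fracs = class1_fractions(n)
        print(n, round(min(fracs), 2), round(max(fracs), 2))
    ```

    The fractions are centred at 0.5 for every <i>n</i>, but their spread widens sharply as <i>n</i> drops, which is what drives the trees' preference for the (locally) larger class.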

    Error rate estimates for the <i>real data null case study without correlations</i>.

    <p>Shown are different error rate estimates for studies based on six real data sets with uncorrelated predictors and two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 1000 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    The effect of the bias of OOB error on RF’s performance when used for <i>mtry</i> selection.

    <p>Performance of RF classifiers when <i>mtry</i> was selected based on the OOB error, the stratified OOB error, the unstratified CV error, and the stratified CV error for the additional simulation studies with many variables with effect. The performance of RF was measured using a large independent test data set.</p>

    The trees’ preference for predicting the larger class as a function of <i>mtry</i>.

    <p>Fraction of class 1 (minority class in the training sample) predictions obtained for balanced test samples with 5000 observations each from classes 1 and 2, and <i>p</i> = 100 (null case setting). Predictions were obtained by RFs with a specific <i>mtry</i> (<i>x</i>-axis). RFs were trained on <i>n</i> = 30 observations (10 from class 1 and 20 from class 2) with <i>p</i> = 100. Results are shown for 500 repetitions.</p>
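    A scaled-down version of this experiment can be sketched with scikit-learn, under assumptions: a single repetition with 200 trees instead of 500 repetitions, a 2000-observation test sample instead of 10000, and scikit-learn's <i>max_features</i> standing in for <i>mtry</i>; none of this is the paper's exact configuration.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    p = 100
    # training sample: 10 class-1 and 20 class-2 observations, pure noise (null case)
    X_train = rng.normal(size=(30, p))
    y_train = np.array([1] * 10 + [2] * 20)
    # balanced null-case test sample
    X_test = rng.normal(size=(2000, p))

    for mtry in (1, 10, 100):          # max_features plays the role of mtry
        rf = RandomForestClassifier(n_estimators=200, max_features=mtry,
                                    random_state=0)
        rf.fit(X_train, y_train)
        frac_class1 = float(np.mean(rf.predict(X_test) == 1))
        print(f"mtry={mtry:3d}  fraction predicted as class 1: {frac_class1:.2f}")
    ```

    Because the trees are grown on subsamples that over-represent class 2, the fraction of class-1 predictions on a balanced test set stays below one half, illustrating the preference for the larger class described in the caption.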

    Error rate estimates for the <i>real data study</i>.

    <p>Shown are different error rate estimates for six real data sets with two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 1000 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>