
    Error rate estimates for the <i>binary null case study (balanced)</i>.

    <p>Shown are different error rate estimates for the setting with two response classes of equal size and without any predictors with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 500 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    Overview of high-dimensional genomic data sets.

    <p>Overview of high-dimensional genomic data sets.</p>

    Error rate estimates for simulation studies with many predictors with effect and <i>n</i> = 20.

    <p>Shown are different error rate estimates for an additional simulation study with two response classes of equal size and many predictor variables with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with sample size <i>n</i> = 20 and different numbers of predictors, <i>p</i>. The mean error rate over 500 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    On the overestimation of random forest’s out-of-bag error

    <div><p>The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as <i>mtry</i>. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error, depending on the choice of the random forest parameters. Based on simulated and real data, this paper aims to identify settings in which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used for selecting tuning parameters like <i>mtry</i> in classification tasks, because the overestimation is seen to depend on the parameter <i>mtry</i> itself. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors, and weak effects. The overestimation had hardly any impact on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when the out-of-bag error was used for tuning parameter selection in the present studies, one cannot be sure that this holds for all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling, with sampling fractions proportional to the class sizes, for both tuning parameter selection and error estimation in random forests; this yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.</p></div>
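    The contrast between the plain OOB estimate and a stratified resampling estimate can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's setup: it uses scikit-learn's <i>RandomForestClassifier</i> and <i>StratifiedKFold</i> as stand-ins for the R implementation the overestimation was originally reported for, and since scikit-learn offers no stratified bootstrap inside the forest itself, stratified cross-validation plays the role of the stratified estimator; the sample sizes and seeds are arbitrary choices for the sketch.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(1)
    n, p = 40, 100                          # few observations, many predictors
    X = rng.normal(size=(n, p))             # null case: no predictor carries signal
    y = np.repeat([0, 1], n // 2)           # two balanced classes; true error is 0.5

    # OOB error from ordinary (unstratified) bootstrap sampling
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    oob_error = 1.0 - rf.oob_score_

    # Stratified CV error: every fold keeps the 50/50 class proportions
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    fold_errors = []
    for train_idx, test_idx in cv.split(X, y):
        m = RandomForestClassifier(n_estimators=200, random_state=0)
        m.fit(X[train_idx], y[train_idx])
        fold_errors.append(1.0 - m.score(X[test_idx], y[test_idx]))
    cv_error = float(np.mean(fold_errors))

    print(f"OOB error:           {oob_error:.3f}")
    print(f"stratified CV error: {cv_error:.3f}")
    ```

    On null data like this the true error is 0.5; in the regimes described above (few observations, many noise predictors, balanced classes), the plain OOB estimate tends to land above it.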

    Error rate estimates for the <i>binary null case study (unbalanced)</i>.

    <p>Shown are different error rate estimates for the setting with two response classes of unequal size (smaller class containing 30% of the observations) and without any predictors with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 500 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    Class imbalance in subsamples drawn from a balanced original sample.

    <p>Distribution of the frequency of class 1 observations in subsamples of size ⌊0.632<i>n</i>⌋, randomly drawn from a balanced sample with a total of (a) <i>n</i> = 1000, (b) <i>n</i> = 100, and (c) <i>n</i> = 20 observations from classes 1 and 2.</p>
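    The mechanism behind this figure, namely that without-replacement subsamples of size ⌊0.632<i>n</i>⌋ become increasingly imbalanced as <i>n</i> shrinks, can be reproduced with standard-library Python alone; the replication count of 2000 is an arbitrary choice for the sketch.

    ```python
    import random

    def class1_fractions(n, n_rep=2000, seed=0):
        """Fraction of class-1 observations in subsamples of size floor(0.632*n)
        drawn without replacement from a balanced sample of classes 1 and 2."""
        rng = random.Random(seed)
        labels = [1] * (n // 2) + [2] * (n // 2)
        size = int(0.632 * n)
        return [rng.sample(labels, size).count(1) / size for _ in range(n_rep)]

    # The smaller the original sample, the more the class-1 share scatters
    # around 0.5, i.e. the more often a tree is grown on an imbalanced subsample.
    for n in (1000, 100, 20):
        fracs = class1_fractions(n)
        print(n, round(min(fracs), 2), round(max(fracs), 2))
    ```

    The fractions are centred at 0.5 for every <i>n</i>, but their spread widens sharply as <i>n</i> drops, which is what drives the trees' preference for the (locally) larger class.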

    Error rate estimates for the <i>real data null case study without correlations</i>.

    <p>Shown are different error rate estimates for studies based on six real data sets with uncorrelated predictors and two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 1000 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>

    The effect of the bias of OOB error on RF’s performance when used for <i>mtry</i> selection.

    <p>Performance of RF classifiers when <i>mtry</i> was selected based on the OOB error, the stratified OOB error, the unstratified CV error, and the stratified CV error for the additional simulation studies with many variables with effect. The performance of RF was measured using a large independent test data set.</p>

    The trees’ preference for predicting the larger class as a function of <i>mtry</i>.

    <p>Fraction of class 1 (minority class in the training sample) predictions obtained for balanced test samples with 5000 observations each from classes 1 and 2, and <i>p</i> = 100 (null case setting). Predictions were obtained by RFs with a specific <i>mtry</i> (<i>x</i>-axis). RFs were trained on <i>n</i> = 30 observations (10 from class 1 and 20 from class 2) with <i>p</i> = 100. Results are shown for 500 repetitions.</p>
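    A scaled-down version of this experiment can be sketched with scikit-learn, under assumptions: a single repetition with 200 trees instead of 500 repetitions, a 2000-observation test sample instead of 10000, and scikit-learn's <i>max_features</i> standing in for <i>mtry</i>; none of this is the paper's exact configuration.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    p = 100
    # training sample: 10 class-1 and 20 class-2 observations, pure noise (null case)
    X_train = rng.normal(size=(30, p))
    y_train = np.array([1] * 10 + [2] * 20)
    # balanced null-case test sample
    X_test = rng.normal(size=(2000, p))

    for mtry in (1, 10, 100):          # max_features plays the role of mtry
        rf = RandomForestClassifier(n_estimators=200, max_features=mtry,
                                    random_state=0)
        rf.fit(X_train, y_train)
        frac_class1 = float(np.mean(rf.predict(X_test) == 1))
        print(f"mtry={mtry:3d}  fraction predicted as class 1: {frac_class1:.2f}")
    ```

    Because the trees are grown on subsamples that over-represent class 2, the fraction of class-1 predictions on a balanced test set stays below one half, illustrating the preference for the larger class described in the caption.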

    Error rate estimates for the <i>real data study</i>.

    <p>Shown are different error rate estimates for six real data sets with two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, <i>n</i>, and numbers of predictors, <i>p</i>. The mean error rate over 1000 repetitions was obtained for a range of <i>mtry</i> values. The vertical grey dashed line in each plot indicates the most commonly used default choice for <i>mtry</i> in classification tasks, that is, <i>mtry</i> = ⌊√<i>p</i>⌋.</p>