
The interaction of sampling ratio and modelling method in prediction of binary target with rare target class

Abstract

In many practical predictive data mining problems with a binary target, one of the target classes is rare. In such a situation it is common practice to decrease the ratio of common to rare class cases in the training set by under-sampling the common class. The relationship between the ratio of common to rare class cases in the training set and model performance was investigated empirically on three artificial and three real-world data sets. The results indicated that a flexible modelling method without regularisation benefits in both mean and variance of performance from a larger ratio when evaluated on a criterion sensitive to overfitting, and benefits in mean but not variance of performance when evaluated on a criterion less sensitive to overfitting. For an inflexible modelling method and a flexible method with regularisation, the effects of a larger ratio were less consistent. In no circumstances was a larger ratio found to be detrimental to model performance, whichever criterion was used to measure it.
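The under-sampling step described above can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `undersample`, its parameters, and the toy labels are hypothetical, and it assumes simple random sampling of the common class without replacement while keeping every rare case.

```python
import random


def undersample(labels, ratio, rare=1, seed=0):
    """Return sorted indices of a training subset.

    All rare-class cases are kept. Common-class cases are sampled at random
    without replacement so that there are at most `ratio` common cases per
    rare case (hypothetical helper, for illustration only).
    """
    rng = random.Random(seed)
    rare_idx = [i for i, y in enumerate(labels) if y == rare]
    common_idx = [i for i, y in enumerate(labels) if y != rare]
    # Never ask for more common cases than the data actually contains.
    n_keep = min(len(common_idx), int(ratio * len(rare_idx)))
    kept_common = rng.sample(common_idx, n_keep)
    return sorted(rare_idx + kept_common)


# Toy data (assumed): 10 rare cases (label 1) among 1000 cases.
labels = [1] * 10 + [0] * 990
idx = undersample(labels, ratio=5)
```

Varying `ratio` while holding the rare cases fixed gives training sets of the kind the study compares; a larger `ratio` simply retains more of the common class.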
