One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called “stationary distribution assumption” that the future and the past data sets are identical from a probabilistic standpoint. In many domains of real-world applications, such as marketing solicitation, fraud detection, drug testing, loan approval, sub-population surveys, school enrollment among others, this is rarely the case. This is because the only labeled sample available for training is biased in different ways due to a variety of practical reasons and limitations. In these circumstances, traditional methods to evaluate the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually return poor estimates of which classification algorithm, when trained on biased dataset, will be the most accurate for future unbiased dataset, among a number of competing candidates. Sometimes, the estimated order of the learning algorithms ’ accuracy could be so poor that it is not even better than random guessing. Therefore, a method to determine the most accurate learner is needed for data mining under sample selection bias for many real-world applications. We present such an approach that can determine which learner will perform the best on an unbiased test set, given a possibly biased training set, in a fraction of the computational cost to use cross-validation based approaches
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.