Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization
Undetected overfitting can occur when there are significant redundancies
between training and validation data. We describe AVE, a new measure of
training-validation redundancy for ligand-based classification problems that
accounts for the similarity amongst inactive molecules as well as among active
ones. We investigate seven widely used benchmarks for virtual screening and
classification, and show that the amount of AVE bias strongly correlates with
the performance of ligand-based predictive methods irrespective of the
predicted property, chemical fingerprint, similarity measure, or
previously applied unbiasing techniques. Therefore, it may be that the
previously reported performance of most ligand-based methods can be explained
by overfitting to the benchmarks rather than by good prospective accuracy.
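As a rough illustration of the kind of redundancy measure described above, the sketch below scores how much closer each validation class sits to same-class training molecules than to opposite-class ones, using mean nearest-neighbor Tanimoto similarity on binary fingerprints. The function names and the use of raw mean similarity are assumptions for illustration; this is not the published AVE formulation.

```python
import numpy as np

def mean_nn_similarity(val, train):
    """Mean, over validation molecules, of each molecule's nearest-neighbor
    Tanimoto similarity to the training set. Inputs are binary fingerprint
    matrices of shape (n_molecules, n_bits)."""
    inter = val @ train.T                                    # pairwise intersection counts
    union = val.sum(1)[:, None] + train.sum(1)[None, :] - inter
    tanimoto = inter / np.maximum(union, 1)                  # guard empty fingerprints
    return tanimoto.max(axis=1).mean()

def redundancy_bias(val_act, val_inact, train_act, train_inact):
    """AVE-style redundancy score: positive when validation actives (resp.
    inactives) sit closer to training actives (resp. inactives) than to the
    opposite class. Illustrative only, not the paper's exact definition."""
    active_term = (mean_nn_similarity(val_act, train_act)
                   - mean_nn_similarity(val_act, train_inact))
    inactive_term = (mean_nn_similarity(val_inact, train_inact)
                     - mean_nn_similarity(val_inact, train_act))
    return active_term + inactive_term
```

A score near zero suggests the split gives a method little opportunity to succeed by memorizing near-duplicates; large positive values flag exactly the redundancy the abstract argues inflates benchmark performance.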
Technical note: Bias and the quantification of stability
Research on bias in machine learning algorithms has generally been concerned with the
impact of bias on predictive accuracy. We believe that there are other factors that should
also play a role in the evaluation of bias. One such factor is the stability of the algorithm;
in other words, the repeatability of the results. If we obtain two sets of data from the same
phenomenon, with the same underlying probability distribution, then we would like our
learning algorithm to induce approximately the same concepts from both sets of data. This
paper introduces a method for quantifying stability, based on a measure of the agreement
between concepts. We also discuss the relationships among stability, predictive accuracy,
and bias.
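One simple way to operationalize the agreement measure this note proposes is to train the same algorithm on two samples from the same phenomenon and compare the induced concepts on a common set of reference points. The sketch below is an assumption-laden toy: the choice of scikit-learn's DecisionTreeClassifier and of plain prediction agreement as the statistic are illustrative, not the paper's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stability(X1, y1, X2, y2, X_ref, model_factory=DecisionTreeClassifier):
    """Fit the same learning algorithm to two data sets drawn from the same
    distribution and return the fraction of reference points on which the
    two induced concepts agree (1.0 = perfectly repeatable)."""
    concept_1 = model_factory().fit(X1, y1)
    concept_2 = model_factory().fit(X2, y2)
    return np.mean(concept_1.predict(X_ref) == concept_2.predict(X_ref))
```

Reported alongside predictive accuracy, such a score makes the trade-off the paper discusses visible: a heavily biased learner can be very stable yet inaccurate, while a low-bias learner may fit each sample well yet induce quite different concepts from run to run.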
Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent
The 0/1 loss is an important cost function for perceptrons. Nevertheless, it cannot be easily minimized by most existing perceptron learning algorithms. In this paper, we propose a family of random coordinate descent algorithms that directly minimize the 0/1 loss for perceptrons, and we prove their convergence. Our algorithms are computationally efficient and usually achieve the lowest 0/1 loss compared with other algorithms. These advantages make them well suited to nonseparable real-world problems. Experiments show that our algorithms are especially useful for ensemble learning and, when coupled with AdaBoost, can achieve the lowest test error on many complex data sets.
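The observation that makes direct 0/1 minimization tractable along one direction is that the loss is piecewise constant in the step size, with breakpoints wherever some example's margin crosses zero, so the best step can be found by enumerating breakpoints. The sketch below is a minimal single-coordinate version built on that observation, assuming labels in {-1, +1}; the paper's family of algorithms (and its AdaBoost coupling) is more general than this toy.

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Fraction of examples misclassified by sign(X @ w), labels in {-1, +1}."""
    return np.mean(np.sign(X @ w) != y)

def random_coordinate_descent(X, y, n_iters=500, eps=1e-8, seed=0):
    """Directly minimize 0/1 loss: along a random coordinate the loss is
    piecewise constant, so evaluate a point just before and just after each
    breakpoint where an example's margin flips sign, and keep the best."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = np.zeros(d)
    loss = zero_one_loss(w, X, y)
    for _ in range(n_iters):
        j = rng.integers(d)
        xj = X[:, j]
        nz = xj != 0
        if not nz.any():
            continue
        margins = X @ w
        breaks = -margins[nz] / xj[nz]           # steps where a margin hits zero
        cands = np.concatenate([breaks - eps, breaks + eps])
        losses = [np.mean(np.sign(margins + t * xj) != y) for t in cands]
        k = int(np.argmin(losses))
        if losses[k] < loss:                     # accept only strict improvement
            w[j] += cands[k]
            loss = losses[k]
    return w, loss
```

Because every candidate step is evaluated against the current weights before any update is accepted, each iteration can only keep or lower the training 0/1 loss, the kind of monotone progress on which such convergence arguments typically rest.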