Tree induction and logistic regression are two standard, off-the-shelf methods for building
models for classification. We present a large-scale experimental comparison of logistic regression
and tree induction, assessing classification accuracy and the quality of rankings based on class-membership
probabilities. We use a learning-curve analysis to examine the relationship of these
measures to the size of the training set. The study yields four main findings. (1) Contrary
to some prior observations, logistic regression does not generally outperform tree induction. (2)
More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree
induction for larger data sets. Importantly, this often holds for training sets drawn from the same
domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on
a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional
wisdom, tree induction is effective at producing probability-based rankings, although apparently
comparatively less so for a given training-set size than at making classifications. Finally, (4) the
domains on which tree induction and logistic regression are each ultimately preferable can be characterized
surprisingly well by a simple measure of the separability of signal from noise.

NYU, Stern School of Business, IOMS department, Center for Digital Economy Researc