In this paper we study techniques for generating and evaluating
confidence bands on ROC curves. ROC curves are rapidly
becoming a commonly used evaluation tool in machine
learning, although evaluating ROC curves has thus far been limited
to studying the area under the curve (AUC) or to generating
one-dimensional confidence intervals by freezing one variable:
the false-positive rate, or the threshold on the classification scoring
function. Researchers in the medical field have long been using
ROC curves and have many well-studied methods for analyzing
such curves, including generating confidence intervals as
well as simultaneous confidence bands. In this paper we introduce
these techniques to the machine learning community and
show their empirical fitness on the Covertype data set, a standard
machine learning benchmark from the UCI repository. We
show that some of these methods work remarkably well, that others
are too loose, and that existing machine learning methods for generating
one-dimensional confidence intervals do not translate well
to the generation of simultaneous bands: the resulting bands are too tight.
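As a concrete illustration of the pointwise (fixed false-positive-rate) intervals referred to above, the following is a minimal sketch that bootstraps the true-positive rate at a grid of frozen false-positive rates. It assumes a labeled evaluation set and scikit-learn's roc_curve; the function name, grid, and bootstrap settings are illustrative choices, not the specific methods studied in the paper.

```python
# Illustrative sketch only: pointwise (fixed-FPR) bootstrap confidence intervals
# for an ROC curve. Names and parameters are hypothetical, not from the paper.
import numpy as np
from sklearn.metrics import roc_curve

def pointwise_roc_ci(y_true, scores, fpr_grid, n_boot=1000, alpha=0.05, seed=0):
    """For each frozen false-positive rate in fpr_grid, bootstrap the
    true-positive rate and return percentile (lower, upper) bounds."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    n = len(y_true)
    tprs = np.empty((n_boot, len(fpr_grid)))
    for b in range(n_boot):
        # Resample cases with replacement; assumes both classes appear
        # in each resample (reasonable for large evaluation sets).
        idx = rng.integers(0, n, n)
        fpr, tpr, _ = roc_curve(y_true[idx], scores[idx])
        # Read off the TPR at the frozen FPR values of the grid.
        tprs[b] = np.interp(fpr_grid, fpr, tpr)
    lower = np.percentile(tprs, 100 * alpha / 2, axis=0)
    upper = np.percentile(tprs, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```

Note that stitching such pointwise intervals together edge to edge does not give a band with simultaneous coverage over all operating points, which is consistent with the observation above that one-dimensional interval methods yield bands that are too tight when used as simultaneous bands.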