On the Interaction Effects Between Prediction and Clustering
Machine learning systems increasingly depend on pipelines of multiple
algorithms to provide high-quality, well-structured predictions. This paper
argues that interaction effects between clustering and prediction (e.g.
classification, regression) algorithms can cause subtle adverse behaviors
during cross-validation that may not be initially apparent. In particular, we
focus on the problem of estimating the out-of-cluster (OOC) prediction loss
given an approximate clustering with a probabilistic error rate.
Traditional cross-validation techniques exhibit significant empirical bias in
this setting, and the few attempts to estimate and correct for these effects
are intractable on larger datasets. Further, no previous work has been able to
characterize the conditions under which these empirical effects occur, and if
they do, what properties they have. We precisely answer these questions by
providing theoretical properties which hold in various settings, and prove that
expected out-of-cluster loss behavior rapidly decays with even minor clustering
errors. Fortunately, we are able to leverage these same properties to construct
hypothesis tests and scalable estimators necessary for correcting the problem.
Empirical results on benchmark datasets validate our theoretical results and
demonstrate how scaling techniques provide solutions to new classes of
problems.
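
The out-of-cluster setting described above can be illustrated with a minimal sketch. It is not the paper's estimator or hypothesis test; the function names (`out_of_cluster_folds`, `corrupt_labels`, `leakage_fraction`), the five synthetic clusters, and the 5% error rate are all assumptions chosen for illustration. The sketch holds out one cluster at a time and measures how many held-out points have their true cluster represented in the training fold once the cluster labels are noisy.

```python
import random


def out_of_cluster_folds(cluster_labels):
    """Yield (held_out_cluster, train_idx, test_idx), holding out one cluster per fold."""
    for c in sorted(set(cluster_labels)):
        test_idx = [i for i, lab in enumerate(cluster_labels) if lab == c]
        train_idx = [i for i, lab in enumerate(cluster_labels) if lab != c]
        yield c, train_idx, test_idx


def corrupt_labels(true_labels, error_rate, rng):
    """Move each point to a different cluster with probability error_rate (assumed noise model)."""
    clusters = sorted(set(true_labels))
    noisy = []
    for lab in true_labels:
        if rng.random() < error_rate:
            noisy.append(rng.choice([c for c in clusters if c != lab]))
        else:
            noisy.append(lab)
    return noisy


def leakage_fraction(true_labels, fold_labels):
    """Fraction of held-out points whose true cluster also appears in the training fold."""
    leaked = 0
    total = 0
    for _, train_idx, test_idx in out_of_cluster_folds(fold_labels):
        train_true = {true_labels[i] for i in train_idx}
        for i in test_idx:
            total += 1
            leaked += true_labels[i] in train_true
    return leaked / total


rng = random.Random(0)
true_labels = [i % 5 for i in range(500)]  # hypothetical: 5 equal-sized true clusters

# With exact cluster labels, no held-out point shares a true cluster with the training fold.
clean_leak = leakage_fraction(true_labels, true_labels)

# With a small assumed error rate, folds are built from noisy labels instead.
noisy_labels = corrupt_labels(true_labels, 0.05, rng)
noisy_leak = leakage_fraction(true_labels, noisy_labels)

print(f"leakage with exact clustering: {clean_leak:.3f}")
print(f"leakage with 5% label error:   {noisy_leak:.3f}")
```

Under this toy noise model, a single mislabeled point is enough to place a held-out cluster's true members on both sides of a split, which gives some intuition for why a cross-validated loss computed on an approximate clustering can differ from the true out-of-cluster loss.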