Modern machine learning techniques can be used to construct powerful models
for difficult collider physics problems. In many applications, however, these
models are trained on imperfect simulations due to a lack of truth-level
information in the data, which risks the model learning artifacts of the
simulation. In this paper, we introduce the paradigm of classification without
labels (CWoLa) in which a classifier is trained to distinguish statistical
mixtures of classes, which are common in collider physics. Crucially, neither
individual labels nor class proportions are required, yet we prove that the
optimal classifier in the CWoLa paradigm is also the optimal classifier in the
traditional fully-supervised case where all label information is available.
After demonstrating the power of this method in an analytical toy example, we
consider a realistic benchmark for collider physics: distinguishing quark-
versus gluon-initiated jets using mixed quark/gluon training samples. More
generally, CWoLa can be applied to any classification problem where labels or
class proportions are unknown or simulations are unreliable, but statistical
mixtures of the classes are available.

Comment: 18 pages, 5 figures; v2: intro extended and references added; v3:
additional discussion to match JHEP version
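
The optimality claim in the abstract can be illustrated with a short likelihood-ratio argument. The following is a sketch under stated assumptions, not a reproduction of the paper's proof. The notation is introduced here for illustration only: $p_S$ and $p_B$ are the densities of the two pure classes, and $f_1$, $f_2$ are the unknown fractions of the first class in the two mixed samples, assumed to satisfy $f_1 > f_2$.

```latex
% Assumed notation (not taken from the abstract):
%   p_S(x), p_B(x) : densities of the two pure classes
%   f_1 > f_2      : unknown fractions of class S in mixed samples M_1, M_2
\begin{align}
  p_{M_i}(x) &= f_i\, p_S(x) + (1 - f_i)\, p_B(x), \qquad i = 1, 2, \\
  L_{S/B}(x) &= \frac{p_S(x)}{p_B(x)}, \\
  L_{M_1/M_2}(x) &= \frac{p_{M_1}(x)}{p_{M_2}(x)}
    = \frac{f_1\, L_{S/B}(x) + (1 - f_1)}{f_2\, L_{S/B}(x) + (1 - f_2)}, \\
  \frac{\partial L_{M_1/M_2}}{\partial L_{S/B}}
    &= \frac{f_1 - f_2}{\bigl(f_2\, L_{S/B} + 1 - f_2\bigr)^{2}} > 0 .
\end{align}
```

Because $L_{M_1/M_2}$ is a strictly increasing function of $L_{S/B}$ whenever $f_1 > f_2$, thresholding either ratio selects the same regions of $x$. By the Neyman-Pearson lemma, the likelihood ratio gives the optimal classifier in each case, so a classifier that is optimal for separating the two mixtures is also optimal for separating the pure classes. The values of $f_1$ and $f_2$ never need to be known; the argument only requires that they differ.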