45,485 research outputs found
Co-training for Demographic Classification Using Deep Learning from Label Proportions
Deep learning algorithms have recently produced state-of-the-art accuracy in
many classification tasks, but this success is typically dependent on access to
many annotated training examples. For domains without such data, an attractive
alternative is to train models with light, or distant supervision. In this
paper, we introduce a deep neural network for the Learning from Label
Proportion (LLP) setting, in which the training data consist of bags of
unlabeled instances with associated label distributions for each bag. We
introduce a new regularization layer, Batch Averager, that can be appended to
the last layer of any deep neural network to convert it from supervised
learning to LLP. This layer can be implemented readily with existing deep
learning packages. To further support domains in which the data consist of two
conditionally independent feature views (e.g. image and text), we propose a
co-training algorithm that iteratively generates pseudo bags and refits the
deep LLP model to improve classification accuracy. We demonstrate our models on
demographic attribute classification (gender and race/ethnicity), which has
many applications in social media analysis, public health, and marketing. We
conduct experiments to predict demographics of Twitter users based on their
tweets and profile image, without requiring any user-level annotations for
training. We find that the deep LLP approach outperforms baselines for both
text and image features separately. Additionally, we find that co-training
algorithm improves image and text classification by 4% and 8% absolute F1,
respectively. Finally, an ensemble of text and image classifiers further
improves the absolute F1 measure by 4% on average
Classification without labels: Learning from mixed samples in high energy physics
Modern machine learning techniques can be used to construct powerful models
for difficult collider physics problems. In many applications, however, these
models are trained on imperfect simulations due to a lack of truth-level
information in the data, which risks the model learning artifacts of the
simulation. In this paper, we introduce the paradigm of classification without
labels (CWoLa) in which a classifier is trained to distinguish statistical
mixtures of classes, which are common in collider physics. Crucially, neither
individual labels nor class proportions are required, yet we prove that the
optimal classifier in the CWoLa paradigm is also the optimal classifier in the
traditional fully-supervised case where all label information is available.
After demonstrating the power of this method in an analytical toy example, we
consider a realistic benchmark for collider physics: distinguishing quark-
versus gluon-initiated jets using mixed quark/gluon training samples. More
generally, CWoLa can be applied to any classification problem where labels or
class proportions are unknown or simulations are unreliable, but statistical
mixtures of the classes are available.Comment: 18 pages, 5 figures; v2: intro extended and references added; v3:
additional discussion to match JHEP versio
- …