Data Programming using Continuous and Quality-Guided Labeling Functions
Scarcity of labeled data is a bottleneck for supervised learning models. A
paradigm that has evolved for dealing with this problem is data programming. An
existing data programming paradigm allows human supervision to be provided as a
set of discrete labeling functions (LFs) that assign possibly noisy labels to
input instances, and a generative model for consolidating the weak labels. We
enhance and generalize this paradigm by supporting functions that output a
continuous score (instead of a hard label) that noisily correlates with labels.
We show across five applications that continuous LFs are more natural to
program and lead to improved recall. We also show that accuracy of existing
generative models is unstable with respect to initialization, training epochs,
and learning rates. We give control to the data programmer to guide the
training process by providing intuitive quality guides with each LF. We propose
an elegant method of incorporating these guides into the generative model. Our
overall method, called CAGE, makes the data programming paradigm more reliable
than other tricks based on initialization, sign-penalties, or soft-accuracy
constraints.
Comment: Accepted paper at the 34th AAAI Conference on Artificial Intelligence
(AAAI-20), New York, US
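The contrast between discrete and continuous labeling functions can be sketched with a toy spam-detection example (an illustrative sketch only, not the CAGE implementation; the function names, word list, and label constants are invented for this example):

```python
import re

SPAM, ABSTAIN = 1, -1

def discrete_lf(text: str) -> int:
    """Discrete LF: outputs a hard label or abstains."""
    return SPAM if "free money" in text.lower() else ABSTAIN

def continuous_lf(text: str) -> float:
    """Continuous LF: outputs a score in [0, 1] that noisily
    correlates with the SPAM class (fraction of spammy tokens)."""
    spam_words = {"free", "money", "winner", "prize"}
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in spam_words for t in tokens) / len(tokens)

print(discrete_lf("Claim your free money now"))    # hard label
print(continuous_lf("Claim your free money now"))  # graded score
```

A continuous LF like this lets the programmer express partial evidence (e.g. a similarity or fraction score) instead of forcing a binary decision at the rule level, which is what the abstract argues makes such functions more natural to program.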
Knodle: Modular Weakly Supervised Learning with PyTorch
Strategies for improving the training and prediction quality of weakly
supervised machine learning models vary in how much they are tailored to a
specific task or integrated with a specific model architecture. In this work,
we introduce Knodle, a software framework that treats weak data annotations,
deep learning models, and methods for improving weakly supervised training as
separate, modular components. This modularization gives the training process
access to fine-grained information such as data set characteristics, matches of
heuristic rules, or elements of the deep learning model ultimately used for
prediction. Hence, our framework can encompass a wide range of training methods
for improving weak supervision, ranging from methods that only look at
correlations of rules and output classes (independently of the machine learning
model trained with the resulting labels), to those that harness the interplay
of neural networks and weakly labeled data. We illustrate the benchmarking
potential of the framework with a performance comparison of several reference
implementations on a selection of datasets that are already available in
Knodle.
The framework is published as an open-source Python package, knodle, and is
available at https://github.com/knodle/knodle