Good Data from Bad Models: Foundations of Threshold-based Auto-labeling
Creating large-scale high-quality labeled datasets is a major bottleneck in
supervised machine learning workflows. Auto-labeling systems are a promising
way to reduce reliance on manual labeling for dataset construction.
Threshold-based auto-labeling, where validation data obtained from humans is
used to find a threshold for confidence above which the data is
machine-labeled, is emerging as a popular solution used widely in practice.
Given the long shelf-life and diverse usage of the resulting datasets,
understanding when the data obtained by such auto-labeling systems can be
relied on is crucial. In this work, we analyze threshold-based auto-labeling
systems and derive sample complexity bounds on the amount of human-labeled
validation data required for guaranteeing the quality of machine-labeled data.
Our results provide two insights. First, reasonable chunks of the unlabeled
data can be automatically and accurately labeled by seemingly bad models.
Second, a hidden downside of threshold-based auto-labeling systems is
potentially prohibitive validation data usage. Together, these insights
describe the promise and pitfalls of using such systems. We validate our
theoretical guarantees with simulations and study the efficacy of
threshold-based auto-labeling on real datasets.
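The threshold-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function name, the target-accuracy parameter, and the toy validation data are all assumptions introduced here.

```python
import numpy as np

def find_threshold(val_conf, val_correct, target_acc=0.9):
    """Return the smallest confidence threshold such that, on the
    human-labeled validation set, the model's accuracy among points
    at or above that threshold meets the target; None if no
    threshold qualifies."""
    for t in np.unique(val_conf):  # unique values, sorted ascending
        mask = val_conf >= t
        if val_correct[mask].mean() >= target_acc:
            return t
    return None

# Toy validation set: confidences in [0.5, 1], with correctness
# probability rising with confidence (an illustrative assumption).
rng = np.random.default_rng(0)
val_conf = rng.uniform(0.5, 1.0, 200)
val_correct = rng.uniform(size=200) < val_conf

t = find_threshold(val_conf, val_correct, target_acc=0.9)
```

Points in the unlabeled pool whose confidence exceeds `t` would then be machine-labeled; the sample-complexity question the paper studies is how much validation data is needed for the accuracy estimate behind this search to be trustworthy.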
Improving deep forest by confidence screening
Most studies of deep learning are based on neural network models, in which many layers of parameterized, nonlinear, differentiable modules are trained by backpropagation. Recently, it has been shown that deep learning can also be realized with non-differentiable modules and without backpropagation, in an approach called deep forest. Its representation learning process is based on a cascade of cascades of decision-tree forests, where the high memory requirement and time cost inhibit the training of large models. In this paper, we propose a simple yet effective approach to improve the efficiency of deep forest. The key idea is to pass instances with high confidence directly to the final stage rather than through all the levels. We also provide a theoretical analysis suggesting a means to vary the model complexity from low to high as the level in the cascade increases, which further reduces the memory requirement and time cost. Our experiments show that the proposed approach achieves highly competitive predictive performance while reducing the time cost and memory requirement by up to one order of magnitude.
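The confidence-screening idea can be sketched with a two-level cascade of random forests: instances whose level-1 predicted-class probability clears a threshold are finalized immediately, and only the rest are passed on with the level-1 probabilities appended as augmented features. This is an illustrative sketch, not the paper's deep-forest implementation; the 0.9 threshold, the use of in-sample probabilities for augmentation, and the synthetic data are assumptions made here for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

# Level 1 of the cascade.
level1 = RandomForestClassifier(n_estimators=50, random_state=0)
level1.fit(X_train, y_train)
proba1 = level1.predict_proba(X_test)

# Screening: high-confidence instances skip the rest of the cascade.
confident = proba1.max(axis=1) >= 0.9

# Level 2 sees the original features augmented with level-1 class
# probabilities (the cascade's learned representation).
X_aug = np.hstack([X_train, level1.predict_proba(X_train)])
level2 = RandomForestClassifier(n_estimators=50, random_state=1)
level2.fit(X_aug, y_train)

# Confident instances keep the level-1 prediction; the rest go deeper.
preds = proba1.argmax(axis=1)
hard = ~confident
if hard.any():
    preds[hard] = level2.predict(np.hstack([X_test[hard], proba1[hard]]))
```

Because screened instances never reach later levels, the deeper (and in the paper, higher-complexity) forests only process the hard residue, which is where the time and memory savings come from.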