
    Good Data from Bad Models: Foundations of Threshold-based Auto-labeling

    Creating large-scale, high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a confidence threshold above which data is machine-labeled, is emerging as a popular solution widely used in practice. Given the long shelf life and diverse usage of the resulting datasets, it is crucial to understand when the data obtained by such auto-labeling systems can be relied on. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required to guarantee the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
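
    To make the mechanism concrete, below is a minimal sketch of one round of threshold-based auto-labeling, assuming a trained classifier that exposes predict_proba and a small human-labeled validation set. The function names and the fixed target accuracy are illustrative assumptions; the paper's actual procedure selects thresholds with formal sample complexity guarantees that this simplification omits.

```python
import numpy as np

def find_threshold(val_conf, val_correct, target_acc=0.95):
    # Smallest confidence threshold whose validation accuracy, measured over
    # the points at or above it, still meets target_acc (illustrative choice).
    # val_conf:    model confidence per validation point
    # val_correct: 1 if the model's label matched the human label, else 0
    order = np.argsort(val_conf)[::-1]           # most confident first
    conf = np.asarray(val_conf)[order]
    correct = np.asarray(val_correct)[order]
    prefix_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    ok = np.where(prefix_acc >= target_acc)[0]
    if len(ok) == 0:
        return None                              # no threshold meets the target
    return conf[ok[-1]]                          # loosest qualifying threshold

def auto_label(model, X_unlabeled, threshold):
    # Machine-label only the points whose confidence clears the threshold;
    # everything below it is left for human labeling.
    proba = model.predict_proba(X_unlabeled)
    conf = proba.max(axis=1)
    mask = conf >= threshold
    return mask, model.classes_[proba.argmax(axis=1)[mask]]
```

    The second insight of the paper is visible here: every call to find_threshold consumes fresh human-labeled validation points, so in an iterative pipeline the validation cost can come to dominate the labeling budget.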

    Improving deep forest by confidence screening

    Most studies of deep learning are based on neural network models, in which many layers of parameterized, nonlinear, differentiable modules are trained by backpropagation. Recently, it has been shown that deep learning can also be realized by non-differentiable modules without backpropagation training, in a model called deep forest. Its representation learning process is based on a cascade of cascades of decision tree forests, where high memory requirements and high time costs inhibit the training of large models. In this paper, we propose a simple yet effective approach to improve the efficiency of deep forest. The key idea is to pass instances with high confidence directly to the final stage rather than through all the levels. We also provide a theoretical analysis suggesting a means to vary the model complexity from low to high as the level increases in the cascade, which further reduces the memory requirement and time cost. Our experiments show that the proposed approach achieves highly competitive predictive performance while reducing time cost and memory requirement by up to one order of magnitude.
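
    A minimal sketch of the confidence-screening idea follows, assuming scikit-learn random forests as the cascade levels. The class name and threshold value are illustrative, and two simplifications depart from the actual deep forest: class-probability features are computed in-sample here rather than by k-fold cross-validation, and screening is applied only at prediction time.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ConfidenceScreenedCascade:
    # Illustrative cascade: at each level, instances whose top class
    # probability reaches screen_threshold are finalized immediately
    # instead of passing through the remaining levels.

    def __init__(self, n_levels=3, screen_threshold=0.9, **forest_kwargs):
        self.screen_threshold = screen_threshold
        self.levels = [RandomForestClassifier(**forest_kwargs)
                       for _ in range(n_levels)]

    def fit(self, X, y):
        features = X
        for forest in self.levels:
            forest.fit(features, y)
            # In-sample probabilities for simplicity; the real deep forest
            # uses k-fold cross-validated class vectors to avoid overfitting.
            proba = forest.predict_proba(features)
            features = np.hstack([X, proba])   # original features + class vector
        return self

    def predict(self, X):
        pred = np.full(X.shape[0], -1, dtype=int)   # assumes integer labels
        active = np.arange(X.shape[0])              # instances still in cascade
        features = X
        for i, forest in enumerate(self.levels):
            proba = forest.predict_proba(features)
            conf = proba.max(axis=1)
            # Finalize confident instances; the last level finalizes the rest.
            done = (conf >= self.screen_threshold) | (i == len(self.levels) - 1)
            pred[active[done]] = forest.classes_[proba[done].argmax(axis=1)]
            active = active[~done]
            if active.size == 0:
                break
            features = np.hstack([X[active], proba[~done]])
        return pred
```

    Usage would be cascade = ConfidenceScreenedCascade(n_estimators=50).fit(X_train, y_train) followed by cascade.predict(X_test). The efficiency gain comes from the shrinking active set: later, potentially larger levels only ever see the hard instances, which is also what makes the low-to-high complexity schedule described in the abstract affordable.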