Frugal Optimization for Cost-related Hyperparameters
The increasing demand for democratizing machine learning algorithms calls for
hyperparameter optimization (HPO) solutions at low cost. Many machine learning
algorithms have hyperparameters which can cause a large variation in the
training cost. But this effect is largely ignored in existing HPO methods,
which are incapable of properly controlling cost during the optimization process.
To address this problem, we develop a new cost-frugal HPO solution. The core of
our solution is a simple but new randomized direct-search method, for which we
prove a convergence rate and an approximation guarantee on the total cost. We provide
strong empirical results in comparison with state-of-the-art HPO methods on
large AutoML benchmarks.
Comment: 29 pages (including supplementary appendix)
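As a rough illustration of what a randomized direct-search step looks like, the sketch below samples a random unit direction, tries a move along it and then against it, and shrinks the step size after repeated failures. This is a generic textbook-style scheme written for this listing, not the method from the paper; the function names, the `patience` rule, and the `shrink` factor are all assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


def randomized_direct_search(loss, x0, step=1.0, iters=200, shrink=0.5,
                             patience=None, seed=0):
    """Minimize `loss` using only function evaluations (no gradients)."""
    rng = random.Random(seed)
    dim = len(x0)
    patience = patience or 2 * dim  # assumed rule: shrink after 2*dim failures
    x, fx = list(x0), loss(x0)
    fails = 0
    for _ in range(iters):
        u = random_unit_vector(dim, rng)
        for sign in (1.0, -1.0):  # try x + step*u, then x - step*u
            cand = [xi + sign * step * ui for xi, ui in zip(x, u)]
            fc = loss(cand)
            if fc < fx:
                x, fx, fails = cand, fc, 0
                break
        else:  # neither direction improved the loss
            fails += 1
            if fails >= patience:
                step *= shrink
                fails = 0
    return x, fx


def sphere(x):
    return sum(v * v for v in x)


best_x, best_loss = randomized_direct_search(sphere, [3.0, -2.0])
```

In a cost-aware setting one would presumably start `x0` at a cheap configuration so that early evaluations stay inexpensive; that choice is left to the caller here.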
OpenDataVal: a Unified Benchmark for Data Valuation
Assessing the quality and impact of individual data points is critical for
improving model performance and mitigating undesirable biases within the
training dataset. Several data valuation algorithms have been proposed to
quantify data quality, but a systematic and standardized benchmarking
system for data valuation is still lacking. In this paper, we introduce
OpenDataVal, an easy-to-use and unified benchmark framework that empowers
researchers and practitioners to apply and compare various data valuation
algorithms. OpenDataVal provides an integrated environment that includes (i) a
diverse collection of image, natural language, and tabular datasets, (ii)
implementations of eleven different state-of-the-art data valuation algorithms,
and (iii) a prediction model API that can import any models in scikit-learn.
Furthermore, we propose four downstream machine learning tasks for evaluating
the quality of data values. We perform benchmarking analysis using OpenDataVal,
quantifying and comparing the efficacy of state-of-the-art data valuation
approaches. We find that no single algorithm performs uniformly best across all
tasks, and an appropriate algorithm should be employed for a user's downstream
task. OpenDataVal is publicly available at https://opendataval.github.io with
comprehensive documentation. Furthermore, we provide a leaderboard where
researchers can evaluate the effectiveness of their own data valuation
algorithms.
Comment: 25 pages, NeurIPS 2023 Track on Datasets and Benchmarks
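To make the idea of a data value concrete, here is a minimal sketch of leave-one-out valuation, one classic baseline: a point's value is the drop in validation utility when it is removed from the training set. It does not use the OpenDataVal API; the 1-nearest-neighbor utility on one-dimensional points and all function names are assumptions chosen to keep the example self-contained.

```python
def nearest_neighbor_accuracy(train, valid):
    """Accuracy of a 1-nearest-neighbor classifier on 1-D (x, label) pairs."""
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        _, pred = min(train, key=lambda p: abs(p[0] - x))
        correct += (pred == y)
    return correct / len(valid)


def leave_one_out_values(train, valid, utility=nearest_neighbor_accuracy):
    """Value of point i = utility(all points) - utility(all points except i)."""
    full = utility(train, valid)
    return [full - utility(train[:i] + train[i + 1:], valid)
            for i in range(len(train))]


# Toy data: the last training point is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (0.9, 1)]
valid = [(0.5, 0), (1.2, 0), (4.5, 1)]
values = leave_one_out_values(train, valid)
```

On this toy data the mislabeled point receives a negative value, since removing it raises validation accuracy, while the clean point at 1.0 receives a positive one.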
Scalable Nonlinear Learning with Adaptive Polynomial Expansions
Can we effectively learn a nonlinear representation in time comparable to
linear learning? We describe a new algorithm that explicitly and adaptively
expands higher-order interaction features over base linear representations. The
algorithm is designed for extreme computational efficiency, and an extensive
experimental study shows that its computation/prediction tradeoff ability
compares very favorably against strong baselines.
Comment: To appear in NIPS 201
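One way to picture expanding interaction features over a base linear representation is sketched below: rank the base features by the magnitude of their linear weights and append products of the top few. The abstract does not state how the paper selects features to expand, so the top-k-by-weight rule and the name `expand_interactions` are assumptions made purely for illustration.

```python
from itertools import combinations_with_replacement


def expand_interactions(rows, weights, top_k=2):
    """Append degree-2 products of the top_k features with largest |weight|."""
    ranked = sorted(range(len(weights)),
                    key=lambda j: abs(weights[j]), reverse=True)
    chosen = sorted(ranked[:top_k])  # assumed selection heuristic
    pairs = list(combinations_with_replacement(chosen, 2))
    expanded = [list(r) + [r[a] * r[b] for a, b in pairs] for r in rows]
    return expanded, pairs


rows = [[1.0, 2.0, 3.0]]
weights = [0.1, -2.0, 0.5]  # weights of an already-trained linear model
expanded, pairs = expand_interactions(rows, weights)
```

Restricting the expansion to a few heavily weighted features keeps the feature count from growing quadratically in the number of base features, which is the kind of computation/prediction tradeoff the abstract refers to.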
OpenFE: Automated Feature Generation beyond Expert-level Performance
The goal of automated feature generation is to liberate machine learning
experts from the laborious task of manual feature generation, which is crucial
for improving the learning performance of tabular data. The major challenge in
automated feature generation is to efficiently and accurately identify useful
features from a vast pool of candidate features. In this paper, we present
OpenFE, an automated feature generation tool that provides competitive results
against machine learning experts. OpenFE achieves efficiency and accuracy with
two components: 1) a novel feature boosting method for accurately estimating
the incremental performance of candidate features, and 2) a feature-scoring
framework for retrieving effective features from a large number of candidates
through successive featurewise halving and feature importance attribution.
Extensive experiments on seven benchmark datasets show that OpenFE outperforms
existing baseline methods. We further evaluate OpenFE in two famous Kaggle
competitions with thousands of data science teams participating. In one of the
competitions, features generated by OpenFE with a simple baseline model can
beat 99.3% of data science teams. In addition to the empirical results, we
provide a theoretical perspective to show that feature generation is beneficial
in a simple yet representative setting. The code is available at
https://github.com/ZhangTP1996/OpenFE.
Comment: 23 pages, 3 figures
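The successive halving idea mentioned above can be sketched as follows: score every candidate feature on a small slice of the data, keep the better half, double the slice, and repeat until one candidate remains. This is not the OpenFE code; the scoring function here is plain absolute correlation with the target, a stand-in for the paper's feature boosting estimate, and all names are hypothetical.

```python
def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def successive_halving(candidates, target, start_rows=4, score=abs_correlation):
    """candidates maps a feature name to a list of values aligned with target."""
    alive = list(candidates)
    rows = min(start_rows, len(target))
    while len(alive) > 1:
        ranked = sorted(
            alive,
            key=lambda name: score(candidates[name][:rows], target[:rows]),
            reverse=True,
        )
        alive = ranked[:max(1, len(alive) // 2)]  # keep the better half
        rows = min(len(target), rows * 2)         # give survivors more data
    return alive[0]


target = [1, 2, 3, 4, 5, 6, 7, 8]
candidates = {
    "double": [2 * t for t in target],
    "noise": [5, 1, 4, 1, 5, 9, 2, 6],
    "const": [1] * 8,
    "other": [3, 3, 1, 4, 2, 2, 5, 1],
}
winner = successive_halving(candidates, target)
```

Weak candidates are discarded after being scored on only a few rows, so most of the evaluation budget goes to the features that survive.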
Tiny Classifier Circuits: Evolving Accelerators for Tabular Data
A typical machine learning (ML) development cycle for edge computing is to
maximise the performance during model training and then minimise the
memory/area footprint of the trained model for deployment on edge devices
targeting CPUs, GPUs, microcontrollers, or custom hardware accelerators. This
paper proposes a methodology for automatically generating predictor circuits
for classification of tabular data with comparable prediction performance to
conventional ML techniques while using substantially fewer hardware resources
and power. The proposed methodology uses an evolutionary algorithm to search
over the space of logic gates and automatically generates a classifier circuit
with maximised training prediction accuracy. Classifier circuits are so tiny
(i.e., consisting of no more than 300 logic gates) that they are called "Tiny
Classifier" circuits, and can efficiently be implemented in ASIC or on an FPGA.
We empirically evaluate the automatic Tiny Classifier circuit generation
methodology or "Auto Tiny Classifiers" on a wide range of tabular datasets, and
compare it against conventional ML techniques such as Amazon's AutoGluon,
Google's TabNet and a neural search over Multi-Layer Perceptrons. Despite Tiny
Classifiers being constrained to a few hundred logic gates, we observe no
statistically significant difference in prediction performance in comparison to
the best-performing ML baseline. When synthesised as a silicon chip, Tiny
Classifiers use 8-18x less area and 4-8x less power. When implemented as an
ultra-low cost chip on a flexible substrate (i.e., FlexIC), they occupy 10-75x
less area and consume 13-75x less power compared to the most hardware-efficient
ML baseline. On an FPGA, Tiny Classifiers consume 3-11x fewer resources.
Comment: 14 pages, 16 figures
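A toy version of evolutionary search over logic gates is sketched below. It is not the paper's methodology: it assumes a feed-forward circuit made only of two-input NAND gates, a (1+1) evolution strategy that mutates one gate per generation, and training accuracy as fitness. The gate encoding and every function name are illustrative assumptions.

```python
import random


def evaluate(circuit, inputs):
    """Each gate (a, b) reads two earlier signals and outputs their NAND."""
    signals = list(inputs)
    for a, b in circuit:
        signals.append(1 - (signals[a] & signals[b]))
    return signals[-1]  # the last gate drives the classifier output


def accuracy(circuit, dataset):
    return sum(evaluate(circuit, x) == y for x, y in dataset) / len(dataset)


def random_gate(position, n_inputs, rng):
    limit = n_inputs + position  # may read the inputs or any earlier gate
    return (rng.randrange(limit), rng.randrange(limit))


def evolve(dataset, n_gates=6, generations=500, seed=0):
    rng = random.Random(seed)
    n_inputs = len(dataset[0][0])
    parent = [random_gate(i, n_inputs, rng) for i in range(n_gates)]
    best = accuracy(parent, dataset)
    for _ in range(generations):
        child = list(parent)
        i = rng.randrange(n_gates)
        child[i] = random_gate(i, n_inputs, rng)  # mutate a single gate
        score = accuracy(child, dataset)
        if score >= best:  # accept improvements and neutral moves
            parent, best = child, score
    return parent, best


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
circuit, train_acc = evolve(xor_data)
```

Accepting neutral mutations lets the search drift across circuits of equal accuracy, which is a common design choice in this style of search and is assumed here rather than taken from the paper.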
Learning of classification models from group-based feedback
Learning of classification models in practice often relies on a nontrivial amount of human annotation effort. The most widely adopted human labeling process assigns class labels to individual data instances. However, such a process is very rigid and may end up being very time-consuming and costly to conduct in practice. Finding more effective ways to reduce human annotation effort has become critical for building machine learning systems that require human feedback.
In this thesis, we propose and investigate a new machine learning approach - Group-Based Active Learning - to learn classification models from limited human feedback. A group is defined by a set of instances represented by conjunctive patterns that are value ranges over the input features. Such conjunctive patterns define hypercubic regions of the input data space. A human annotator assesses the group solely based on its region-based description by providing an estimate of the class proportion for the subpopulation covered by the region. The advantage of this labeling process is that it allows a human to label many instances at the same time, which can, in turn, improve the labeling efficiency.
In general, there are infinitely many regions one can define over a real-valued input space. To identify and label groups/regions important for classification learning, we propose and develop a Hierarchical Active Learning framework that actively builds and labels a hierarchy of input regions. Briefly, our framework starts by identifying general regions covering substantial portions of the input data space. After that, it progressively splits the regions into smaller and smaller sub-regions and also acquires class proportion labels for the new regions. The proportion labels for these regions are used to gradually improve and refine a classification model induced by the regions. We develop three versions of the idea. The first two versions aim to build a single hierarchy of regions. One builds it statically using hierarchical clustering, while the other builds it dynamically, similarly to the decision tree learning process. The third approach builds multiple hierarchies simultaneously, and it offers additional flexibility for identifying more informative and simpler regions. We have conducted comprehensive empirical studies to evaluate our framework. The results show that methods based on region-based active learning can learn very good classifiers from very few, simple region queries, and hence are promising for reducing the human annotation effort needed to build a variety of classification models.
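The region hierarchy described above can be pictured with a small sketch: a region is a conjunction of value ranges, an oracle returns the class proportion inside it, and impure regions are split into sub-regions that are queried in turn. The thesis abstract does not give the splitting criterion, so the midpoint split on the widest range, the breadth-first query order, and the simulated annotator below are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    bounds: tuple  # one half-open (low, high) range per input feature

    def contains(self, x):
        return all(lo <= v < hi for v, (lo, hi) in zip(x, self.bounds))

    def split(self):
        """Halve the widest feature range (assumed splitting rule)."""
        j = max(range(len(self.bounds)),
                key=lambda k: self.bounds[k][1] - self.bounds[k][0])
        lo, hi = self.bounds[j]
        mid = (lo + hi) / 2
        left = self.bounds[:j] + ((lo, mid),) + self.bounds[j + 1:]
        right = self.bounds[:j] + ((mid, hi),) + self.bounds[j + 1:]
        return Region(left), Region(right)


def simulated_proportion(region, labeled):
    """Stand-in for the human annotator: fraction of positives in the region."""
    inside = [y for x, y in labeled if region.contains(x)]
    return sum(inside) / len(inside) if inside else None


def refine(root, oracle, max_queries=7):
    """Query regions breadth-first; split any region with a mixed proportion."""
    queue, labeled_regions = [root], []
    while queue and len(labeled_regions) < max_queries:
        region = queue.pop(0)
        p = oracle(region)
        if p is None:  # empty region: nothing to label
            continue
        labeled_regions.append((region, p))
        if 0.0 < p < 1.0:
            queue.extend(region.split())
    return labeled_regions


points = [((0.1, 0.5), 0), ((0.2, 0.5), 0), ((0.8, 0.5), 1), ((0.9, 0.5), 1)]
root = Region(((0.0, 1.0), (0.0, 1.0)))
hierarchy = refine(root, lambda r: simulated_proportion(r, points))
```

On this toy data three queries suffice: the root is mixed, and each of its two halves is pure, so four instances are effectively labeled with three region-level answers.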