OBOE: Collaborative Filtering for AutoML Model Selection
Algorithm selection and hyperparameter tuning remain two of the most
challenging tasks in machine learning. Automated machine learning (AutoML)
seeks to automate these tasks to enable widespread use of machine learning by
non-experts. This paper introduces OBOE, a collaborative filtering method for
time-constrained model selection and hyperparameter tuning. OBOE forms a matrix
of the cross-validated errors of a large number of supervised learning models
(algorithms together with hyperparameters) on a large number of datasets, and
fits a low rank model to learn the low-dimensional feature vectors for the
models and datasets that best predict the cross-validated errors. To find
promising models for a new dataset, OBOE runs a set of fast but informative
algorithms on the new dataset and uses their cross-validated errors to infer
the feature vector for the new dataset. OBOE can find good models under
constraints on the number of models fit or the total time budget. To this end,
this paper develops a new heuristic for active learning in time-constrained
matrix completion based on optimal experiment design. Our experiments
demonstrate that OBOE delivers state-of-the-art performance faster than
competing approaches on a test bed of supervised learning problems. Moreover,
the success of the bilinear model used by OBOE suggests that AutoML may be
simpler than was previously understood.
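The low-rank idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a fully observed error matrix, uses a truncated SVD as the low-rank fit, and infers a new dataset's feature vector by ordinary least squares from a few observed errors. The function names `fit_low_rank` and `predict_new_dataset` are hypothetical, and the models to run on the new dataset are simply given as input rather than chosen by the experiment-design heuristic the paper develops.

```python
import numpy as np

def fit_low_rank(E, k):
    """Factor an error matrix E (datasets x models) as E ~= X.T @ Y.

    Assumption: E is fully observed, so a truncated SVD suffices.
    Returns X (k x n_datasets) and Y (k x n_models).
    """
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    root = np.sqrt(s[:k])
    X = (U[:, :k] * root).T          # dataset feature vectors as columns
    Y = root[:, None] * Vt[:k]       # model feature vectors as columns
    return X, Y

def predict_new_dataset(Y, observed_idx, observed_errors):
    """Infer a new dataset's feature vector from a few observed errors.

    observed_idx: indices of the models actually run on the new dataset.
    observed_errors: their cross-validated errors on that dataset.
    Returns predicted errors for every model.
    """
    Y_obs = Y[:, observed_idx]                       # k x m
    x_new, *_ = np.linalg.lstsq(Y_obs.T, observed_errors, rcond=None)
    return x_new @ Y                                 # length n_models
```

A typical use would take `np.argmin` of the predicted errors to pick the next model to fit; when the number of observed models is smaller than `k`, the least-squares step returns the minimum-norm solution, so predictions are less reliable.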
Digging into acceptor splice site prediction : an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights into the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
Separation of pulsar signals from noise with supervised machine learning algorithms
We evaluate the performance of four different machine learning (ML)
algorithms, an Artificial Neural Network Multi-Layer Perceptron (ANN MLP),
Adaboost, Gradient Boosting Classifier (GBC), and XGBoost, for the separation of
pulsars from radio frequency interference (RFI) and other sources of noise,
using a dataset obtained from the post-processing of a pulsar search pipeline.
This dataset was previously used for cross-validation of the SPINN-based
machine learning engine, used for the reprocessing of HTRU-S survey data
arXiv:1406.3627. We have used Synthetic Minority Over-sampling Technique
(SMOTE) to deal with high class imbalance in the dataset. We report a variety
of quality scores from all four of these algorithms on both the non-SMOTE and
SMOTE datasets. For all the above ML methods, we report high accuracy and
G-mean in both the non-SMOTE and SMOTE cases. We study the feature importances
using Adaboost, GBC, and XGBoost and also from the minimum Redundancy Maximum
Relevance approach to report algorithm-agnostic feature ranking. From these
methods, we find the signal-to-noise ratio of the folded profile to be the best
feature. We find that all the ML algorithms report FPRs about an order of
magnitude lower than the corresponding FPRs obtained in arXiv:1406.3627, for
the same recall value.
Comment: 14 pages, 2 figures. Accepted for publication in Astronomy and
Computing
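The quality scores named in the abstract above can be computed from a confusion matrix. The sketch below assumes the usual definitions, which the abstract itself does not spell out: G-mean as the geometric mean of recall (true positive rate) and specificity (true negative rate), and FPR as the fraction of negatives classified as positive. The function names and the counts in the usage note are illustrative only and are not taken from the paper.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of recall and specificity (assumed definition)."""
    recall = tp / (tp + fn)            # fraction of pulsars recovered
    specificity = tn / (tn + fp)       # fraction of non-pulsars rejected
    return math.sqrt(recall * specificity)

def false_positive_rate(tn, fp):
    """Fraction of non-pulsar candidates wrongly labelled as pulsars."""
    return fp / (fp + tn)
```

For example, `g_mean(tp=90, fn=10, tn=990, fp=10)` combines a recall of 0.9 with a specificity of 0.99. Unlike plain accuracy, G-mean drops to zero when either class is entirely misclassified, which is why it is often reported for highly imbalanced data.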
Learning Active Learning from Data
In this paper, we suggest a novel data-driven approach to active learning
(AL). The key idea is to train a regressor that predicts the expected error
reduction for a candidate sample in a particular learning state. By formulating
the query selection procedure as a regression problem we are not restricted to
working with existing AL heuristics; instead, we learn strategies based on
experience from previous AL outcomes. We show that a strategy can be learnt
either from simple synthetic 2D datasets or from a subset of domain-specific
data. Our method yields strategies that work well on real data from a wide
range of domains.