Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
The correct use of model evaluation, model selection, and algorithm selection
techniques is vital in academic machine learning research as well as in many
industrial settings. This article reviews different techniques that can be used
for each of these three subtasks and discusses the main advantages and
disadvantages of each technique with references to theoretical and empirical
studies. Further, recommendations are given to encourage best yet feasible
practices in research and applications of machine learning. Common methods such
as the holdout method for model evaluation and selection are covered, which are
not recommended when working with small datasets. Different flavors of the
bootstrap technique are introduced for estimating the uncertainty of
performance estimates, as an alternative to confidence intervals via normal
approximation if bootstrapping is computationally feasible. Common
cross-validation techniques such as leave-one-out cross-validation and k-fold
cross-validation are reviewed, the bias-variance trade-off for choosing k is
discussed, and practical tips for the optimal choice of k are given based on
empirical evidence. Different statistical tests for algorithm comparisons are
presented, and strategies for dealing with multiple comparisons such as omnibus
tests and multiple-comparison corrections are discussed. Finally, alternative
methods for algorithm selection, such as the combined F-test 5x2
cross-validation and nested cross-validation, are recommended for comparing
machine learning algorithms when datasets are small. Comment: v2: minor typo fixe
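The k-fold cross-validation scheme reviewed in the abstract above can be illustrated with a minimal Python sketch. This is not taken from the article: the function name `k_fold_indices` and its `seed` parameter are hypothetical, and the sketch only produces the index splits, leaving model fitting and scoring to the caller. Setting `k` equal to the number of samples gives leave-one-out cross-validation.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not the article's code.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Shuffle once so that folds do not depend on the original sample order.
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
        print(len(train_idx), "train /", len(test_idx), "test")
```

Each sample appears in exactly one test fold, so averaging a metric over the k test folds uses every sample for evaluation once.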
Efficient Estimation of Generalization Error and Bias-Variance Components of Ensembles
For many applications, an ensemble of base classifiers is an effective
solution. The tuning of its parameters (number of classes, amount of data on
which each classifier is to be trained, etc.) requires G, the generalization
error of a given ensemble. The efficient estimation of G is the focus of this
paper. The key idea is to approximate the variance of the class
scores/probabilities of the base classifiers over the randomness imposed by the
training subset by normal/beta distribution at each point x in the input
feature space. We estimate the parameters of the distribution using a small set
of randomly chosen base classifiers and use those parameters to give efficient
estimation schemes for G. We give empirical evidence for the quality of the
various estimators. We also demonstrate their usefulness in making design
choices such as the number of classifiers in the ensemble and the size of a
subset of data used for training that is needed to achieve a certain value of
generalization error. Our approach also has great potential for designing
distributed ensemble classifiers. Comment: 12 Pages, 4 Figures, Under Review in SDM 201
PaloBoost: An Overfitting-robust TreeBoost with Out-of-Bag Sample Regularization Techniques
Stochastic Gradient TreeBoost is often found in many winning solutions in
public data science challenges. Unfortunately, the best performance requires
extensive parameter tuning and can be prone to overfitting. We propose
PaloBoost, a Stochastic Gradient TreeBoost model that uses novel regularization
techniques to guard against overfitting and is robust to parameter settings.
PaloBoost uses the under-utilized out-of-bag samples to perform gradient-aware
pruning and estimate adaptive learning rates. Unlike other Stochastic Gradient
TreeBoost models that use the out-of-bag samples to estimate test errors,
PaloBoost treats the samples as a second batch of training samples to prune the
trees and adjust the learning rates. As a result, PaloBoost can dynamically
adjust tree depths and learning rates to achieve faster learning at the start
and slower learning as the algorithm converges. We illustrate how these
regularization techniques can be efficiently implemented and propose a new
formula for calculating feature importance to reflect the node coverages and
learning rates. Extensive experimental results on seven datasets demonstrate
that PaloBoost is robust to overfitting, is less sensitive to the parameters,
and can also effectively identify meaningful features.
Penalized Split Criteria for Interpretable Trees
This paper describes techniques for growing classification and regression
trees designed to induce visually interpretable trees. This is achieved by
penalizing splits that extend the subset of features used in a particular
branch of the tree. After a brief motivation, we summarize existing methods and
introduce new ones, providing illustrative examples throughout. Using a number
of real classification and regression datasets, we find that these procedures
can offer more interpretable fits than the CART methodology with very modest
increases in out-of-sample loss. Comment: 25 page
A no-regret generalization of hierarchical softmax to extreme multi-label classification
Extreme multi-label classification (XMLC) is a problem of tagging an instance
with a small subset of relevant labels chosen from an extremely large pool of
possible labels. Large label spaces can be efficiently handled by organizing
labels as a tree, like in the hierarchical softmax (HSM) approach commonly used
for multi-class problems. In this paper, we investigate probabilistic label
trees (PLTs) that have been recently devised for tackling XMLC problems. We
show that PLTs are a no-regret multi-label generalization of HSM when
precision@k is used as a model evaluation metric. Critically, we prove that
the pick-one-label heuristic - a reduction technique from multi-label to
multi-class that is routinely used along with HSM - is not consistent in
general. We also show that our implementation of PLTs, referred to as
extremeText (XT), obtains significantly better results than HSM with the
pick-one-label heuristic and XML-CNN, a deep network specifically designed for
XMLC problems. Moreover, XT is competitive with many state-of-the-art approaches
in terms of statistical performance, model size and prediction time, which makes
it amenable to deployment in an online system. Comment: Accepted at NIPS 201
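For readers unfamiliar with the precision@k metric mentioned in the abstract above, a minimal Python sketch follows. It uses the standard definition (the fraction of the k highest-scoring labels that are relevant); the function name `precision_at_k` and the dictionary-based input format are assumptions made for illustration and are not taken from the paper or from extremeText.

```python
def precision_at_k(scores, relevant, k):
    """Return the fraction of the k highest-scoring labels that are relevant.

    scores:   dict mapping label -> predicted score (hypothetical input format)
    relevant: set of ground-truth labels for the instance
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank labels by descending score and keep the top k.
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for label in top_k if label in relevant)
    return hits / k


if __name__ == "__main__":
    scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
    relevant = {"a", "c"}
    for k in (1, 2, 3):
        print(k, precision_at_k(scores, relevant, k))
```

In an extreme multi-label setting the score dictionary would hold only the candidate labels returned by the model, since scoring every label is usually infeasible.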
Convex Formulation of Multiple Instance Learning from Positive and Unlabeled Bags
Multiple instance learning (MIL) is a variation of traditional supervised
learning problems where data (referred to as bags) are composed of sub-elements
(referred to as instances) and only bag labels are available. MIL has a variety
of applications such as content-based image retrieval, text categorization and
medical diagnosis. Most previous work on MIL assumes that the training
bags are fully labeled. However, it is often difficult to obtain a sufficient
number of labeled bags in practical situations, while many unlabeled bags are
available. A learning framework called PU learning (positive and unlabeled
learning) can address this problem. In this paper, we propose a convex PU
learning method to solve an MIL problem. We experimentally show that the
proposed method achieves better performance with significantly lower
computational costs than an existing method for PU-MIL.
Predicting the direction of stock market prices using random forest
Predicting trends in stock market prices has been an area of interest for
researchers for many years due to its complex and dynamic nature. Intrinsic
volatility in stock markets across the globe makes the task of prediction
challenging. Forecasting and diffusion modeling, although effective, cannot be
a panacea for the diverse range of problems encountered in prediction,
short-term or otherwise. Market risk, strongly correlated with forecasting
errors, needs to be minimized to ensure minimal risk in investment. The authors
propose to minimize forecasting error by treating the forecasting problem as a
classification problem, for which machine learning offers a popular suite of
algorithms. In this paper, we propose a novel way to minimize the risk of
investment in the stock market by predicting the returns of a stock using a
class of powerful machine learning algorithms known as ensemble learning.
Technical indicators such as the Relative Strength Index (RSI) and the
stochastic oscillator, among others, are used as inputs to train our model.
The learning model used is an ensemble of multiple decision trees. The
algorithm is shown to outperform existing algorithms found in the literature.
Out-of-bag (OOB) error estimates have been found to be encouraging.
Keywords: random forest classifier, stock price forecasting, exponential
smoothing, feature extraction, OOB error and convergence.
Towards Automatic Construction of Diverse, High-quality Image Dataset
The availability of labeled image datasets has been shown to be critical for
high-level image understanding, which continuously drives progress in
feature design and model development. However, constructing labeled image
datasets is laborious and monotonous. To eliminate manual annotation, in this
work, we propose a novel image dataset construction framework by employing
multiple textual queries. We aim at collecting diverse and accurate images for
given queries from the Web. Specifically, we formulate the removal of noisy
textual queries and the filtering of noisy images as a multi-view and a
multi-instance learning problem, respectively. Our proposed approach not only improves the accuracy but
also enhances the diversity of the selected images. To verify the effectiveness
of our proposed approach, we construct an image dataset with 100 categories.
The experiments show significant performance gains by using the generated data
of our approach on several tasks, such as image classification, cross-dataset
generalization, and object detection. The proposed method also consistently
outperforms existing weakly supervised and web-supervised approaches. Comment: Accepted by IEEE Transactions on Knowledge and Data Engineerin
Efficient Training on Very Large Corpora via Gramian Estimation
We study the problem of learning similarity functions over very large corpora
using neural network embedding models. These models are typically trained using
SGD with sampling of random observed and unobserved pairs, with a number of
samples that grows quadratically with the corpus size, making it expensive to
scale to very large corpora. We propose new efficient methods to train these
models without having to sample unobserved pairs. Inspired by matrix
factorization, our approach relies on adding a global quadratic penalty to all
pairs of examples and expressing this term as the matrix-inner-product of two
generalized Gramians. We show that the gradient of this term can be efficiently
computed by maintaining estimates of the Gramians, and develop variance
reduction schemes to improve the quality of the estimates. We conduct
large-scale experiments that show a significant improvement in training time
and generalization quality compared to traditional sampling methods.
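The Gramian reformulation described in the abstract above rests on a standard identity: the sum of squared inner products over all pairs, sum_{i,j} <u_i, v_j>^2, equals the matrix inner product of the two Gramians U^T U and V^T V. The Python sketch below checks this on a toy example. It is a simplified, unweighted illustration under that assumption, not the paper's estimator, and the function names are hypothetical.

```python
def gramian(rows):
    """Return the d x d Gramian (sum of outer products) of a list of d-dim rows."""
    d = len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]


def penalty_naive(U, V):
    """Sum of squared inner products over all |U| * |V| pairs."""
    return sum(sum(x * y for x, y in zip(u, v)) ** 2 for u in U for v in V)


def penalty_gramian(U, V):
    """Same quantity, computed as the elementwise inner product of two Gramians."""
    Gu, Gv = gramian(U), gramian(V)
    return sum(gu * gv for ru, rv in zip(Gu, Gv) for gu, gv in zip(ru, rv))


if __name__ == "__main__":
    U = [[1, 2], [3, 4], [0, 1]]  # toy left embeddings (made-up values)
    V = [[1, 0], [2, 1]]          # toy right embeddings (made-up values)
    print(penalty_naive(U, V), penalty_gramian(U, V))
```

The naive version costs time proportional to the number of pairs, while the Gramian version needs one pass over each embedding table plus a d x d product, which is what makes avoiding explicit sampling of unobserved pairs attractive.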
Effective Representations of Clinical Notes
Clinical notes are a rich source of information about patient state. However,
using them to predict clinical events with machine learning models is
challenging. They are very high-dimensional and sparse, and have complex structure.
Furthermore, training data is often scarce because it is expensive to obtain
reliable labels for many clinical events. These difficulties have traditionally
been addressed by manual feature engineering encoding task-specific domain
knowledge. We explored the use of neural networks and transfer learning to
learn representations of clinical notes that are useful for predicting future
clinical events of interest, such as all-cause mortality, inpatient
admissions, and emergency room visits. Our data comprised 2.7 million notes and
115 thousand patients at Stanford Hospital. We used the learned
representations, along with commonly used bag of words and topic model
representations, as features for predictive models of clinical events. We
evaluated the effectiveness of these representations with respect to the
performance of the models trained on small datasets. Models using the neural
network derived representations performed significantly better than models
using the baseline representations with small training datasets.
The learned representations offer significant performance gains over commonly
used baseline representations for a range of predictive modeling tasks and
cohort sizes, offering an effective alternative to task-specific feature
engineering when plentiful labeled training data is not available.