2,645 research outputs found
Estimating Optimal Active Learning via Model Retraining Improvement
A central question for active learning (AL) is: "what is the optimal
selection?" Defining optimality by classifier loss produces a new
characterisation of optimal AL behaviour, by treating expected loss reduction
as a statistical target for estimation. This target forms the basis of model
retraining improvement (MRI), a novel approach providing a statistical
estimation framework for AL. This framework is constructed to address the
central question of AL optimality, and to motivate the design of estimation
algorithms. MRI allows the exploration of optimal AL behaviour, and the
examination of AL heuristics, showing precisely how they make sub-optimal
selections. The abstract formulation of MRI is used to provide a new guarantee
for AL, that an unbiased MRI estimator should outperform random selection. This
MRI framework reveals intricate estimation issues that in turn motivate the
construction of new statistical AL algorithms. One new algorithm in particular
performs strongly in a large-scale experimental study, compared to standard AL
methods. This competitive performance suggests that practical efforts to
minimise estimation bias may be important for AL applications.
Comment: arXiv admin note: substantial text overlap with arXiv:1407.804
Whole-brain Prediction Analysis with GraphNet
Multivariate machine learning methods are increasingly used to analyze
neuroimaging data, often replacing more traditional "mass univariate"
techniques that fit data one voxel at a time. In the functional magnetic
resonance imaging (fMRI) literature, this has led to broad application of
"off-the-shelf" classification and regression methods. These generic approaches
allow investigators to use ready-made algorithms to accurately decode
perceptual, cognitive, or behavioral states from distributed patterns of neural
activity. However, when applied to correlated whole-brain fMRI data these
methods suffer from coefficient instability, are sensitive to outliers, and
yield dense solutions that are hard to interpret without arbitrary
thresholding. Here, we develop variants of the Graph-constrained Elastic
Net (GraphNet), ..., we (1) extend GraphNet to include robust loss functions
that confer insensitivity to outliers, (2) equip them with "adaptive" penalties
that asymptotically guarantee correct variable selection, and (3) develop a
novel sparse structured Support Vector GraphNet classifier (SVGN). When applied
to previously published data, these efficient whole-brain methods significantly
improved classification accuracy over previously reported VOI-based analyses on
the same data while discovering task-related regions not documented in the
original VOI approach. Critically, GraphNet estimates generalize well to
out-of-sample data collected more than three years later on the same task but
with different subjects and stimuli. By enabling robust and efficient selection
of important voxels from whole-brain data taken over multiple time points
(>100,000 "features"), these methods enable data-driven selection of brain
areas that accurately predict single-trial behavior within and across
individuals.
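To make the penalty structure concrete, here is a minimal sketch of the graph-constrained part of a GraphNet-style objective (an illustration under stated assumptions, not the authors' implementation): an L1 sparsity term plus a quadratic smoothness term beta^T L beta, where L is the Laplacian of a voxel-adjacency graph. Function names and the lambda values are hypothetical.

```python
# Sketch of a GraphNet-style penalty: lam1 * ||beta||_1 + lam_g * beta^T L beta,
# where L is the Laplacian of a voxel-adjacency graph. Illustrative only.

def laplacian(n_nodes, edges):
    """Dense graph Laplacian L = D - A for an undirected graph."""
    L = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def graphnet_penalty(beta, L, lam1, lam_g):
    """L1 sparsity term plus quadratic graph-smoothness term."""
    l1 = sum(abs(b) for b in beta)
    quad = sum(beta[i] * sum(L[i][j] * beta[j] for j in range(len(beta)))
               for i in range(len(beta)))
    return lam1 * l1 + lam_g * quad

# A chain graph 0-1-2 of adjacent voxels: smooth coefficients incur no
# quadratic cost, while sign-flipping coefficients pay heavily.
L = laplacian(3, [(0, 1), (1, 2)])
smooth = graphnet_penalty([1.0, 1.0, 1.0], L, lam1=0.1, lam_g=1.0)
rough = graphnet_penalty([1.0, -1.0, 1.0], L, lam1=0.1, lam_g=1.0)
```

On the chain graph, the quadratic term equals the sum of squared differences across edges, so coefficient maps that vary smoothly over adjacent voxels are preferred; this is what drives spatially coherent selection.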
State of the Art in Fair ML: From Moral Philosophy and Legislation to Fair Classifiers
Machine learning is becoming an ever-present part of our lives, as many
decisions, e.g. whether to grant credit, are no longer made by humans but by
machine learning algorithms. However, those decisions are often unfair,
discriminating against individuals belonging to protected groups based on race
or gender. With the recent General Data Protection Regulation (GDPR) coming
into effect, new awareness has been raised for such issues, and with computer
scientists having such a large impact on people's lives, it is necessary that
actions are taken to discover and prevent discrimination. This work aims to
give an introduction to discrimination, the legislative foundations to counter
it, and strategies to detect and prevent machine learning algorithms from
showing such behavior.
Optimal properties of centroid-based classifiers for very high-dimensional data
We show that scale-adjusted versions of the centroid-based classifier enjoy
optimal properties when used to discriminate between two very high-dimensional
populations where the principal differences are in location. The scale
adjustment removes the tendency of scale differences to confound differences in
means. Certain other distance-based methods, for example, those founded on
nearest-neighbor distance, do not have optimal performance in the sense that we
propose. Our results permit varying degrees of sparsity and signal strength to
be treated, and require only mild conditions on dependence of vector
components. Additionally, we permit the marginal distributions of vector
components to vary extensively. In addition to providing theory we explore
numerical properties of a centroid-based classifier, and show that these
features reflect theoretical accounts of performance.
Comment: Published at http://dx.doi.org/10.1214/09-AOS736 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
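The scale-adjustment idea can be sketched as follows (an illustration only, not the paper's exact estimator): each squared distance to a class centroid is reduced by a crude estimate of that centroid's own sampling noise, so that unequal class sizes and scales do not masquerade as location differences. All names and the noise estimate below are hypothetical.

```python
# Illustrative centroid classifier with a simple scale adjustment:
# score(x, k) = ||x - c_k||^2 - (estimate of E||c_hat_k - c_k||^2).

def centroid(rows):
    n, p = len(rows), len(rows[0])
    return [sum(r[d] for r in rows) / n for d in range(p)]

def sq_dist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def noise_term(rows, c):
    """Crude estimate of the centroid's sampling variability:
    average within-class squared distance to the centroid, divided by n."""
    n = len(rows)
    return sum(sq_dist(r, c) for r in rows) / (n * n)

def classify(x, class_rows):
    scores = {}
    for label, rows in class_rows.items():
        c = centroid(rows)
        scores[label] = sq_dist(x, c) - noise_term(rows, c)
    return min(scores, key=scores.get)

data = {
    "a": [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]],
    "b": [[2.0, 2.0], [2.1, 1.9], [1.9, 2.2]],
}
print(classify([0.1, 0.0], data))  # -> "a"
```

Without the subtracted term, a class with few or noisy training points would look artificially far from every test point; the adjustment removes that scale confound.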
Non-uniform Feature Sampling for Decision Tree Ensembles
We study the effectiveness of non-uniform randomized feature selection in
decision tree classification. We experimentally evaluate two feature selection
methodologies, based on information extracted from the provided dataset:
\emph{leverage scores-based} and \emph{norm-based} feature selection.
Experimental evaluation of the proposed feature selection techniques indicates
that such approaches can be more effective than naive uniform feature
selection, while achieving performance comparable to the random forest
algorithm [3].
Comment: 7 pages, 7 figures, 1 table
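A minimal sketch of the norm-based variant (leverage-score sampling would replace the column norms with statistical leverage scores; the data and function names here are illustrative): each feature is drawn with probability proportional to its column norm rather than uniformly.

```python
# Norm-based feature sampling sketch: draw features with probability
# proportional to their column norms instead of uniformly at random.
import random

def column_norms(X):
    p = len(X[0])
    return [sum(row[j] ** 2 for row in X) ** 0.5 for j in range(p)]

def sample_features(X, k, rng):
    norms = column_norms(X)
    total = sum(norms)
    # Draw k feature indices with replacement, weighted by column norm.
    return rng.choices(range(len(norms)), weights=norms, k=k)

# Column 0 dominates in norm, column 2 is identically zero.
X = [[10.0, 0.1, 0.0],
     [9.0, -0.2, 0.0],
     [11.0, 0.1, 0.0]]
rng = random.Random(0)
picks = sample_features(X, 100, rng)
```

Under this scheme the all-zero feature is never selected and the high-norm feature dominates the sample, which is the intended contrast with uniform sampling.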
Temporally-aware algorithms for the classification of anuran sounds
Several authors have shown that the sounds of anurans can be used as an indicator of
climate change. Hence, the recording, storage and further processing of a huge
number of anuran sounds, distributed over time and space, are required in order to
obtain this indicator. Furthermore, it is desirable to have algorithms and tools for
the automatic classification of the different classes of sounds. In this paper, six
classification methods are proposed, all based on the data-mining domain, which
strive to take advantage of the temporal character of the sounds. The definition and
comparison of these classification methods are undertaken using several approaches.
The main conclusions of this paper are that: (i) the sliding window method attained
the best results in the experiments presented, and even outperformed the hidden
Markov models usually employed in similar applications; (ii) noteworthy overall
classification performance has been obtained, which is an especially striking result
considering that the sounds analysed were affected by a highly noisy background;
(iii) instance selection for determining the sounds in the training dataset
offers better results than cross-validation techniques; and (iv) the
temporally-aware classifiers can obtain better performance than their
non-temporally-aware counterparts.
Consejería de Innovación, Ciencia y Empresa (Junta de Andalucía, Spain): excellence eSAPIENS number TIC 570
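The sliding-window idea behind conclusion (i) can be sketched as follows, assuming a stand-in per-frame classifier (a simple amplitude threshold, not the paper's) and majority voting over each window of frames:

```python
# Sliding-window classification sketch: classify each frame, then label
# each window position by majority vote, exploiting temporal continuity.
from collections import Counter

def frame_label(frame):
    """Stand-in per-frame classifier: a simple threshold rule."""
    return "call" if frame > 0.5 else "noise"

def sliding_window_labels(frames, width):
    labels = [frame_label(f) for f in frames]
    out = []
    for i in range(len(labels) - width + 1):
        window = labels[i:i + width]
        out.append(Counter(window).most_common(1)[0][0])
    return out

# One spurious high frame (0.9 amid low values) is smoothed away by the vote.
seq = [0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.7]
print(sliding_window_labels(seq, 3))
```

The vote suppresses isolated misclassified frames, which is one plausible reason a temporally-aware method copes better with a noisy background than a frame-by-frame classifier.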
An Automatic Interaction Detection Hybrid Model for Bankcard Response Classification
In this paper, we propose a hybrid bankcard response model, which integrates
decision tree based chi-square automatic interaction detection (CHAID) into
logistic regression. In the first stage of the hybrid model, CHAID analysis is
used to detect potential variable interactions. Then, in the second
stage, these potential interactions serve as additional input
variables in logistic regression. The motivation of the proposed hybrid model
is that adding variable interactions may improve the performance of logistic
regression. To demonstrate the effectiveness of the proposed hybrid model, it
is evaluated on a real credit customer response data set. As the results
reveal, by identifying potential interactions among independent variables, the
proposed hybrid approach outperforms the logistic regression without searching
for interactions in terms of classification accuracy, the area under the
receiver operating characteristic curve (ROC), and Kolmogorov-Smirnov (KS)
statistics. Furthermore, CHAID analysis for interaction detection is much more
computationally efficient than a stepwise search, and some identified
interactions are shown to have statistically significant predictive power on
the target variable. Last but not least, the customer profile created based on
the CHAID tree provides a reasonable interpretation of the interactions, which
is required by regulations of the credit industry. Hence, this study provides
an alternative for handling bankcard classification tasks.
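A rough sketch of the stage-one idea (illustrative; CHAID's full tree-growing procedure is not reproduced): cross two predictors and test the crossed variable's association with the response via a Pearson chi-square statistic. A strongly associated cross would then be added as an extra input to the stage-two logistic regression.

```python
# Interaction detection sketch via a Pearson chi-square statistic on a
# crossed variable, in the spirit of CHAID's chi-square splitting.
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    stat = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# Response is XOR of A and B: neither predictor alone is informative,
# but the crossed variable A x B is perfectly predictive.
A = [0, 0, 1, 1] * 5
B = [0, 1, 0, 1] * 5
y = [a ^ b for a, b in zip(A, B)]
crossed = [(a, b) for a, b in zip(A, B)]

weak = chi_square(A, y)        # no marginal signal
strong = chi_square(crossed, y)  # large: interaction detected
```

This is exactly the kind of interaction a main-effects-only logistic regression misses, and why feeding the crossed indicator in as an extra variable can improve it.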
Interpretable multiclass classification by MDL-based rule lists
Interpretable classifiers have recently witnessed an increase in attention
from the data mining community because they are inherently easier to understand
and explain than their more complex counterparts. Examples of interpretable
classification models include decision trees, rule sets, and rule lists.
Learning such models often involves optimizing hyperparameters, which typically
requires substantial amounts of data and may result in relatively large models.
In this paper, we consider the problem of learning compact yet accurate
probabilistic rule lists for multiclass classification. Specifically, we
propose a novel formalization based on probabilistic rule lists and the minimum
description length (MDL) principle. This results in virtually parameter-free
model selection that naturally trades off model complexity against goodness of
fit, by which overfitting and the need for hyperparameter tuning
are effectively avoided. Finally, we introduce the Classy algorithm, which
greedily finds rule lists according to the proposed criterion. We empirically
demonstrate that Classy selects small probabilistic rule lists that outperform
state-of-the-art classifiers when it comes to the combination of predictive
performance and interpretability. We show that Classy is insensitive to its
only parameter, i.e., the candidate set, and that compression on the training
set correlates with classification performance, validating our MDL-based
selection criterion.
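The MDL trade-off behind this kind of selection can be sketched as a generic two-part code (not the paper's exact encoding; the bits-per-rule cost is a made-up constant): total description length is model bits plus data bits, and the rule list with the smallest total wins.

```python
# Two-part MDL score sketch: L(model) + L(data | model).
import math

def data_bits(probs_for_each_point):
    """Data cost: negative log2-likelihood of the labels under the model."""
    return -sum(math.log2(p) for p in probs_for_each_point)

def total_bits(n_rules, bits_per_rule, probs):
    """Model cost (a flat per-rule charge here) plus data cost."""
    return n_rules * bits_per_rule + data_bits(probs)

# A 1-rule list fits 10 points with probability 0.7 each; a 3-rule list
# fits them with probability 0.9 each but pays more model bits.
small = total_bits(1, 8.0, [0.7] * 10)
large = total_bits(3, 8.0, [0.9] * 10)
```

Because extra rules must buy their model cost back in data compression, the criterion penalises complexity without any tuned hyperparameter, which matches the "virtually parameter-free" claim above.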
Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
The correct use of model evaluation, model selection, and algorithm selection
techniques is vital in academic machine learning research as well as in many
industrial settings. This article reviews different techniques that can be used
for each of these three subtasks and discusses the main advantages and
disadvantages of each technique with references to theoretical and empirical
studies. Further, recommendations are given to encourage best yet feasible
practices in research and applications of machine learning. Common methods,
such as the holdout method for model evaluation and selection, are covered,
though they are not recommended when working with small datasets. Different flavors of the
bootstrap technique are introduced for estimating the uncertainty of
performance estimates, as an alternative to confidence intervals via normal
approximation if bootstrapping is computationally feasible. Common
cross-validation techniques such as leave-one-out cross-validation and k-fold
cross-validation are reviewed, the bias-variance trade-off for choosing k is
discussed, and practical tips for the optimal choice of k are given based on
empirical evidence. Different statistical tests for algorithm comparisons are
presented, and strategies for dealing with multiple comparisons such as omnibus
tests and multiple-comparison corrections are discussed. Finally, alternative
methods for algorithm selection, such as the combined F-test 5x2
cross-validation and nested cross-validation, are recommended for comparing
machine learning algorithms when datasets are small.
Comment: v2: minor typo fixes
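As a small illustration of the k-fold idea reviewed above (an index-only sketch, not code from the article): each fold is held out exactly once for evaluation while the model trains on the remaining folds.

```python
# k-fold cross-validation index sketch: every point appears in exactly one
# test fold; train and test indices are disjoint in every split.

def kfold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train_idx, test_idx))
        start += size
    return splits

splits = kfold_indices(10, 5)
```

Larger k means more training data per split but higher variance and cost per estimate, which is the bias-variance trade-off in choosing k that the article discusses.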
Min-Max Kernels
The min-max kernel is a generalization of the popular resemblance kernel
(which is designed for binary data). In this paper, we demonstrate, through an
extensive classification study using kernel machines, that the min-max kernel
often provides an effective measure of similarity for nonnegative data.
Because the min-max kernel is nonlinear and might be difficult to use in
industrial applications with massive data, we show that it can be linearized
via hashing techniques. This allows practitioners to apply the min-max kernel
to large-scale applications using well-matured linear algorithms such as
linear SVM or logistic regression.
The previous remarkable work on consistent weighted sampling (CWS) produces
samples in the form of (i*, t*), where the i* records the location (and in
fact also the weights) information analogous to the samples produced by
classical minwise hashing on binary data. Because the t* is theoretically
unbounded, it was not immediately clear how to effectively implement CWS for
building large-scale linear classifiers. In this paper, we provide a simple
solution by discarding t* (which we refer to as the "0-bit" scheme). Via an
extensive empirical study, we show that this 0-bit scheme does not lose
essential information. We then apply the "0-bit" CWS for building linear
classifiers to approximate min-max kernel classifiers, as extensively validated
on a wide range of publicly available classification datasets. We expect this
work will generate interest among data mining practitioners who would like to
efficiently utilize the nonlinear information of non-binary and nonnegative
data.
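The min-max kernel itself has a simple closed form for nonnegative vectors, K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i), reducing to the resemblance (Jaccard) similarity on binary data. A direct computation:

```python
# Min-max kernel: K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i),
# defined for nonnegative vectors; equals Jaccard similarity on binary data.

def min_max_kernel(x, y):
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den

print(min_max_kernel([1, 0, 1, 1], [1, 1, 0, 1]))  # binary case: 2/4 = 0.5
print(min_max_kernel([0.5, 2.0], [1.0, 3.0]))      # 2.5/4 = 0.625
```

Computing this exactly for every pair is quadratic in the dataset size, which is why the abstract's hashing-based linearization matters at scale.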