Supervised Classification: Quite a Brief Overview
The original problem of supervised classification considers the task of
automatically assigning objects to their respective classes on the basis of
numerical measurements derived from these objects. Classifiers are the tools
that implement the actual functional mapping from these measurements---also
called features or inputs---to the so-called class label---or output. The
fields of pattern recognition and machine learning study ways of constructing
such classifiers. The main idea behind supervised methods is that of learning
from examples: given a number of example input-output relations, to what extent
can the general mapping be learned that takes any new and unseen feature vector
to its correct class? This chapter provides a basic introduction to the
underlying ideas of how to approach a supervised classification problem. In
addition, it provides an overview of some specific classification techniques,
delves into the issues of object representation and classifier evaluation, and
(very) briefly covers some variations on the basic supervised classification
task that may also be of interest to the practitioner.
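The learning-from-examples idea sketches easily in code. Below is a minimal, self-contained illustration (mine, not the chapter's) using scikit-learn, with synthetic Gaussian data standing in for real measurements; the choice of logistic regression as the classifier is an arbitrary assumption.

```python
# Minimal sketch of supervised classification: learn the mapping from
# feature vectors (inputs) to class labels (outputs) from labeled examples,
# then predict labels for new, unseen feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for real measurements: two Gaussian classes in 2-D.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out some examples to measure generalization to unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)  # learn from examples
print("held-out accuracy:", clf.score(X_test, y_test))
```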
Optimal cross-validation in density estimation with the $L^2$-loss
We analyze the performance of cross-validation (CV) in the density estimation
framework with two purposes: (i) risk estimation and (ii) model selection. The
main focus is given to the so-called leave-$p$-out CV procedure (Lpo), where $p$
denotes the cardinality of the test set. Closed-form expressions are
established for the Lpo estimator of the risk of projection estimators. These
expressions provide a great improvement upon $V$-fold cross-validation in terms
of variability and computational complexity. From a theoretical point of view,
the closed-form expressions also enable one to study the Lpo performance in
terms of risk estimation. The optimality of leave-one-out (Loo), that is, Lpo
with $p=1$, is proved among CV procedures used for risk estimation. Two model
selection frameworks are also considered: estimation, as opposed to
identification. For estimation with finite sample size $n$, optimality is
achieved for $p$ large enough [with $p/n = o(1)$] to balance the overfitting
resulting from the structure of the model collection. For identification, model
selection consistency is established for Lpo as long as $p/n$ is conveniently
related to the rate of convergence of the best estimator in the collection:
(i) $p/n \to 1$ as $n \to +\infty$ with a parametric rate, and (ii) $p/n \leq 0.5$ with some
nonparametric estimators. These theoretical results are validated by simulation
experiments.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1240 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
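For intuition about what Lpo computes, here is a naive sketch (mine, not the paper's closed-form expressions) for a fixed-bin histogram, one of the simplest projection estimators: the $L^2$ CV score is averaged over every one of the $\binom{n}{p}$ test sets of size $p$. The brute-force enumeration below is exactly the combinatorial cost that the closed-form expressions remove; all names and the data are illustrative.

```python
# Naive leave-p-out (Lpo) CV for a fixed-bin histogram density estimator
# under the L2 loss. Enumerating all C(n, p) test sets is the combinatorial
# cost that closed-form Lpo expressions avoid; this is an illustrative
# sketch only.
from itertools import combinations
import numpy as np

def histogram_density(train, edges):
    """Histogram estimate: per-bin probability divided by bin width."""
    counts, _ = np.histogram(train, bins=edges)
    widths = np.diff(edges)
    return counts / (counts.sum() * widths)

def l2_cv_score(train, test, edges):
    """CV estimate of the L2 risk (up to a constant): int f^2 - 2 E[f(X)]."""
    f = histogram_density(train, edges)
    widths = np.diff(edges)
    quad = np.sum(f ** 2 * widths)  # integral of f_hat squared
    # Map each test point to its bin (clipped to the support).
    idx = np.clip(np.searchsorted(edges, test, side="right") - 1,
                  0, len(f) - 1)
    return quad - 2.0 * f[idx].mean()

def lpo_risk(x, p, edges):
    """Average the CV score over every size-p test set: O(C(n, p)) work."""
    scores = [l2_cv_score(np.delete(x, idx), x[list(idx)], edges)
              for idx in combinations(range(len(x)), p)]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
x = rng.normal(size=30)
edges = np.linspace(-4, 4, 9)  # 8 equal-width bins
print("Lpo risk estimate, p=2:", lpo_risk(x, 2, edges))
```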
On Machine-Learned Classification of Variable Stars with Sparse and Noisy Time-Series Data
With the coming data deluge from synoptic surveys, there is a growing need
for frameworks that can quickly and automatically produce calibrated
classification probabilities for newly-observed variables based on a small
number of time-series measurements. In this paper, we introduce a methodology
for variable-star classification, drawing from modern machine-learning
techniques. We describe how to homogenize the information gleaned from light
curves by selection and computation of real-numbered metrics ("features"),
detail methods to robustly estimate periodic light-curve features, introduce
tree-ensemble methods for accurate variable star classification, and show how
to rigorously evaluate the classification results using cross-validation. On a
25-class data set of 1542 well-studied variable stars, we achieve a 22.8%
overall classification error using the random forest classifier; this
represents a 24% improvement over the best previous classifier on these data.
This methodology is effective for identifying samples of specific science
classes: for pulsational variables used in Milky Way tomography we obtain a
discovery efficiency of 98.2% and for eclipsing systems we find an efficiency
of 99.1%, both at 95% purity. We show that the random forest (RF) classifier is
superior to other machine-learned methods in terms of accuracy, speed, and
relative immunity to features with no useful class information; the RF
classifier can also be used to estimate the importance of each feature in
classification. Additionally, we present the first astronomical use of
hierarchical classification methods to incorporate a known class taxonomy in
the classifier, which further reduces the catastrophic error rate to 7.8%.
Excluding low-amplitude sources, our overall error rate improves to 14%, with a
catastrophic error rate of 3.5%.
Comment: 23 pages, 9 figures
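The evaluation pipeline the abstract describes (a random forest, scored by cross-validation, plus per-feature importances) can be sketched with scikit-learn. The snippet below uses synthetic data, and the feature names are hypothetical stand-ins for the paper's periodic light-curve features; it is not the authors' code.

```python
# Sketch of the described pipeline: random forest classification, evaluated
# with cross-validation, plus feature-importance estimates. Synthetic data;
# feature names are illustrative stand-ins for light-curve features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

feature_names = ["period", "amplitude", "skew", "color"]  # hypothetical
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)

# Cross-validation gives the honest error estimate the abstract calls for.
scores = cross_val_score(rf, X, y, cv=10)
print(f"CV error: {1 - scores.mean():.3f}")

# The forest also ranks features by how much class information they carry;
# uninformative features receive low importance, which is the basis of the
# abstract's point about immunity to useless features.
rf.fit(X, y)
for name, imp in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```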
Predicting trend reversals using market instantaneous state
Collective behaviours taking place in financial markets reveal strongly
correlated states especially during a crisis period. A natural hypothesis is
that trend reversals are also driven by mutual influences between the different
stock exchanges. Using a maximum entropy approach, we find coordinated
behaviour during trend reversals dominated by the pairwise component. In
particular, these events are predicted with highly significant accuracy by the
ensemble's instantaneous state.
Comment: 18 pages, 15 figures
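A maximum entropy model with a pairwise component corresponds to an Ising-like distribution, $P(s) \propto \exp(\sum_i h_i s_i + \frac{1}{2}\sum_{i \neq j} J_{ij} s_i s_j)$, over binarized returns. Below is a toy fit by exact gradient ascent on the log-likelihood, tractable only for a handful of assets since it enumerates all $2^N$ states; the data are synthetic and the whole sketch is an assumption about the setup, not the paper's implementation.

```python
# Toy pairwise maximum-entropy (Ising-like) model of binarized market states,
# fit by exact moment matching / gradient ascent on the log-likelihood.
# Enumerating the 2^N states keeps the fit exact but limits N; all data here
# are synthetic placeholders for binarized index returns.
import itertools
import numpy as np

N = 5
rng = np.random.default_rng(0)
# Binarized "returns" of N exchanges: s_i = +1 (up) or -1 (down).
data = rng.choice([-1, 1], size=(2000, N))

states = np.array(list(itertools.product([-1, 1], repeat=N)))  # 2^N states

def model_moments(h, J):
    """Exact <s_i> and <s_i s_j> under the pairwise model."""
    energy = states @ h + 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    p = np.exp(energy - energy.max())
    p /= p.sum()
    m = p @ states                # <s_i>
    c = (states.T * p) @ states   # <s_i s_j>
    return m, c

# Empirical moments the maximum-entropy model must match.
m_data = data.mean(axis=0)
c_data = data.T @ data / len(data)

h = np.zeros(N)
J = np.zeros((N, N))
for _ in range(500):              # gradient ascent on the log-likelihood
    m, c = model_moments(h, J)
    h += 0.1 * (m_data - m)
    J += 0.1 * (c_data - c)
    np.fill_diagonal(J, 0.0)      # no self-couplings

# The fitted (h, J) assign a probability to any instantaneous market state,
# the quantity used above to flag likely trend reversals.
```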