Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution is often evolving over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present; therefore, they should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors with datasets that have temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure for classification performance that takes into account temporal dependence, and we recommend using it as the main performance measure in classification of streaming data.
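The claim that accuracy and the Kappa statistic can hide poor performance under temporal dependence can be illustrated with a small sketch. This is not the paper's methodology or its proposed measure: the synthetic stream generator, the `stay_prob` parameter, and all function names are assumptions made for the illustration. A "no-change" predictor that simply repeats the previous label learns nothing from features, yet it scores well on both measures when consecutive labels are correlated.

```python
import random
from collections import Counter

def make_stream(n, stay_prob, seed=0):
    """Hypothetical binary label stream: each label repeats the previous
    one with probability stay_prob (this creates temporal dependence)."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1)]
    for _ in range(n - 1):
        prev = labels[-1]
        labels.append(prev if rng.random() < stay_prob else 1 - prev)
    return labels

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and true label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

stream = make_stream(10_000, stay_prob=0.95)
y_true = stream[1:]            # labels to be predicted
persistent_pred = stream[:-1]  # no-change predictor: repeat the previous label

acc = accuracy(y_true, persistent_pred)
kappa = cohen_kappa(y_true, persistent_pred)
print(f"no-change predictor: accuracy={acc:.3f}, kappa={kappa:.3f}")
```

Because the no-change predictor uses no input features at all, high scores here say more about the stream than about any model, which is why a baseline of this kind is a useful reference point when labels are temporally dependent.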
Feature and Variable Selection in Classification
The amount of information in the form of features and variables available
to machine learning algorithms is ever increasing. This can lead to classifiers
that are prone to overfitting in high dimensions. High-dimensional models also do
not lend themselves to interpretable results, and the CPU and memory resources
necessary to run on high-dimensional datasets severely limit the applications of
the approaches. Variable and feature selection aim to remedy this by finding a
subset of features that in some way captures the information provided best. In
this paper we present the general methodology and highlight some specific
approaches.
Comment: Part of a master seminar in document analysis held by Marcus
Eichenberger-Liwick
Exploring Two Novel Features for EEG-based Brain-Computer Interfaces: Multifractal Cumulants and Predictive Complexity
In this paper, we introduce two new features for the design of
electroencephalography (EEG) based Brain-Computer Interfaces (BCI): one feature
based on multifractal cumulants, and one feature based on the predictive
complexity of the EEG time series. The multifractal cumulants feature measures
the signal regularity, while the predictive complexity measures the difficulty
of predicting the future of the signal from its past, and hence how complex
the signal is. We have conducted an evaluation of the performance of these two
novel features on EEG data corresponding to motor-imagery. We also compared
them to the most successful features used in the BCI field, namely the
Band-Power features. We evaluated these three kinds of features and their
combinations on EEG signals from 13 subjects. Results obtained show that our
novel features can lead to BCI designs with improved classification
performance, notably when combining the three kinds of features
(band-power, multifractal cumulants, predictive complexity).
Comment: Updated with more subjects. Separated out the band-power comparisons
in a companion article after reviewer feedback. Source code and companion
article are available at
http://nicolas.brodu.numerimoire.net/en/recherche/publication
The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Motivation: Biomarker discovery from high-dimensional data is a crucial
problem with enormous applications in biology and medicine. It is also
extremely challenging from a statistical viewpoint, but surprisingly few
studies have investigated the relative strengths and weaknesses of the plethora
of existing feature selection methods. Methods: We compare 32 feature selection
methods on 4 public gene expression datasets for breast cancer prognosis, in
terms of predictive performance, stability and functional interpretability of
the signatures they produce. Results: We observe that the feature selection
method has a significant influence on the accuracy, stability and
interpretability of signatures. Simple filter methods generally outperform more
complex embedded or wrapper methods, and ensemble feature selection generally
has no positive effect. Overall, a simple Student's t-test seems to
provide the best results. Availability: Code and data are publicly available at
http://cbio.ensmp.fr/~ahaury/
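As a rough illustration of what a t-test filter does, the sketch below scores each feature by the absolute value of a pooled-variance Student's t-statistic between the two classes and keeps the top k. It is not the authors' code (which is at the URL above): the synthetic data, the function names `student_t` and `select_top_k`, and the choice of k are all assumptions made for this example.

```python
import random
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

def select_top_k(X, y, k):
    """Rank features by |t| between class 1 and class 0; return top-k indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(student_t(pos, neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Hypothetical data: 40 samples, feature 0 is shifted by class, features 1-4 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(3.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
     for label in y]

selected = select_top_k(X, y, k=1)
print("selected feature indices:", selected)
```

A filter like this scores each feature independently of any classifier, which is what makes it cheap compared with wrapper or embedded methods.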