Resampling methods for parameter-free and robust feature selection with mutual information
Combining the mutual information criterion with a forward feature selection
strategy offers a good trade-off between optimality of the selected feature
subset and computation time. However, it requires setting the parameter(s) of
the mutual information estimator and determining when to halt the forward
procedure. These two choices are difficult to make because, as the
dimensionality of the subset increases, the estimation of the mutual
information becomes less and less reliable. This paper proposes to use
resampling methods, namely K-fold cross-validation and the permutation test, to
address both issues. The resampling methods bring information about the
variance of the estimator, information which can then be used to automatically
set the parameter and to calculate a threshold to stop the forward procedure.
The procedure is illustrated on a synthetic dataset as well as on real-world
examples.
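As an informal illustration of the scheme this abstract describes, the permutation test can supply a data-driven stopping threshold for the forward search. A minimal sketch, assuming a simple equal-width histogram MI estimator (the bin count standing in for the smoothing parameter the paper tunes by cross-validation); all function names are illustrative:

```python
import math
import random

def mutual_info(x, y, bins=4):
    """Equal-width histogram (plug-in) estimator of mutual information,
    in nats. The bin count plays the role of the smoothing parameter
    that the paper proposes to tune by K-fold cross-validation."""
    n = len(x)
    def binned(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0
        return [min(int((t - lo) / w), bins - 1) for t in v]
    bx, by = binned(x), binned(y)
    joint, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    # I(X;Y) = sum over cells of p(a,b) * log( p(a,b) / (p(a) p(b)) )
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def permutation_threshold(x, y, n_perm=200, alpha=0.05, seed=0):
    """Permutation test: shuffling y breaks any dependence with x, so
    the (1 - alpha) quantile of the MI values obtained under shuffling
    gives a threshold; the forward procedure halts once the best
    remaining candidate feature no longer exceeds it."""
    rng = random.Random(seed)
    y_shuf = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(y_shuf)
        null.append(mutual_info(x, y_shuf))
    null.sort()
    return null[int((1 - alpha) * n_perm)]
```

A candidate whose estimated MI with the target exceeds the threshold is added to the subset; when none does, the forward search stops.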
Advances in Feature Selection with Mutual Information
The selection of features that are relevant for a prediction or
classification problem is an important task in many domains involving
high-dimensional data. Selecting features helps fight the curse of
dimensionality, improves the performance of prediction or classification
methods, and aids interpretation of the application. In a nonlinear context,
mutual information is widely used as a relevance criterion for features and sets of
features. Nevertheless, it suffers from at least three major limitations:
mutual information estimators depend on smoothing parameters, there is no
theoretically justified stopping criterion in the feature selection greedy
procedure, and the estimation itself suffers from the curse of dimensionality.
This chapter shows how to deal with these problems. The first two are
addressed by using resampling techniques that provide a statistical basis to
select the estimator parameters and to stop the search procedure. The third one
is addressed by modifying the mutual information criterion into a measure of
how features are complementary (and not only informative) for the problem at
hand.
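The complementarity idea can be made concrete with a toy score; the abstract does not spell out the chapter's exact measure, so the quantity below, I((x1, x2); y) - max(I(x1; y), I(x2; y)), is only an illustrative stand-in:

```python
import math
from collections import Counter

def mi_discrete(xs, ys):
    """Plug-in mutual information (in nats) for discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def complementarity(x1, x2, y):
    """How much two features say about y jointly beyond what the better
    one says alone. Illustrative stand-in, not the chapter's exact
    criterion."""
    joint = list(zip(x1, x2))
    return mi_discrete(joint, y) - max(mi_discrete(x1, y),
                                       mi_discrete(x2, y))
```

On an XOR target each feature alone carries (almost) no information while the pair determines y exactly, so the score approaches log 2: the textbook case of features that are complementary rather than individually informative.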
Measures of Analysis of Time Series (MATS): A MATLAB Toolkit for Computation of Multiple Measures on Time Series Data Bases
In many applications, such as physiology and finance, large time series
databases must be analyzed, requiring the computation of linear, nonlinear and
other measures. Such measures have been developed and implemented in commercial
and freeware software packages rather selectively and independently. The Measures of
Analysis of Time Series ({\tt MATS}) {\tt MATLAB} toolkit is designed to handle
an arbitrarily large set of scalar time series and compute a large variety of
measures on them, allowing for the specification of varying measure parameters
as well. The variety of options, with added facilities for visualization of the
results, supports different settings of time series analysis, such as the
detection of dynamics changes in long data records, resampling (surrogate or
bootstrap) tests for independence and linearity with various test statistics,
and discrimination power of different measures and for different combinations
of their parameters. The basic features of {\tt MATS} are presented and the
implemented measures are briefly described. The usefulness of {\tt MATS} is
illustrated on some empirical examples along with screenshots.
Comment: 25 pages, 9 figures, two tables; the software can be downloaded at
http://eeganalysis.web.auth.gr/indexen.ht
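As a hedged sketch of one resampling test of the kind the toolkit offers (MATS itself is MATLAB; the shuffle surrogates and lag-1 autocorrelation statistic below are a simple illustrative choice, not the toolkit's API):

```python
import random

def lag1_autocorr(x):
    """Lag-1 autocorrelation, a simple linear test statistic."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    return sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1)) / var

def surrogate_independence_test(x, n_surr=199, seed=0):
    """Resampling test for serial independence: random shuffles of the
    series destroy temporal order while preserving the amplitude
    distribution, giving surrogate realizations of the independence
    null. Returns a p-value for the lag-1 autocorrelation."""
    rng = random.Random(seed)
    stat = abs(lag1_autocorr(x))
    xs = list(x)
    count = 1  # include the original, per the usual permutation convention
    for _ in range(n_surr):
        rng.shuffle(xs)
        if abs(lag1_autocorr(xs)) >= stat:
            count += 1
    return count / (n_surr + 1)
```

For an AR(1) series the observed autocorrelation far exceeds every shuffled surrogate, so the test rejects independence; the same template works for nonlinear statistics by swapping in a different measure.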
Editorial Comment on the Special Issue of "Information in Dynamical Systems and Complex Systems"
This special issue collects contributions from the participants of the
"Information in Dynamical Systems and Complex Systems" workshop, which cover a
wide range of important problems and new approaches that lie in the
intersection of information theory and dynamical systems. The contributions
include theoretical characterization and understanding of the different types
of information flow and causality in general stochastic processes, inference
and identification of coupling structure and parameters of system dynamics,
rigorous coarse-grain modeling of network dynamical systems, and exact
statistical testing of fundamental information-theoretic quantities such as the
mutual information. The collective efforts reported herein reflect a modern
perspective of the intimate connection between dynamical systems and
information flow, leading to the promise of better understanding and modeling
of natural complex systems and better or even optimal design of engineering
systems.
Multiscale relevance and informative encoding in neuronal spike trains
Neuronal responses to complex stimuli and tasks can encompass a wide range of
time scales. Understanding these responses requires measures that characterize
how the information in these response patterns is represented across multiple
temporal resolutions. In this paper we propose a metric -- which we call
multiscale relevance (MSR) -- to capture the dynamical variability of the
activity of single neurons across different time scales. The MSR is a
non-parametric, fully featureless indicator in that it uses only the time
stamps of the firing activity without resorting to any a priori covariate or
invoking any specific structure in the tuning curve for neural activity. When
applied to neural data from the mEC and from the ADn and PoS regions of
freely-behaving rodents, we found that neurons having low MSR tend to have low
mutual information and low firing sparsity across the correlates that are
believed to be encoded by the region of the brain where the recordings were
made. In addition, neurons with high MSR contain significant information on
spatial navigation and allow decoding of spatial position or head direction as
efficiently as those neurons whose firing activity has high mutual information
with the covariate to be decoded and significantly better than the set of
neurons with high local variations in their interspike intervals. Given these
results, we propose that the MSR can be used as a measure to rank and select
neurons for their information content without the need to appeal to any a
priori covariate.
Comment: 38 pages, 16 figures
Diverse correlation structures in gene expression data and their utility in improving statistical inference
It is well known that correlations in microarray data represent a serious
nuisance deteriorating the performance of gene selection procedures. This paper
is intended to demonstrate that the correlation structure of microarray data
provides a rich source of useful information. We discuss distinct correlation
substructures revealed in microarray gene expression data by an appropriate
ordering of genes. These substructures include stochastic proportionality of
expression signals in a large percentage of all gene pairs, negative
correlations hidden in ordered gene triples, and a long sequence of weakly
dependent random variables associated with ordered pairs of genes. The reported
striking regularities are of general biological interest and they also have
far-reaching implications for theory and practice of statistical methods of
microarray data analysis. We illustrate the latter point with a method for
testing differential expression of nonoverlapping gene pairs. While designed
for testing a different null hypothesis, this method provides an order of
magnitude more accurate control of type 1 error rate compared to conventional
methods of individual gene expression profiling. In addition, this method is
robust to technical noise. Quantitative inference of the correlation
structure has the potential to extend the analysis of microarray data far
beyond currently practiced methods.
Comment: Published at http://dx.doi.org/10.1214/07-AOAS120 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
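The noise-cancelling benefit of working with correlated gene pairs can be shown with a toy statistic. This is an illustration of why pairing helps against sample-level technical noise, not the authors' actual test: within each sample, take the difference of the two genes' (log-)expressions, so noise common to both cancels, then form a one-sample t-statistic on the differences.

```python
import math
import statistics

def paired_gene_statistic(g1, g2):
    """Toy pair-based test statistic (hypothetical helper): per-sample
    differences cancel multiplicative technical noise shared by both
    genes; the one-sample t-statistic then tests a shift between them."""
    d = [a - b for a, b in zip(g1, g2)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
```

When sample-level noise dominates, the differences have far smaller spread than either gene alone, which is the mechanism behind the reported order-of-magnitude gain in type 1 error control.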