2,430 research outputs found
Resampling methods for parameter-free and robust feature selection with mutual information
Combining the mutual information criterion with a forward feature selection
strategy offers a good trade-off between optimality of the selected feature
subset and computation time. However, it requires setting the parameter(s) of
the mutual information estimator and to determine when to halt the forward
procedure. These two choices are difficult to make because, as the
dimensionality of the subset increases, the estimation of the mutual
information becomes less and less reliable. This paper proposes to use
resampling methods, a K-fold cross-validation and the permutation test, to
address both issues. The resampling methods bring information about the
variance of the estimator, information which can then be used to automatically
set the parameter and to calculate a threshold to stop the forward procedure.
The procedure is illustrated on a synthetic dataset as well as on real-world
examples.
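The permutation-test idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram-based estimator `binned_mi`, the helper `permutation_threshold`, the bin count, the number of permutations, and the 95% level are all assumptions chosen for illustration, and only a single candidate feature is tested rather than a full forward search.

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram.
    Assumption: a simple binned estimator stands in for the paper's estimator."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def permutation_threshold(x, y, n_perm=200, level=0.95, seed=0):
    """Quantile of the MI estimate when the target is shuffled, which
    destroys any dependence between x and y while keeping the marginals."""
    rng = np.random.default_rng(seed)
    null = [binned_mi(x, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.quantile(null, level))

# Hypothetical synthetic data: one relevant and one irrelevant feature.
rng = np.random.default_rng(1)
n = 500
relevant = rng.normal(size=n)
irrelevant = rng.normal(size=n)
target = np.sin(relevant) + 0.1 * rng.normal(size=n)

mi_relevant = binned_mi(relevant, target)
thr_relevant = permutation_threshold(relevant, target)
keep_relevant = mi_relevant > thr_relevant

for name, feat in [("relevant", relevant), ("irrelevant", irrelevant)]:
    mi = binned_mi(feat, target)
    thr = permutation_threshold(feat, target)
    print(f"{name}: MI={mi:.3f}, threshold={thr:.3f}, keep={mi > thr}")
```

In a forward search, a candidate feature would be added only while its estimated mutual information with the target exceeds the permutation threshold; the search halts once no candidate passes.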
Unsupervised robust nonparametric learning of hidden community properties
We consider learning of fundamental properties of communities in large noisy
networks, in the prototypical situation where the nodes or users are split into
two classes according to a binary property, e.g., according to their opinions
or preferences on a topic. For learning these properties, we propose a
nonparametric, unsupervised, and scalable graph scan procedure that is, in
addition, robust against a class of powerful adversaries. In our setup, one of
the communities can fall under the influence of a knowledgeable adversarial
leader, who knows the full network structure, has unlimited computational
resources and can completely foresee our planned actions on the network. We
prove strong consistency of our results in this setup with minimal assumptions.
In particular, the learning procedure estimates the baseline activity of normal
users asymptotically correctly with probability 1; the only assumption being
the existence of a single implicit community of asymptotically negligible
logarithmic size. We provide experiments on real and synthetic data to
illustrate the performance of our method, including examples with adversaries.
A new class of multiscale lattice cell (MLC) models for spatio-temporal evolutionary image representation
Spatio-temporal evolutionary (STE) images are a class of complex dynamical systems that evolve over both space and time. With increased interest in the investigation of nonlinear complex phenomena, especially spatio-temporal behaviour governed by evolutionary laws that depend on both spatial and temporal dimensions, there has been an increased need to investigate model identification methods for this class of complex systems. Compared with purely temporal processes, the identification of spatio-temporal models from observed images is much more difficult. Starting with the assumption that there is no a priori information about the true model and that only observed data are available, this study introduces a new class of multiscale lattice cell (MLC) models to represent the rules of the associated spatio-temporal evolutionary system. An application to a chemical reaction exhibiting spatio-temporal evolutionary behaviour is investigated to demonstrate the new modelling framework.
Estimating Gene Interactions Using Information Theoretic Functionals
With an abundance of data resulting from high-throughput technologies such as DNA microarrays,
a race has been on over the last few years to determine the structures and functions of genes and
their products, the proteins. Inference of gene interactions lies at the core of these efforts.
In all this activity, three important research issues have emerged. First, in much of the current
literature on gene regulatory networks, dependencies among variables (in our case, genes) are
assumed to be linear in nature, when in fact, in real-life scenarios, this is seldom the case.
This disagreement leads to systematic deviation and biased evaluation. Second, although
the problem of undersampling features in every piece of work as one of the major causes of
poor results, in practice it is overlooked and rarely addressed explicitly. Finally, inference
of network structures, although based on rigorous mathematical foundations and computational
optimizations, often displays poor fitness values and biologically unrealistic link structures, due
to a large extent to the discovery of pairwise-only interactions.
In our search for robust, nonlinear measures of dependency, we advocate that mutual information
and related information theoretic functionals (conditional mutual information, total
correlation) are possibly the most suitable candidates to capture both linear and nonlinear
interactions between variables, and resolve higher order dependencies.
To address these issues, we researched and implemented, under a common framework, a selection
of nonparametric estimators of mutual information for continuous variables. Their assessment
focused on their robustness to limited sample sizes and their extensibility to higher
dimensions, which is important for the detection of more complex interaction structures. Two
assessment scenarios were carried out: one with simulated data, and one in which the estimators
were bootstrapped in state-of-the-art network inference algorithms and their predictive power
and sensitivity were monitored. The tests revealed that, in small-sample regimes, there is a
significant difference in the performance of different estimators, and that naive methods such as
uniform binning gave consistently poor results compared with more sophisticated methods.
Finally, a custom, modular mechanism is proposed for the inference of gene interactions,
targeting the identification of some of the most common substructures in genetic networks,
which we believe will help improve accuracy and predictability scores.
Critical values of a kernel density-based mutual information estimator
Copyright © 2006 IEEE. Recently, mutual information (MI) has become widely recognized as a statistical measure of dependence that is suitable for applications where data are non-Gaussian, or where the dependency between variables is non-linear. However, a significant disadvantage of this measure is the inability to define an analytical expression for the distribution of MI estimators, which are based upon a finite dataset. This paper deals specifically with a popular kernel density based estimator, for which the distribution is determined empirically using Monte Carlo simulation. The application of the critical values of MI derived from this distribution to a test for independence is demonstrated within the context of a benchmark input variable selection problem.
http://www.okstate.edu/elec-engr/faculty/yen/wcci/WCCI-Web_ProgramList_F.htm
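As a rough illustration of how such critical values could be obtained, the sketch below simulates independent samples, evaluates a kernel density based MI estimate on each, and takes an upper quantile of the results. The estimator details are assumptions, not taken from the paper: a Gaussian product kernel, a rule-of-thumb bandwidth of n^(-1/6) on standardised data, a resubstitution average, standard normal null data, and the hypothetical names `kde_mi` and `mc_critical_value`.

```python
import numpy as np

def kde_mi(x, y, h=None):
    """Resubstitution MI estimate (nats) using a Gaussian product kernel.
    Assumption: the same bandwidth h is used for both variables."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = x.size
    if h is None:
        h = n ** (-1.0 / 6.0)  # assumed rule-of-thumb bandwidth for 2-D data
    kx = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    ky = np.exp(-0.5 * ((y[:, None] - y[None, :]) / h) ** 2)
    # Kernel normalising constants cancel in the ratio p(x,y) / (p(x) p(y)).
    pxy = (kx * ky).mean(axis=1)
    px = kx.mean(axis=1)
    py = ky.mean(axis=1)
    return float(np.mean(np.log(pxy / (px * py))))

def mc_critical_value(n, n_sim=200, level=0.95, seed=0):
    """Empirical quantile of the estimator under independence,
    obtained by Monte Carlo simulation for sample size n."""
    rng = np.random.default_rng(seed)
    null = [kde_mi(rng.normal(size=n), rng.normal(size=n))
            for _ in range(n_sim)]
    return float(np.quantile(null, level))

crit = mc_critical_value(n=100)

# Hypothetical example: test a strongly dependent pair against the critical value.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = x + 0.3 * rng.normal(size=100)
mi_xy = kde_mi(x, y)
dependent = mi_xy > crit
print(f"MI={mi_xy:.3f}, critical value={crit:.3f}, reject independence={dependent}")
```

Because the null distribution depends on the sample size and the bandwidth, the critical value would need to be recomputed whenever either changes.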
- …