Independence clustering (without a matrix)
The independence clustering problem is considered in the following
formulation: given a set of random variables, it is required to find the
finest partitioning of this set into clusters such that the
clusters are mutually independent. Since mutual independence is
the target, pairwise similarity measurements are of no use, and thus
traditional clustering algorithms are inapplicable. The distribution of the
random variables in the set is, in general, unknown, but a sample is available.
Thus, the problem is cast in terms of time series. Two forms of sampling are
considered: i.i.d. and stationary time series, with the main emphasis being on
the latter, more general, case. A consistent, computationally tractable
algorithm for each of the settings is proposed, and a number of open directions
for further research are outlined.
Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks
We present a procedure for effective estimation of entropy and mutual
information from small-sample data, and apply it to the problem of inferring
high-dimensional gene association networks. Specifically, we develop a
James-Stein-type shrinkage estimator, resulting in a procedure that is highly
efficient statistically as well as computationally. Despite its simplicity, we
show that it outperforms eight other entropy estimation procedures across a
diverse range of sampling scenarios and data-generating models, even in cases
of severe undersampling. We illustrate the approach by analyzing E. coli gene
expression data and computing an entropy-based gene-association network from
these data. A computer program is available that implements the
proposed shrinkage estimator.
Comment: 18 pages, 3 figures, 1 table
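As a rough illustration of what a James-Stein-type shrinkage entropy estimator can look like, the sketch below shrinks the observed cell frequencies toward a uniform target and then plugs the shrunken frequencies into the entropy formula. The abstract does not state the target or the shrinkage intensity; the uniform target, the intensity formula, and the function name `shrinkage_entropy` are assumptions made here for illustration, not necessarily the authors' program.

```python
import math

def shrinkage_entropy(counts):
    """Return (entropy in nats, shrinkage intensity) for a list of cell counts.

    Assumption: frequencies are shrunk toward the uniform distribution 1/p,
    with a data-driven intensity clipped to the interval [0, 1].
    """
    n = sum(counts)
    p = len(counts)
    if n <= 0 or p == 0:
        raise ValueError("need at least one cell and one observation")

    ml = [c / n for c in counts]   # maximum-likelihood frequencies
    target = 1.0 / p               # uniform shrinkage target (assumed)

    # Estimated intensity: variance of the ML frequencies divided by their
    # squared distance to the target.
    denom = (n - 1) * sum((target - q) ** 2 for q in ml)
    if denom == 0:
        lam = 1.0                  # ML estimate equals the target, or n == 1
    else:
        lam = (1.0 - sum(q * q for q in ml)) / denom
        lam = min(1.0, max(0.0, lam))

    shrunk = [lam * target + (1.0 - lam) * q for q in ml]
    entropy = -sum(q * math.log(q) for q in shrunk if q > 0)
    return entropy, lam
```

With very few observations the intensity moves toward 1 and the estimate approaches the entropy of the uniform distribution; with many observations it moves toward 0 and the estimate approaches the plain plug-in entropy.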
Advances in Feature Selection with Mutual Information
The selection of features that are relevant for a prediction or
classification problem is an important problem in many domains involving
high-dimensional data. Selecting features helps fight the curse of
dimensionality, improve the performance of prediction or classification
methods, and interpret the application. In a nonlinear context, mutual
information is widely used as a relevance criterion for features and sets of
features. Nevertheless, it suffers from at least three major limitations:
mutual information estimators depend on smoothing parameters, there is no
theoretically justified stopping criterion in the feature selection greedy
procedure, and the estimation itself suffers from the curse of dimensionality.
This chapter shows how to deal with these problems. The first two are
addressed by using resampling techniques that provide a statistical basis to
select the estimator parameters and to stop the search procedure. The third one
is addressed by modifying the mutual information criterion into a measure of
how features are complementary (and not only informative) for the problem at
hand.
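The idea of using resampling to stop a greedy search can be sketched as follows. This is a minimal illustration for discrete features, assuming a plug-in mutual information estimate and a permutation test on the candidate feature; the chapter's actual estimators and test are not given in the abstract, and the names `mutual_information`, `subset_mi`, and `greedy_select`, as well as the defaults `n_perm` and `alpha`, are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (nats) between two discrete sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def subset_mi(columns, target):
    """Mutual information between a set of feature columns (jointly) and the target."""
    return mutual_information(list(zip(*columns)), target)

def greedy_select(features, target, n_perm=200, alpha=0.05, seed=0):
    """Forward selection that stops when the best candidate fails a permutation test."""
    rng = random.Random(seed)
    selected, remaining = [], list(range(len(features)))
    while remaining:
        chosen = [features[j] for j in selected]
        scores = {j: subset_mi(chosen + [features[j]], target) for j in remaining}
        best = max(remaining, key=scores.get)

        # Null distribution: shuffle the candidate column so that it carries
        # no information, and see how often the score is matched anyway.
        exceed = 0
        for _ in range(n_perm):
            shuffled = list(features[best])
            rng.shuffle(shuffled)
            if subset_mi(chosen + [shuffled], target) >= scores[best] - 1e-12:
                exceed += 1
        p_value = (1 + exceed) / (1 + n_perm)

        if p_value > alpha:
            break                  # candidate is not significantly informative
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the shuffled candidate is scored jointly with the already selected features, the null distribution inherits the same upward bias that the plug-in estimate acquires as the subset grows, which is what makes the comparison usable as a stopping rule in this sketch.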