Graph Estimation From Multi-attribute Data
Many real-world network problems concern multivariate nodal attributes
such as image, textual, and multi-view feature vectors on nodes, rather than
simple univariate nodal attributes. The existing graph estimation methods built
on Gaussian graphical models and covariance selection algorithms cannot handle
such data, nor can the theories developed around such methods be directly
applied. In this paper, we propose a new principled framework for estimating
graphs from multi-attribute data. Instead of estimating the partial correlation
as in current literature, our method estimates the partial canonical
correlations that naturally accommodate complex nodal features.
Computationally, we provide an efficient algorithm which utilizes the
multi-attribute structure. Theoretically, we provide sufficient conditions
which guarantee consistent graph recovery. Extensive simulation studies
demonstrate the performance of our method under various conditions. Furthermore, we
provide illustrative applications to uncovering gene regulatory networks from
gene and protein profiles, and uncovering brain connectivity graphs from
functional magnetic resonance imaging data.
Comment: Extended simulation study. Added an application to a new data se
Weighted k-Nearest-Neighbor Techniques and Ordinal Classification
In the field of statistical discrimination, k-nearest-neighbor classification is a well-known, easy and successful method. In this paper we present an extended version of this technique, where the distances of the nearest neighbors can be taken into account. In this sense there is a close connection to LOESS, a local regression technique. In addition, we show possibilities to use nearest-neighbor methods for classification in the case of an ordinal class structure. Empirical studies show the advantages of the new techniques
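As a rough illustration of taking neighbor distances into account, the Python sketch below weights each of the k nearest neighbors by its inverse distance to the query. The inverse-distance weight, the function name `weighted_knn_predict`, and the toy data are assumptions made for illustration; the abstract does not state which weighting scheme the paper uses.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a distance-weighted vote among its k nearest neighbors.

    Assumption: weights are 1 / distance. This is only one of many possible
    weighting schemes and is not taken from the paper.
    """
    # Euclidean distance from the query to every training point
    dists = [(math.dist(p, query), label)
             for p, label in zip(train_points, train_labels)]
    dists.sort(key=lambda pair: pair[0])

    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Small constant avoids division by zero for exact matches
        votes[label] += 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

# Toy data: one close "a" point and two distant "b" points
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
labels = ["a", "b", "b"]
prediction = weighted_knn_predict(points, labels, (0.1, 0.1), k=3)
print(prediction)  # prints "a"
```

With these toy points an unweighted majority vote over the three neighbors would pick "b", while the weighted vote picks "a" because the single "a" neighbor is much closer to the query.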
Feature selection for microarray gene expression data using simulated annealing guided by the multivariate joint entropy
In this work a new way to calculate the multivariate joint entropy is presented. This measure is the basis for a fast information-theoretic evaluation of gene relevance in a microarray gene expression data context. Its low complexity is based on the reuse of previous computations to calculate current feature relevance. The mu-TAFS algorithm (named as such to differentiate it from previous TAFS algorithms) implements a simulated annealing technique specially designed for feature subset selection. The algorithm is applied to the maximization of gene subset relevance in several public-domain microarray data sets. The experimental results show notably high classification performance and small subsets formed by biologically meaningful genes.
Postprint (published version
Partial least squares discriminant analysis: A dimensionality reduction method to classify hyperspectral data
The recent development of more sophisticated spectroscopic methods allows acquisition of high dimensional datasets from which valuable information may be extracted using multivariate statistical analyses, such as dimensionality reduction and automatic classification (supervised and unsupervised). In this work, a supervised classification through a partial least squares discriminant analysis (PLS-DA) is performed on the hyperspectral data. The obtained results are compared with those obtained by the most commonly used classification approaches
Bandwidth choice for nonparametric classification
It is shown that, for kernel-based classification with univariate
distributions and two populations, optimal bandwidth choice has a dichotomous
character. If the two densities cross at just one point, where their curvatures
have the same signs, then minimum Bayes risk is achieved using bandwidths which
are an order of magnitude larger than those which minimize pointwise estimation
error. On the other hand, if the curvature signs are different, or if there are
multiple crossing points, then bandwidths of conventional size are generally
appropriate. The range of different modes of behavior is narrower in
multivariate settings. There, the optimal size of bandwidth is generally the
same as that which is appropriate for pointwise density estimation. These
properties motivate empirical rules for bandwidth choice.
Comment: Published at http://dx.doi.org/10.1214/009053604000000959 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org
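
For orientation, kernel-based classification with two univariate populations can be sketched as follows: estimate each population's density with a kernel estimator and assign a point to the population whose prior-weighted estimate is larger. The Gaussian kernel, the equal priors, the function names, and the bandwidth values below are illustrative assumptions; they are not the empirical bandwidth rules the abstract refers to.

```python
import math

def gaussian_kde(sample, x, h):
    """Kernel density estimate at x with a Gaussian kernel and bandwidth h."""
    n = len(sample)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return total / (n * h * math.sqrt(2.0 * math.pi))

def kernel_classify(sample0, sample1, x, h0, h1, prior0=0.5):
    """Return 0 or 1: the population with the larger prior-weighted density at x.

    Assumption: priors are known (equal by default) and one bandwidth is
    supplied per population; how to choose h0 and h1 is the paper's topic
    and is not implemented here.
    """
    f0 = prior0 * gaussian_kde(sample0, x, h0)
    f1 = (1.0 - prior0) * gaussian_kde(sample1, x, h1)
    return 0 if f0 >= f1 else 1

# Toy samples from two well-separated populations
sample0 = [-1.0, -0.5, 0.0]
sample1 = [2.0, 2.5, 3.0]
label_left = kernel_classify(sample0, sample1, -0.2, h0=0.5, h1=0.5)
label_right = kernel_classify(sample0, sample1, 2.4, h0=0.5, h1=0.5)
print(label_left, label_right)  # prints "0 1"
```

Changing `h0` and `h1` moves the decision boundary where the two estimated densities cross, which is why the bandwidth that is best for classification can differ from the one that is best for pointwise density estimation.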