Optimal Clustering under Uncertainty
Classical clustering algorithms typically either lack an underlying
probability framework to make them predictive or focus on parameter estimation
rather than defining and minimizing a notion of error. Recent work addresses
these issues by developing a probabilistic framework based on the theory of
random labeled point processes and characterizing a Bayes clusterer that
minimizes the number of misclustered points. The Bayes clusterer is analogous
to the Bayes classifier. Whereas determining a Bayes classifier requires full
knowledge of the feature-label distribution, deriving a Bayes clusterer
requires full knowledge of the point process. When uncertain of the point
process, one would like to find a robust clusterer that is optimal over the
uncertainty, just as one may find optimal robust classifiers with uncertain
feature-label distributions. Herein, we derive an optimal robust clusterer by
first finding an effective random point process that incorporates all
randomness within its own probabilistic structure and from which a Bayes
clusterer can be derived that provides an optimal robust clusterer relative to
the uncertainty. This is analogous to the use of effective class-conditional
distributions in robust classification. After evaluating the performance of
robust clusterers in synthetic mixtures of Gaussians models, we apply the
framework to granular imaging, where we make use of the asymptotic
granulometric moment theory for granular images to relate robust clustering
theory to the application. Comment: 19 pages, 5 eps figures, 1 table
Nonparametric Feature Extraction from Dendrograms
We propose feature extraction from dendrograms in a nonparametric way. The
Minimax distance measures correspond to building a dendrogram with the single
linkage criterion and defining specific forms of a level function and a
distance function over it. We therefore extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances in solution space and in representation space, respectively, in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features, in the spirit of multi-layered architectures, to
obtain the final features. Finally, we demonstrate the effectiveness of our
approach via several numerical studies.
Unsupervised Representation Learning with Minimax Distance Measures
We investigate the use of Minimax distances to extract in a nonparametric way
the features that capture the unknown underlying patterns and structures in the
data. We develop a general-purpose and computationally efficient framework to
employ Minimax distances with many machine learning methods that perform on
numerical data. We study both computing the pairwise Minimax distances for all
pairs of objects and computing the Minimax distances of all the objects
to/from a fixed (test) object.
We first efficiently compute the pairwise Minimax distances between the
objects, using the equivalence of Minimax distances over a graph and over a
minimum spanning tree constructed on it. Then, we embed the pairwise Minimax
distances into a new vector space, such that their squared Euclidean
distances in the new space equal the pairwise Minimax distances in
the original space. We also study the case of having multiple pairwise Minimax
matrices, instead of a single one. Thereby, we propose an embedding via first
summing up the centered matrices and then performing an eigenvalue
decomposition to obtain the relevant features.
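
The equivalence between Minimax distances over a graph and over a minimum spanning tree can be illustrated with a short sketch. This is not the authors' implementation: the function name `pairwise_minimax`, the use of Euclidean base distances, and the Kruskal-style merging are assumptions made for illustration. The idea is that when the spanning-tree edge of weight w joins two components, w is the largest edge on the tree path between every pair of objects across those components, so it is their Minimax distance. The embedding step is not shown.

```python
import math
from itertools import combinations


def pairwise_minimax(points):
    """Return the matrix of pairwise Minimax distances (hypothetical helper).

    Edges are processed in increasing order, as in Kruskal's algorithm.
    Each edge that joins two components is a spanning-tree edge, and its
    weight is the Minimax distance between all pairs across the two parts.
    """
    n = len(points)
    # Assumption: Euclidean base distance; any dissimilarity could be used.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    comp = list(range(n))                 # component label of each object
    members = {i: [i] for i in range(n)}  # objects in each component
    mm = [[0.0] * n for _ in range(n)]
    for w, i, j in edges:
        ci, cj = comp[i], comp[j]
        if ci == cj:
            continue  # edge is not part of the minimum spanning tree
        for a in members[ci]:
            for b in members[cj]:
                mm[a][b] = mm[b][a] = w
        # Merge the smaller component into the larger one.
        if len(members[ci]) < len(members[cj]):
            ci, cj = cj, ci
        for b in members[cj]:
            comp[b] = ci
        members[ci].extend(members.pop(cj))
    return mm


points = [(0, 0), (1, 0), (2, 0), (10, 0)]
mm = pairwise_minimax(points)
```

On these four collinear points, objects 0 and 2 are at base distance 2 but are connected through object 1 by edges of length 1, so their Minimax distance is 1; the isolated object 3 is at Minimax distance 8 from all the others.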
Next, we study computing Minimax distances from a fixed (test)
object, which can be used, for instance, in K-nearest neighbor search. As in
the all-pair case, we develop an efficient and
general-purpose algorithm that is applicable with an arbitrary base distance
measure. Moreover, we investigate in detail the edges selected by the Minimax
distances and thereby explore the ability of Minimax distances to detect
outlier objects.
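
As a rough illustration of Minimax distances from a fixed object, the sketch below uses a standard bottleneck-path variant of Dijkstra's algorithm, where the cost of a path is its largest edge rather than the sum of its edges. This is a generic technique and not necessarily the algorithm developed in the paper; the function name `minimax_from` and the `base_dist` parameter are hypothetical.

```python
import heapq
import math


def minimax_from(source, objects, base_dist=math.dist):
    """Minimax distance from objects[source] to every object (sketch).

    Assumption: the graph is complete and base_dist is any symmetric,
    non-negative dissimilarity between two objects.
    """
    n = len(objects)
    best = [math.inf] * n
    best[source] = 0.0
    done = [False] * n
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue  # stale heap entry
        done[u] = True
        for v in range(n):
            if not done[v]:
                # Path cost is the largest edge seen so far on the path.
                cand = max(d, base_dist(objects[u], objects[v]))
                if cand < best[v]:
                    best[v] = cand
                    heapq.heappush(heap, (cand, v))
    return best


objects = [(0, 0), (1, 0), (2, 0), (10, 0)]
dists = minimax_from(0, objects)
# K nearest neighbors of the test object under Minimax distances (K = 2).
neighbors = sorted(range(1, len(objects)), key=lambda i: dists[i])[:2]
```

An object whose Minimax distance to the rest is large, such as the last point here, is reachable only through a long edge, which is the intuition behind using the selected edges for outlier detection.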
Finally, for each setting, we perform several experiments to demonstrate the
effectiveness of our framework. Comment: 32 pages
Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e.
when the number of data points may be much smaller than the number of
dimensions. Specifically, we consider a Gaussian mixture model (GMM) with
non-spherical Gaussian components, where the clusters are distinguished by only
a few relevant dimensions. The method we propose is a combination of a recent
approach for learning parameters of a Gaussian mixture model and sparse linear
discriminant analysis (LDA). In addition to cluster assignments, the method
returns an estimate of the set of features relevant for clustering. Our results
indicate that the sample complexity of clustering depends on the sparsity of
the relevant feature set, while only scaling logarithmically with the ambient
dimension. Additionally, we require much milder assumptions than existing work
on clustering in high dimensions. In particular, we require neither spherical
clusters nor mean separation along relevant dimensions. Comment: 11 pages, 1 figure
Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
While several papers have investigated computationally and statistically
efficient methods for learning Gaussian mixtures, precise minimax bounds for
their statistical performance as well as fundamental limits in high-dimensional
settings are not well understood. In this paper, we provide precise
information-theoretic bounds on the clustering accuracy and sample complexity of learning a
mixture of two isotropic Gaussians in high dimensions under small mean
separation. If there is a sparse subset of relevant dimensions that determines
the mean separation, then the sample complexity depends only on the number of
relevant dimensions and the mean separation, and can be achieved by a simple
computationally efficient procedure. Our results provide the first step of a
theoretical basis for recent methods that combine feature selection and
clustering.
Block-diagonal covariance selection for high-dimensional Gaussian graphical models
Gaussian graphical models are widely utilized to infer and visualize networks
of dependencies between continuous variables. However, inferring the graph is
difficult when the sample size is small compared to the number of variables. To
reduce the number of parameters to estimate in the model, we propose a
non-asymptotic model selection procedure supported by strong theoretical
guarantees based on an oracle inequality and a minimax lower bound. The
covariance matrix of the model is approximated by a block-diagonal matrix. The
structure of this matrix is detected by thresholding the sample covariance
matrix, where the threshold is selected using the slope heuristic. Based on the
block-diagonal structure of the covariance matrix, the estimation problem is
divided into several independent problems: subsequently, the network of
dependencies between variables is inferred using the graphical lasso algorithm
in each block. The performance of the procedure is illustrated on simulated
data. An application to a real gene expression dataset with a limited sample
size is also presented: the dimension reduction allows attention to be
objectively focused on interactions among smaller subsets of genes, leading to
a more parsimonious and interpretable modular network. Comment: Accepted in JAS
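
The block-detection step described in this abstract can be sketched as follows, under stated assumptions: the threshold is passed in directly (the slope heuristic used to select it is not implemented), the function name `covariance_blocks` is hypothetical, and the per-block graphical lasso step is omitted. Variables i and j are linked when the absolute sample covariance exceeds the threshold, and the blocks are the connected components of the resulting graph.

```python
def covariance_blocks(S, threshold):
    """Group variables into blocks by thresholding a covariance matrix.

    S is a symmetric matrix given as a list of lists. Two variables are
    linked if |S[i][j]| > threshold; each connected component is a block.
    """
    p = len(S)
    seen = [False] * p
    blocks = []
    for start in range(p):
        if seen[start]:
            continue
        seen[start] = True
        stack, block = [start], []
        while stack:  # depth-first search over the thresholded graph
            i = stack.pop()
            block.append(i)
            for j in range(p):
                if not seen[j] and abs(S[i][j]) > threshold:
                    seen[j] = True
                    stack.append(j)
        blocks.append(sorted(block))
    return blocks


# Illustrative (made-up) sample covariance with two groups of variables.
S = [
    [1.00, 0.80, 0.05, 0.00],
    [0.80, 1.00, 0.02, 0.00],
    [0.05, 0.02, 1.00, 0.60],
    [0.00, 0.00, 0.60, 1.00],
]
blocks = covariance_blocks(S, threshold=0.1)
```

With a threshold of 0.1, the small cross-group covariances are discarded and the four variables split into two blocks, each of which could then be handed to a separate network-inference step.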
Detection and Feature Selection in Sparse Mixture Models
We consider Gaussian mixture models in high dimensions and concentrate on the
twin tasks of detection and feature selection. Under sparsity assumptions on
the difference in means, we derive information bounds and establish the
performance of various procedures, including the top sparse eigenvalue of the
sample covariance matrix and other projection tests based on moments, such as
the skewness and kurtosis tests of Malkovich and Afifi (1973), and other
variants which we were better able to control under the null. Comment: 70 pages