298,733 research outputs found
CUR Decompositions, Similarity Matrices, and Subspace Clustering
A general framework for solving the subspace clustering problem using the CUR
decomposition is presented. The CUR decomposition provides a natural way to
construct similarity matrices for data that come from a union of unknown
subspaces . The similarity
matrices thus constructed give the exact clustering in the noise-free case.
Additionally, this decomposition gives rise to many distinct similarity
matrices from a given set of data, which allow enough flexibility to perform
accurate clustering of noisy data. We also show that two known methods for
subspace clustering can be derived from the CUR decomposition. An algorithm
based on the theoretical construction of similarity matrices is presented, and
experiments on synthetic and real data are presented to test the method.
Additionally, an adaptation of our CUR based similarity matrices is utilized
to provide a heuristic algorithm for subspace clustering; this algorithm yields
the best overall performance to date for clustering the Hopkins155 motion
segmentation dataset.Comment: Approximately 30 pages. Current version contains improved algorithm
and numerical experiments from the previous versio
A Short Survey on Data Clustering Algorithms
With rapidly increasing data, clustering algorithms are important tools for
data analytics in modern research. They have been successfully applied to a
wide range of domains; for instance, bioinformatics, speech recognition, and
financial analysis. Formally speaking, given a set of data instances, a
clustering algorithm is expected to divide the set of data instances into the
subsets which maximize the intra-subset similarity and inter-subset
dissimilarity, where a similarity measure is defined beforehand. In this work,
the state-of-the-arts clustering algorithms are reviewed from design concept to
methodology; Different clustering paradigms are discussed. Advanced clustering
algorithms are also discussed. After that, the existing clustering evaluation
metrics are reviewed. A summary with future insights is provided at the end
Twin Learning for Similarity and Clustering: A Unified Kernel Approach
Many similarity-based clustering methods work in two separate steps including
similarity matrix computation and subsequent spectral clustering. However,
similarity measurement is challenging because it is usually impacted by many
factors, e.g., the choice of similarity metric, neighborhood size, scale of
data, noise and outliers. Thus the learned similarity matrix is often not
suitable, let alone optimal, for the subsequent clustering. In addition,
nonlinear similarity often exists in many real world data which, however, has
not been effectively considered by most existing methods. To tackle these two
challenges, we propose a model to simultaneously learn cluster indicator matrix
and similarity information in kernel spaces in a principled way. We show
theoretical relationships to kernel k-means, k-means, and spectral clustering
methods. Then, to address the practical issue of how to select the most
suitable kernel for a particular clustering task, we further extend our model
with a multiple kernel learning ability. With this joint model, we can
automatically accomplish three subtasks of finding the best cluster indicator
matrix, the most accurate similarity relations and the optimal combination of
multiple kernels. By leveraging the interactions between these three subtasks
in a joint framework, each subtask can be iteratively boosted by using the
results of the others towards an overall optimal solution. Extensive
experiments are performed to demonstrate the effectiveness of our method.Comment: Published in AAAI 201
Similarity Measures for Clustering SNP and Epidemiological Data
The issue of suitable similarity measures for a joint consideration of so called SNP data and epidemiological variables arises from the GENICA (Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany) casecontrol study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. A single nucleotide polymorphism is a point mutation that is present in at least 1 % of a population. SNPs are the most common form of human genetic variations. In particular, we consider 43 SNP loci in genes involved in the metabolism of hormones, xenobiotics and drugs as well as in the repair of DNA. Assuming that these single nucleotide changes may lead, for instance, to altered enzymes or to a reduced or enhanced amount of the original enzymes – with each alteration alone having minor effects – the aim is to detect combinations of SNPs that under certain environmental conditions increase the risk of sporadic breast cancer. The search for patterns in the present data set may be performed by a variety of clustering and classification approaches. I consider here the problem of suitable measures of proximity of two variables or subjects as an indispensable basis for a further cluster analysis. In the present data situation these measures have to be able to handle different numbers and meaning of categories of nominal scaled data as well as data of different scales. Generally, clustering approaches are a useful tool to detect structures and to generate hypothesis about potential relationships in complex data situations. Searching for patterns in the data there are two possible objectives: the identification of groups of similar objects or subjects or the identification of groups of similar variables within the whole or within subpopulations. The different objectives imply different requirements on the measures of similarity. Comparing the individual genetic profiles as well as comparing the genetic information across subpopulations I discuss possible choices of similarity measures suitable for genetic and epidemiological data, in particular, measures based on the ÷2-statistic, Flexible Matching Coefficients and combinations of similarity measures. --GENICA,single nucleotide polymorphism (SNP),sporadic breast cancer,similarity,cluster analysis,Flexible Matching Coefficient
- …